[libvirt test] 167809: regressions - FAIL

2022-01-24 Thread osstest service owner
flight 167809 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167809/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-arm64-libvirt   6 libvirt-build   fail REGR. vs. 151777
 build-i386-libvirt    6 libvirt-build   fail REGR. vs. 151777
 build-armhf-libvirt   6 libvirt-build   fail REGR. vs. 151777
 build-amd64-libvirt   6 libvirt-build   fail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a

version targeted for testing:
 libvirt  fa6e29f9789f16fb3262b3d4de5b90b705f603fb
baseline version:
 libvirt  2c846fa6bcc11929c9fb857a22430fb9945654ad

 Last test of basis   151777  2020-07-10 04:19:19 Z  564 days
 Failing since        151818  2020-07-11 04:18:52 Z  563 days  545 attempts
 Testing same since   167809  2022-01-25 04:20:14 Z    0 days    1 attempts


People who touched revisions under test:
  Adolfo Jayme Barrientos 
  Aleksandr Alekseev 
  Aleksei Zakharov 
  Andika Triwidada 
  Andrea Bolognani 
  Ani Sinha 
  Balázs Meskó 
  Barrett Schonefeld 
  Bastian Germann 
  Bastien Orivel 
  BiaoXiang Ye 
  Bihong Yu 
  Binfeng Wu 
  Bjoern Walk 
  Boris Fiuczynski 
  Brian Turek 
  Bruno Haible 
  Chris Mayo 
  Christian Borntraeger 
  Christian Ehrhardt 
  Christian Kirbach 
  Christian Schoenebeck 
  Cole Robinson 
  Collin Walling 
  Cornelia Huck 
  Cédric Bosdonnat 
  Côme Borsoi 
  Daniel Henrique Barboza 
  Daniel Letai 
  Daniel P. Berrange 
  Daniel P. Berrangé 
  Didik Supriadi 
  dinglimin 
  Divya Garg 
  Dmitrii Shcherbakov 
  Dmytro Linkin 
  Eiichi Tsukata 
  Eric Farman 
  Erik Skultety 
  Fabian Affolter 
  Fabian Freyer 
  Fabiano Fidêncio 
  Fangge Jin 
  Farhan Ali 
  Fedora Weblate Translation 
  Franck Ridel 
  Gavi Teitz 
  gongwei 
  Guoyi Tu
  Göran Uddeborg 
  Halil Pasic 
  Han Han 
  Hao Wang 
  Hela Basa 
  Helmut Grohne 
  Hiroki Narukawa 
  Hyman Huang(黄勇) 
  Ian Wienand 
  Ioanna Alifieraki 
  Ivan Teterevkov 
  Jakob Meng 
  Jamie Strandboge 
  Jamie Strandboge 
  Jan Kuparinen 
  jason lee 
  Jean-Baptiste Holcroft 
  Jia Zhou 
  Jianan Gao 
  Jim Fehlig 
  Jin Yan 
  Jinsheng Zhang 
  Jiri Denemark 
  Joachim Falk 
  John Ferlan 
  Jonathan Watt 
  Jonathon Jongsma 
  Julio Faracco 
  Justin Gatzen 
  Ján Tomko 
  Kashyap Chamarthy 
  Kevin Locke 
  Koichi Murase 
  Kristina Hanicova 
  Laine Stump 
  Laszlo Ersek 
  Lee Yarwood 
  Lei Yang 
  Liao Pingfang 
  Lin Ma 
  Lin Ma 
  Lin Ma 
  Liu Yiding 
  Luke Yue 
  Luyao Zhong 
  Marc Hartmayer 
  Marc-André Lureau 
  Marek Marczykowski-Górecki 
  Markus Schade 
  Martin Kletzander 
  Masayoshi Mizuma 
  Matej Cepl 
  Matt Coleman 
  Matt Coleman 
  Mauro Matteo Cascella 
  Meina Li 
  Michal Privoznik 
  Michał Smyk 
  Milo Casagrande 
  Moshe Levi 
  Muha Aliss 
  Nathan 
  Neal Gompa 
  Nick Chevsky 
  Nick Shyrokovskiy 
  Nickys Music Group 
  Nico Pache 
  Nikolay Shirokovskiy 
  Olaf Hering 
  Olesya Gerasimenko 
  Or Ozeri 
  Orion Poplawski 
  Pany 
  Patrick Magauran 
  Paulo de Rezende Pinatti 
  Pavel Hrdina 
  Peng Liang 
  Peter Krempa 
  Pino Toscano 
  Pino Toscano 
  Piotr Drąg 
  Prathamesh Chavan 
  Praveen K Paladugu 
  Richard W.M. Jones 
  Ricky Tigg 
  Robin Lee 
  Rohit Kumar 
  Roman Bogorodskiy 
  Roman Bolshakov 
  Ryan Gahagan 
  Ryan Schmidt 
  Sam Hartman 
  Scott Shambarger 
  Sebastian Mitterle 
  SeongHyun Jo 
  Shalini Chellathurai Saroja 
  Shaojun Yang 
  shenjiatong 
  Shi Lei 
  simmon 
  Simon Chopin 
  Simon Gaiser 
  Simon Rowe 
  Stefan Bader 
  Stefan Berger 
  

Re: [XEN PATCH v2 2/5] xen: export get_free_port

2022-01-24 Thread Stefano Stabellini
On Sun, 23 Jan 2022, Julien Grall wrote:
> > diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> > index da88ad141a..5b0bcaaad4 100644
> > --- a/xen/common/event_channel.c
> > +++ b/xen/common/event_channel.c
> > @@ -232,7 +232,7 @@ int evtchn_allocate_port(struct domain *d, evtchn_port_t
> > port)
> >   return 0;
> >   }
> >   -static int get_free_port(struct domain *d)
> > +int get_free_port(struct domain *d)
> 
> I dislike the idea of exposing get_free_port() (or whichever name we decide)
> because this can be easily misused.
> 
> In fact, looking at your next patch (#3), you are misusing it, as it is meant
> to be called with d->event_lock held. I know this doesn't matter much
> in your situation because this is done at boot with no other domains running
> (or potentially any event channel allocation). However, I still think we
> should get the API right.
> 
> I am also not entirely happy of open-coding the allocation in domain_build.c.
> Instead, I would prefer if we provide a new helper to allocate an unbound
> event channel. This would be similar to your v1 (I still need to review the
> patch though).

I am happy to go back to v1 and address feedback on that patch. However,
I am having difficulties with the implementation. Jan pointed out:


> > -
> > -chn->state = ECS_UNBOUND;
> 
> This cannot be pulled ahead of the XSM check (or in general anything
> potentially resulting in an error), as check_free_port() relies on
> ->state remaining ECS_FREE until it is known that the calling function
> can't fail anymore.

This makes it difficult to reuse _evtchn_alloc_unbound for the
implementation of evtchn_alloc_unbound. In fact, I couldn't find a way
to do it.

Instead, I just created a new public function called
"evtchn_alloc_unbound" and renamed the existing function to
"_evtchn_alloc_unbound" (this addresses Jan's feedback that the
static function should be the one starting with "_"). So the function
names are inverted compared to v1.

Please let me know if you have any better suggestions.


diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index da88ad141a..c6b7dd7fbd 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -284,7 +285,27 @@ void evtchn_free(struct domain *d, struct evtchn *chn)
 xsm_evtchn_close_post(chn);
 }
 
-static int evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc)
+struct evtchn *evtchn_alloc_unbound(struct domain *d, domid_t remote_dom)
+{
+struct evtchn *chn;
+int port;
+
+if ( (port = get_free_port(d)) < 0 )
+return ERR_PTR(port);
+chn = evtchn_from_port(d, port);
+
+evtchn_write_lock(chn);
+
+chn->state = ECS_UNBOUND;
+chn->u.unbound.remote_domid = remote_dom;
+evtchn_port_init(d, chn);
+
+evtchn_write_unlock(chn);
+
+return chn;
+}
+
+static int _evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc)
 {
 struct evtchn *chn;
 struct domain *d;
@@ -1195,7 +1216,7 @@ long do_event_channel_op(int cmd, 
XEN_GUEST_HANDLE_PARAM(void) arg)
 struct evtchn_alloc_unbound alloc_unbound;
 if ( copy_from_guest(&alloc_unbound, arg, 1) != 0 )
 return -EFAULT;
-rc = evtchn_alloc_unbound(&alloc_unbound);
+rc = _evtchn_alloc_unbound(&alloc_unbound);
 if ( !rc && __copy_to_guest(arg, &alloc_unbound, 1) )
 rc = -EFAULT; /* Cleaning up here would be a mess! */
 break;
diff --git a/xen/include/xen/event.h b/xen/include/xen/event.h
index 21c95e14fd..85dcf1d0c4 100644
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -68,6 +68,9 @@ int evtchn_close(struct domain *d1, int port1, bool guest);
 /* Free an event channel. */
 void evtchn_free(struct domain *d, struct evtchn *chn);
 
+/* Create a new event channel port */
+struct evtchn *evtchn_alloc_unbound(struct domain *d, domid_t remote_dom);
+
 /* Allocate a specific event channel port. */
 int evtchn_allocate_port(struct domain *d, unsigned int port);
 



Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver

2022-01-24 Thread Stefano Stabellini
On Mon, 24 Jan 2022, Julien Grall wrote:
> On 24/01/2022 19:06, Stefano Stabellini wrote:
> > It looks like XEN_DOMCTL_host_node_by_path and
> > XEN_DOMCTL_find_host_compatible_node would also solve the problem but I
> > think that a single hypercall that retrieves the entire host DTB would
> > be easier to implement
> 
> DOMCTL should only be used to handle per-domain information. If we want
> this, we should create a new sub-hypercall of either
> __HYPERVISOR_platform_op or __HYPERVISOR_sysctl_op (not sure which one).
> 
> AFAICT, both are versioned.
> 
> > and more robust in the long term.
> > hypfs has the advantage that it would create an interface more similar
> > to the one people are already used to on Linux systems
> > (/proc/device-tree). xl/libxl would have to scan the whole hypfs tree,
> > which intuitively I think would be slower.
> 
> Even if you have the binary blob, you would still have to scan the
> device-tree. That said, it is probably going to be a bit faster
> because you make fewer hypercalls.
> 
> However, this is a trade-off between memory use and speed. If you want
> speed, then you may have to transfer up to 2MB every time. So the question
> is: do we care more about speed or memory usage?
> 
> > Also the feature might be
> > harder to implement but I am not sure.
> > 
> > I don't have a strong preference and this is not a stable interface (we
> > don't have to be extra paranoid about forward and backward
> > compatibility). So I am fine either way. Let's see what the others think
> > as well.
> 
> My preference would be to use hypfs as this is cleaner than exposing a blob.

That's also fine by me. Probably the hypfs implementation shouldn't be
much more difficult than something like
XEN_DOMCTL_host_node_by_path/XEN_DOMCTL_find_host_compatible_node.


> However, are we sure we can simply copy the content of the host Device-Tree to
> the guest Device-Tree for SCMI? For instance, I know that for device
> passthrough there are some properties that need to be altered for some
> devices. Hence why they are not present. I vaguely recall having written a
> PoC, but I am not sure if it was posted on the ML.

The SCMI node cannot be copied "as is" from host to guest. It needs a
couple of changes but they seem feasible as they are limited to the
channels exposed to the guest. (The generic device passthrough case is a
lot more difficult.)



Re: [PATCH 11/19] rnbd-srv: remove struct rnbd_dev_blk_io

2022-01-24 Thread Jinpu Wang
On Mon, Jan 24, 2022 at 10:11 AM Christoph Hellwig  wrote:
>
> Only the priv field of rnbd_dev_blk_io is used, so store the value of
> that in bio->bi_private directly and remove the entire bio_set overhead.
>
> Signed-off-by: Christoph Hellwig 
Reviewed-by: Jack Wang 
Thanks!
> ---
>  drivers/block/rnbd/rnbd-srv-dev.c |  4 +---
>  drivers/block/rnbd/rnbd-srv-dev.h | 13 ++---
>  drivers/block/rnbd/rnbd-srv.c | 27 ---
>  drivers/block/rnbd/rnbd-srv.h |  1 -
>  4 files changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/block/rnbd/rnbd-srv-dev.c 
> b/drivers/block/rnbd/rnbd-srv-dev.c
> index 98d3e591a0885..c5d0a03911659 100644
> --- a/drivers/block/rnbd/rnbd-srv-dev.c
> +++ b/drivers/block/rnbd/rnbd-srv-dev.c
> @@ -12,8 +12,7 @@
>  #include "rnbd-srv-dev.h"
>  #include "rnbd-log.h"
>
> -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
> -  struct bio_set *bs)
> +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags)
>  {
> struct rnbd_dev *dev;
> int ret;
> @@ -30,7 +29,6 @@ struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t 
> flags,
>
> dev->blk_open_flags = flags;
> bdevname(dev->bdev, dev->name);
> -   dev->ibd_bio_set = bs;
>
> return dev;
>
> diff --git a/drivers/block/rnbd/rnbd-srv-dev.h 
> b/drivers/block/rnbd/rnbd-srv-dev.h
> index 1a14ece0be726..2c3df02b5e8ec 100644
> --- a/drivers/block/rnbd/rnbd-srv-dev.h
> +++ b/drivers/block/rnbd/rnbd-srv-dev.h
> @@ -14,25 +14,16 @@
>
>  struct rnbd_dev {
> struct block_device *bdev;
> -   struct bio_set  *ibd_bio_set;
> fmode_t blk_open_flags;
> charname[BDEVNAME_SIZE];
>  };
>
> -struct rnbd_dev_blk_io {
> -   struct rnbd_dev *dev;
> -   void *priv;
> -   /* have to be last member for front_pad usage of bioset_init */
> -   struct bio  bio;
> -};
> -
>  /**
>   * rnbd_dev_open() - Open a device
> + * @path:  path to open
>   * @flags: open flags
> - * @bs:bio_set to use during block io,
>   */
> -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
> -  struct bio_set *bs);
> +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags);
>
>  /**
>   * rnbd_dev_close() - Close a device
> diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
> index 6d228af1dcc35..ff9b389976078 100644
> --- a/drivers/block/rnbd/rnbd-srv.c
> +++ b/drivers/block/rnbd/rnbd-srv.c
> @@ -116,9 +116,7 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session 
> *srv_sess)
>
>  static void rnbd_dev_bi_end_io(struct bio *bio)
>  {
> -   struct rnbd_dev_blk_io *io = bio->bi_private;
> -
> -   rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status));
> +   rnbd_endio(bio->bi_private, blk_status_to_errno(bio->bi_status));
> bio_put(bio);
>  }
>
> @@ -131,7 +129,6 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
> struct rnbd_srv_sess_dev *sess_dev;
> u32 dev_id;
> int err;
> -   struct rnbd_dev_blk_io *io;
> struct bio *bio;
> short prio;
>
> @@ -152,7 +149,7 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
> priv->sess_dev = sess_dev;
> priv->id = id;
>
> -   bio = bio_alloc_bioset(GFP_KERNEL, 1, 
> sess_dev->rnbd_dev->ibd_bio_set);
> +   bio = bio_alloc(GFP_KERNEL, 1);
> if (bio_add_page(bio, virt_to_page(data), datalen,
> offset_in_page(data)) != datalen) {
> rnbd_srv_err(sess_dev, "Failed to map data to bio\n");
> @@ -160,12 +157,8 @@ static int process_rdma(struct rnbd_srv_session 
> *srv_sess,
> goto bio_put;
> }
>
> -   io = container_of(bio, struct rnbd_dev_blk_io, bio);
> -   io->dev = sess_dev->rnbd_dev;
> -   io->priv = priv;
> -
> bio->bi_end_io = rnbd_dev_bi_end_io;
> -   bio->bi_private = io;
> +   bio->bi_private = priv;
> bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw));
> bio->bi_iter.bi_sector = le64_to_cpu(msg->sector);
> bio->bi_iter.bi_size = le32_to_cpu(msg->bi_size);
> @@ -260,7 +253,6 @@ static void destroy_sess(struct rnbd_srv_session 
> *srv_sess)
>
>  out:
> xa_destroy(&srv_sess->index_idr);
> -   bioset_exit(&srv_sess->sess_bio_set);
>
> pr_info("RTRS Session %s disconnected\n", srv_sess->sessname);
>
> @@ -289,16 +281,6 @@ static int create_sess(struct rtrs_srv_sess *rtrs)
> return -ENOMEM;
>
> srv_sess->queue_depth = rtrs_srv_get_queue_depth(rtrs);
> -   err = bioset_init(&srv_sess->sess_bio_set, srv_sess->queue_depth,
> - offsetof(struct rnbd_dev_blk_io, bio),
> - BIOSET_NEED_BVECS);
> -   if (err) {
> -   pr_err("Allocating srv_session for path %s failed\n",
> -  

[xen-unstable test] 167806: tolerable FAIL - PUSHED

2022-01-24 Thread osstest service owner
flight 167806 xen-unstable real [real]
flight 167807 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/167806/
http://logs.test-lab.xenproject.org/osstest/logs/167807/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemuu-debianhvm-amd64 20 guest-start/debianhvm.repeat fail 
pass in 167807-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail  like 167790
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 167802
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 167802
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 167802
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 167802
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 167802
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 167802
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 167802
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 167802
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 167802
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 167802
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 167802
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 167802
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 

Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver

2022-01-24 Thread Julien Grall

Hi,

On 24/01/2022 19:06, Stefano Stabellini wrote:

It looks like XEN_DOMCTL_host_node_by_path and
XEN_DOMCTL_find_host_compatible_node would also solve the problem but I
think that a single hypercall that retrieves the entire host DTB would
be easier to implement


DOMCTL should only be used to handle per-domain information. If we want 
this, we should create a new sub-hypercall of either __HYPERVISOR_platform_op 
or __HYPERVISOR_sysctl_op (not sure which one).


AFAICT, both are versioned.


and more robust in the long term.
hypfs has the advantage that it would create an interface more similar
to the one people are already used to on Linux systems
(/proc/device-tree). xl/libxl would have to scan the whole hypfs tree,
which intuitively I think would be slower.


Even if you have the binary blob, you would still have to scan the 
device-tree. That said, it is probably going to be a bit faster 
because you make fewer hypercalls.


However, this is a trade-off between memory use and speed. If you want 
speed, then you may have to transfer up to 2MB every time. So the 
question is: do we care more about speed or memory usage?



Also the feature might be
harder to implement but I am not sure.

I don't have a strong preference and this is not a stable interface (we
don't have to be extra paranoid about forward and backward
compatibility). So I am fine either way. Let's see what the others think
as well.


My preference would be to use hypfs as this is cleaner than exposing a blob.

However, are we sure we can simply copy the content of the host 
Device-Tree to the guest Device-Tree for SCMI? For instance, I know that 
for device passthrough there are some properties that need to be altered 
for some devices. Hence why they are not present. I vaguely recall 
having written a PoC, but I am not sure if it was posted on the ML.


Cheers,

--
Julien Grall



Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver

2022-01-24 Thread Stefano Stabellini
On Mon, 24 Jan 2022, Oleksii Moisieiev wrote:
> On Fri, Jan 21, 2022 at 12:49:55PM -0800, Stefano Stabellini wrote:
> > On Fri, 21 Jan 2022, Oleksii Moisieiev wrote:
> > > On Thu, Jan 20, 2022 at 02:29:41PM -0800, Stefano Stabellini wrote:
> > > > On Thu, 20 Jan 2022, Oleksii Moisieiev wrote:
> > > > > On Wed, Jan 19, 2022 at 05:28:21PM -0800, Stefano Stabellini wrote:
> > > > > > On Wed, 19 Jan 2022, Oleksii Moisieiev wrote:
> > > > > > > On Wed, Dec 22, 2021 at 06:23:24PM -0800, Stefano Stabellini 
> > > > > > > wrote:
> > > > > > > > On Wed, 22 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > On Tue, Dec 21, 2021 at 01:22:50PM -0800, Stefano Stabellini 
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, 21 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > Hi Stefano,
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 20, 2021 at 04:52:01PM -0800, Stefano 
> > > > > > > > > > > Stabellini wrote:
> > > > > > > > > > > > On Mon, 20 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > Hi Stefano,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Dec 17, 2021 at 06:14:55PM -0800, Stefano 
> > > > > > > > > > > > > Stabellini wrote:
> > > > > > > > > > > > > > On Tue, 14 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > > > This is the implementation of SCI interface, 
> > > > > > > > > > > > > > > called SCMI-SMC driver,
> > > > > > > > > > > > > > > which works as the mediator between XEN Domains 
> > > > > > > > > > > > > > > and Firmware (SCP, ATF etc).
> > > > > > > > > > > > > > > This allows devices from the Domains to work with 
> > > > > > > > > > > > > > > clocks, resets and
> > > > > > > > > > > > > > > power-domains without access to CPG.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The following features are implemented:
> > > > > > > > > > > > > > > - request SCMI channels from ATF and pass 
> > > > > > > > > > > > > > > channels to Domains;
> > > > > > > > > > > > > > > - set device permissions for Domains based on the 
> > > > > > > > > > > > > > > Domain partial
> > > > > > > > > > > > > > > device-tree. Devices with permissions are able to 
> > > > > > > > > > > > > > > work with clocks,
> > > > > > > > > > > > > > > resets and power-domains via SCMI;
> > > > > > > > > > > > > > > - redirect scmi messages from Domains to ATF.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Oleksii Moisieiev 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  xen/arch/arm/Kconfig  |   2 +
> > > > > > > > > > > > > > >  xen/arch/arm/sci/Kconfig  |  10 +
> > > > > > > > > > > > > > >  xen/arch/arm/sci/Makefile |   1 +
> > > > > > > > > > > > > > >  xen/arch/arm/sci/scmi_smc.c   | 795 
> > > > > > > > > > > > > > > ++
> > > > > > > > > > > > > > >  xen/include/public/arch-arm.h |   1 +
> > > > > > > > > > > > > > >  5 files changed, 809 insertions(+)
> > > > > > > > > > > > > > >  create mode 100644 xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > >  create mode 100644 xen/arch/arm/sci/scmi_smc.c
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/xen/arch/arm/Kconfig 
> > > > > > > > > > > > > > > b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > index 186e1db389..02d96c6cfc 100644
> > > > > > > > > > > > > > > --- a/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > +++ b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > @@ -114,6 +114,8 @@ config SCI
> > > > > > > > > > > > > > > support. It allows guests to control system 
> > > > > > > > > > > > > > > resources via one of
> > > > > > > > > > > > > > > SCI mediators implemented in XEN.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +source "arch/arm/sci/Kconfig"
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  endmenu
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  menu "ARM errata workaround via the alternative 
> > > > > > > > > > > > > > > framework"
> > > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Kconfig 
> > > > > > > > > > > > > > > b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > > new file mode 100644
> > > > > > > > > > > > > > > index 00..9563067ddc
> > > > > > > > > > > > > > > --- /dev/null
> > > > > > > > > > > > > > > +++ b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > > @@ -0,0 +1,10 @@
> > > > > > > > > > > > > > > +config SCMI_SMC
> > > > > > > > > > > > > > > + bool "Enable SCMI-SMC mediator driver"
> > > > > > > > > > > > > > > + default n
> > > > > > > > > > > > > > > + depends on SCI
> > > > > > > > > > > > > > > + ---help---
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + Enables mediator in XEN to pass SCMI requests 
> > > > > > > > > > > > > > > from Domains to ATF.
> > > > > > > > > > > > > > > + This feature allows drivers from Domains to 
> > > > > > > > > > > > > > > work with System
> > > > > > > > > > > > > > > + Controllers 

Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Stefano Stabellini
On Mon, 24 Jan 2022, Ayan Kumar Halder wrote:
> Hi Andre,
> 
Thanks for your comments.
> 
> On 24/01/2022 14:36, Andre Przywara wrote:
> > On Mon, 24 Jan 2022 12:07:42 +
> > Ayan Kumar Halder  wrote:
> > 
> > Hi Ayan,
> > 
> > > Many thanks for your feedback. I have one clarification :-
> > > 
> > > On 22/01/2022 01:30, Andre Przywara wrote:
> > > > On Thu, 20 Jan 2022 21:55:27 +
> > > > Ayan Kumar Halder  wrote:
> > > > 
> > > > Hi,
> > > >   
> > > > > At the moment, Xen is only handling data abort with valid syndrome
> > > > > (i.e.
> > > > > ISV=0). Unfortunately, this doesn't cover all the instructions a
> > > > > domain
> > > > > could use to access MMIO regions.
> > > > > 
> > > > > For instance, a baremetal OS can use any of the following
> > > > > instructions, where
> > > > > x1 contains the address of the MMIO region:
> > > > > 
> > > > > 1.  ldr x2,[x1],#4
> > > > That looks dodgy, since it misaligns the pointer afterwards. MMIO
> > > > access typically go to device memory, which must be naturally aligned.
> > > > Just don't give a bad example here and change that to a multiple of 8.
> > > >   
> > > > > 2.  ldr w2,[x1],#-4
> > > > (this one is fine, btw, because it's a 32-bit read)
> > > >   
> > > > > 3.  ldr x2,[x1],#-8
> > > > > 4.  ldr w2,[x1],#4
> > > > > 5.  ldrh    w2,[x1],#8
> > > > > 6.  ldrb    w2,[x1],#16
> > > > More naturally I'd use the data size of the postindex value ...
> > > > ldr  x2 ... #-8
> > > > ldr  w2 ... #4
> > > > ldrh w2 ... #2
> > > > ldrb w2 ... #1
> > > >   
> > > > > 7.  str x2,[x1],#4
> > > > This is a again problematic, because x1 is not 8-byte aligned anymore
> > > > after that.
> > > >   
> > > > > 8.  str w2,[x1],#-4
> > > > > 9.  strh    w2,[x1],#8
> > > > > 10. strb    w2,[x1],#16
> > > > > 
> > > > > In the following two instructions, sp contains the address of the MMIO
> > > > > region:-
> > > > Really? I don't think you should give people funny ideas, just mention
> > > > that the Rn register could theoretically be the stack pointer.
> > > >   
> > > > > 11. ldrb    w2,[sp],#16
> > > > > 12. ldrb    wzr,[sp],#16
> > > > > 
> > > > > In order to handle post-indexing store/load instructions (like those
> > > > > mentioned
> > > > > above), Xen will need to fetch and decode the instruction.
> > > > > 
> > > > > This patch only cover post-index store/load instructions from AArch64
> > > > > mode.
> > > > > For now, this is left unimplemented for trap from AArch32 mode.
> > > > > 
> > > > > Signed-off-by: Ayan Kumar Halder 
> > > > > ---
> > > > > 
> > > > > Changelog :-
> > > > > v2 - 1. Updated the rn register after reading from it. (Pointed by
> > > > > Julien,
> > > > >   Stefano)
> > > > >2. Used a union to represent the instruction opcode (Suggestd
> > > > > by Bertrand)
> > > > >3. Fixed coding style issues (Pointed by Julien)
> > > > >4. In the previous patch, I was updating dabt->sign based on
> > > > > the signedness
> > > > >   of imm9. This was incorrect. As mentioned in ARMv8 ARM  DDI
> > > > > 0487G.b,
> > > > >   Page 3221, SSE indicates the signedness of the data item
> > > > > loaded. In our
> > > > >   case, the data item loaded is always unsigned.
> > > > > 
> > > > > v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit
> > > > > variants).
> > > > >  Thus, I have removed the check for "instr->code.opc == 0"
> > > > > (Suggested by
> > > > >  Andre)
> > > > >   2. Handled the scenario when rn = SP, rt = XZR (Suggested by
> > > > > Jan, Andre)
> > > > >   3. Added restriction for "rt != rn" (Suggested by Andre)
> > > > >   4. Moved union ldr_str_instr_class {} to decode.h. This is the
> > > > > header included
> > > > >  by io.c and decode.c (where the union is referred).
> > > > > (Suggested by Jan)
> > > > >   5. Indentation and typo fixes (Suggested by Jan)
> > > > > 
> > > > > Changes suggested but could not be considered due to reasons :-
> > > > >   1. Using accessor macros instead of bitfields for
> > > > > "ldr_str_instr_class". (Andre)
> > > > >  Reason - I could not find a simple way to represent 9 bit
> > > > > signed integer
> > > > >  (ie imm9) without using bitfields. If I use accessor macros,
> > > > > then I need
> > > > >  to manually calculate two's complement to obtain the value
> > > > > when signed
> > > > >  bit is present.
> > > > > 
> > > > >   2. I/D cache coherency (Andre)
> > > > >  Reason :- I could not see any instruction to flush the I
> > > > > cache.
> > > > First, please try to avoid the term "flush", because it is somewhat
> > > > overloaded. The architecture speaks of "clean" and "invalidate", which
> > > > are more precise.
> > > > Assuming you mean "clean" here: 

Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver

2022-01-24 Thread Oleksii Moisieiev
On Fri, Jan 21, 2022 at 12:49:55PM -0800, Stefano Stabellini wrote:
> On Fri, 21 Jan 2022, Oleksii Moisieiev wrote:
> > On Thu, Jan 20, 2022 at 02:29:41PM -0800, Stefano Stabellini wrote:
> > > On Thu, 20 Jan 2022, Oleksii Moisieiev wrote:
> > > > On Wed, Jan 19, 2022 at 05:28:21PM -0800, Stefano Stabellini wrote:
> > > > > On Wed, 19 Jan 2022, Oleksii Moisieiev wrote:
> > > > > > On Wed, Dec 22, 2021 at 06:23:24PM -0800, Stefano Stabellini wrote:
> > > > > > > On Wed, 22 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > On Tue, Dec 21, 2021 at 01:22:50PM -0800, Stefano Stabellini 
> > > > > > > > wrote:
> > > > > > > > > On Tue, 21 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > Hi Stefano,
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 20, 2021 at 04:52:01PM -0800, Stefano 
> > > > > > > > > > Stabellini wrote:
> > > > > > > > > > > On Mon, 20 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > Hi Stefano,
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Dec 17, 2021 at 06:14:55PM -0800, Stefano 
> > > > > > > > > > > > Stabellini wrote:
> > > > > > > > > > > > > On Tue, 14 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > > This is the implementation of SCI interface, called 
> > > > > > > > > > > > > > SCMI-SMC driver,
> > > > > > > > > > > > > > which works as the mediator between XEN Domains and 
> > > > > > > > > > > > > > Firmware (SCP, ATF etc).
> > > > > > > > > > > > > > This allows devices from the Domains to work with 
> > > > > > > > > > > > > > clocks, resets and
> > > > > > > > > > > > > > power-domains without access to CPG.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The following features are implemented:
> > > > > > > > > > > > > > - request SCMI channels from ATF and pass channels 
> > > > > > > > > > > > > > to Domains;
> > > > > > > > > > > > > > - set device permissions for Domains based on the 
> > > > > > > > > > > > > > Domain partial
> > > > > > > > > > > > > > device-tree. Devices with permissions are able to 
> > > > > > > > > > > > > > work with clocks,
> > > > > > > > > > > > > > resets and power-domains via SCMI;
> > > > > > > > > > > > > > - redirect scmi messages from Domains to ATF.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Oleksii Moisieiev 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  xen/arch/arm/Kconfig  |   2 +
> > > > > > > > > > > > > >  xen/arch/arm/sci/Kconfig  |  10 +
> > > > > > > > > > > > > >  xen/arch/arm/sci/Makefile |   1 +
> > > > > > > > > > > > > >  xen/arch/arm/sci/scmi_smc.c   | 795 
> > > > > > > > > > > > > > ++
> > > > > > > > > > > > > >  xen/include/public/arch-arm.h |   1 +
> > > > > > > > > > > > > >  5 files changed, 809 insertions(+)
> > > > > > > > > > > > > >  create mode 100644 xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > >  create mode 100644 xen/arch/arm/sci/scmi_smc.c
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/Kconfig 
> > > > > > > > > > > > > > b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > index 186e1db389..02d96c6cfc 100644
> > > > > > > > > > > > > > --- a/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > +++ b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > @@ -114,6 +114,8 @@ config SCI
> > > > > > > > > > > > > >   support. It allows guests to control system 
> > > > > > > > > > > > > > resourcess via one of
> > > > > > > > > > > > > >   SCI mediators implemented in XEN.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +source "arch/arm/sci/Kconfig"
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  endmenu
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  menu "ARM errata workaround via the alternative 
> > > > > > > > > > > > > > framework"
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Kconfig 
> > > > > > > > > > > > > > b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > new file mode 100644
> > > > > > > > > > > > > > index 00..9563067ddc
> > > > > > > > > > > > > > --- /dev/null
> > > > > > > > > > > > > > +++ b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > @@ -0,0 +1,10 @@
> > > > > > > > > > > > > > +config SCMI_SMC
> > > > > > > > > > > > > > +   bool "Enable SCMI-SMC mediator driver"
> > > > > > > > > > > > > > +   default n
> > > > > > > > > > > > > > +   depends on SCI
> > > > > > > > > > > > > > +   ---help---
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +   Enables mediator in XEN to pass SCMI requests 
> > > > > > > > > > > > > > from Domains to ATF.
> > > > > > > > > > > > > > +   This feature allows drivers from Domains to 
> > > > > > > > > > > > > > work with System
> > > > > > > > > > > > > > +   Controllers (such as power,resets,clock etc.). 
> > > > > > > > > > > > > > SCP is used as transport
> > > > > > > > > > > > > > +   for communication.
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Makefile 
> > > 

Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Julien Grall

Hi,

On 24/01/2022 17:27, Ayan Kumar Halder wrote:

Thanks for your comments.

On 24/01/2022 14:36, Andre Przywara wrote:

On Mon, 24 Jan 2022 12:07:42 +
Ayan Kumar Halder  wrote:

Hi Ayan,


Many thanks for your feedback. I have one clarification :-

On 22/01/2022 01:30, Andre Przywara wrote:

On Thu, 20 Jan 2022 21:55:27 +
Ayan Kumar Halder  wrote:

Hi,
At the moment, Xen is only handling data aborts with a valid syndrome 
(i.e. ISV=1). Unfortunately, this doesn't cover all the instructions a 
domain

could use to access MMIO regions.

For instance, a baremetal OS can use any of the following 
instructions, where

x1 contains the address of the MMIO region:

1.  ldr x2,    [x1],    #4

That looks dodgy, since it misaligns the pointer afterwards. MMIO
accesses typically go to device memory, which must be naturally aligned.
Just don't give a bad example here and change that to a multiple of 8.

2.  ldr w2,    [x1],    #-4

(this one is fine, btw, because it's a 32-bit read)

3.  ldr x2,    [x1],    #-8
4.  ldr w2,    [x1],    #4
5.  ldrh    w2,    [x1],    #8
6.  ldrb    w2,    [x1],    #16

More naturally I'd use the data size of the postindex value ...
ldr  x2 ... #-8
ldr  w2 ... #4
ldrh w2 ... #2
ldrb w2 ... #1

7.  str x2,    [x1],    #4

This is again problematic, because x1 is not 8-byte aligned anymore
after that.

8.  str w2,    [x1],    #-4
9.  strh    w2,    [x1],    #8
10. strb    w2,    [x1],    #16

In the following two instructions, sp contains the address of the 
MMIO region:-

Really? I don't think you should give people funny ideas, just mention
that the Rn register could theoretically be the stack pointer.

11. ldrb    w2,    [sp],    #16
12. ldrb    wzr,   [sp],    #16

In order to handle post-indexing store/load instructions (like 
those mentioned

above), Xen will need to fetch and decode the instruction.

This patch only covers post-index store/load instructions from 
AArch64 mode.

For now, this is left unimplemented for traps from AArch32 mode.

Signed-off-by: Ayan Kumar Halder 
---

Changelog :-
v2 - 1. Updated the rn register after reading from it. (Pointed by 
Julien,

  Stefano)
   2. Used a union to represent the instruction opcode 
(Suggested by Bertrand)

   3. Fixed coding style issues (Pointed by Julien)
   4. In the previous patch, I was updating dabt->sign based on 
the signedness
  of imm9. This was incorrect. As mentioned in ARMv8 ARM  
DDI 0487G.b,
  Page 3221, SSE indicates the signedness of the data item 
loaded. In our

  case, the data item loaded is always unsigned.

v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit 
variants).
 Thus, I have removed the check for "instr->code.opc == 0" 
(Suggested by

 Andre)
  2. Handled the scenario when rn = SP, rt = XZR (Suggested by 
Jan, Andre)

  3. Added restriction for "rt != rn" (Suggested by Andre)
  4. Moved union ldr_str_instr_class {} to decode.h. This is 
the header included
 by io.c and decode.c (where the union is referred). 
(Suggested by Jan)

  5. Indentation and typo fixes (Suggested by Jan)

Changes suggested but could not be considered due to reasons :-
  1. Using accessor macros instead of bitfields for 
"ldr_str_instr_class". (Andre)
 Reason - I could not find a simple way to represent 9 bit 
signed integer
 (ie imm9) without using bitfields. If I use accessor 
macros, then I need
 to manually calculate two's complement to obtain the value 
when signed

 bit is present.

  2. I/D cache coherency (Andre)
 Reason :- I could not see any instruction to flush the I 
cache.

First, please try to avoid the term "flush", because it is somewhat
overloaded. The architecture speaks of "clean" and "invalidate", which
are more precise.
Assuming you mean "clean" here: conceptually there is no such thing for
the I cache, because it's always clean. The I$ will only be read from
the CPU side - from the instruction fetcher - there is nothing written
back through it. Every store goes through the data path - always.
That is the problem that I tried to sketch for you previously: you don't
have a guarantee that the instruction you read from memory is the same
that the CPU executed. The guest could have changed the instruction
after the I$ fetched that. So the CPU will execute (and trap) on
instruction X, but you will read Y. I leave it up to your imagination
if that could be exploited.

I see what you mean.

Referring to the Armv8 ARM DDI 0487G.b, Page D1-2476, it says that (for instr/data
aborts) the faulting virtual address and IPA are saved in FAR_ELx and
HPFAR_EL2 respectively. But, I do not see that the faulting instruction is
saved in any special register. Is there something I am missing?

No, indeed there is no such thing. You get the address, but not the
faulting instruction. It would indeed be nice to have from a 

Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Julien Grall

Hi Andre,

On 24/01/2022 14:36, Andre Przywara wrote:

On Mon, 24 Jan 2022 12:07:42 +

Also, if an instruction is being modified by the guest (after it has
been loaded in the I cache), and if the guest does not invalidate the I
cache + ISB, then this is a malicious behavior by the guest. Is my
understanding correct ?


I wouldn't say malicious per se; there might be legitimate reasons to do
so, but in the Xen context this is mostly irrelevant, since we don't trust
the guest anyway. So whether it's malicious or accidental, the hypervisor
might be misled.


I agree the hypervisor will be misled into executing the wrong instruction. 
But, in reality, I don't see how this is as massive a problem as this 
thread seems to imply. At best the guest will shoot itself in the foot.


IOW, for now, I think it is fine to assume that the guest will have 
invalidated the instruction cache before executing any instruction that 
may fault with ISV=0. This could be revisited if we have use-cases where 
we really need to know what the guest executed.
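As a side note, the maintenance a well-behaved guest performs after rewriting an instruction can be sketched as below. This is a hypothetical illustration, not Xen or guest code; on AArch64, GCC's `__builtin___clear_cache()` expands to roughly the `DC CVAU; DSB; IC IVAU; DSB; ISB` sequence discussed above.

```c
#include <stdint.h>

/*
 * Hypothetical illustration: after patching an instruction, the change
 * must be made visible to the instruction fetcher before the new code
 * runs.  The store itself only goes through the data path; the cache
 * maintenance cleans the D-cache line to the point of unification and
 * invalidates the stale I-cache line, followed by an ISB.
 */
static void patch_insn(uint32_t *insn, uint32_t new_opcode)
{
    *insn = new_opcode;                 /* store goes via the D-cache */
    __builtin___clear_cache((char *)insn, (char *)(insn + 1));
}
```

Without the maintenance call, the fetcher may keep executing the stale instruction while a data-side read of the same address returns the new bytes, which is exactly the X-versus-Y mismatch described above.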


Cheers,

--
Julien Grall



Re: [PATCH v4 05/11] xen/arm: introduce direct-map for domUs

2022-01-24 Thread Julien Grall

Hi,

On 13/01/2022 22:53, Stefano Stabellini wrote:

+kinfo->mem.nr_banks = nr_banks;
+
+/*
+ * The property 'memory' should match the amount of memory given to
+ * the guest.
+ * Currently, it is only possible to either acquire static memory or
+ * let Xen allocate. *Mixing* is not supported.
+ */
+if ( kinfo->unassigned_mem != 0 )
+{
+printk(XENLOG_ERR
+   "Size of \"memory\" property doesn't match up with the sum-up of 
\"xen,static-mem\". Unsupported configuration.\n");


This line would benefit from being broken down, but I am also OK if we
leave it as is


We usually keep the message on a single line (even if it is more 
than 80 characters) because it helps to find the line afterwards.


Looking at the message, I would drop "Unsupported configuration" because 
it implies that this is because some code is missing (IOW it will be 
supported in the future). However, this is a requirement.


Cheers,

--
Julien Grall



Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Ayan Kumar Halder

Hi Andre,

Thanks for your comments.

On 24/01/2022 14:36, Andre Przywara wrote:

On Mon, 24 Jan 2022 12:07:42 +
Ayan Kumar Halder  wrote:

Hi Ayan,


Many thanks for your feedback. I have one clarification :-

On 22/01/2022 01:30, Andre Przywara wrote:

On Thu, 20 Jan 2022 21:55:27 +
Ayan Kumar Halder  wrote:

Hi,
  

At the moment, Xen is only handling data aborts with a valid syndrome (i.e.
ISV=1). Unfortunately, this doesn't cover all the instructions a domain
could use to access MMIO regions.

For instance, a baremetal OS can use any of the following instructions, where
x1 contains the address of the MMIO region:

1.  ldr x2,[x1],#4

That looks dodgy, since it misaligns the pointer afterwards. MMIO
accesses typically go to device memory, which must be naturally aligned.
Just don't give a bad example here and change that to a multiple of 8.
  

2.  ldr w2,[x1],#-4

(this one is fine, btw, because it's a 32-bit read)
  

3.  ldr x2,[x1],#-8
4.  ldr w2,[x1],#4
5.  ldrhw2,[x1],#8
6.  ldrbw2,[x1],#16

More naturally I'd use the data size of the postindex value ...
ldr  x2 ... #-8
ldr  w2 ... #4
ldrh w2 ... #2
ldrb w2 ... #1
  

7.  str x2,[x1],#4

This is again problematic, because x1 is not 8-byte aligned anymore
after that.
  

8.  str w2,[x1],#-4
9.  strhw2,[x1],#8
10. strbw2,[x1],#16

In the following two instructions, sp contains the address of the MMIO region:-

Really? I don't think you should give people funny ideas, just mention
that the Rn register could theoretically be the stack pointer.
  

11. ldrbw2,[sp],#16
12. ldrbwzr,   [sp],#16

In order to handle post-indexing store/load instructions (like those mentioned
above), Xen will need to fetch and decode the instruction.

This patch only covers post-index store/load instructions from AArch64 mode.
For now, this is left unimplemented for traps from AArch32 mode.

Signed-off-by: Ayan Kumar Halder 
---

Changelog :-
v2 - 1. Updated the rn register after reading from it. (Pointed by Julien,
  Stefano)
   2. Used a union to represent the instruction opcode (Suggestd by 
Bertrand)
   3. Fixed coding style issues (Pointed by Julien)
   4. In the previous patch, I was updating dabt->sign based on the 
signedness
  of imm9. This was incorrect. As mentioned in ARMv8 ARM  DDI 0487G.b,
  Page 3221, SSE indicates the signedness of the data item loaded. In 
our
  case, the data item loaded is always unsigned.

v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants).
 Thus, I have removed the check for "instr->code.opc == 0" (Suggested by
 Andre)
  2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre)
  3. Added restriction for "rt != rn" (Suggested by Andre)
  4. Moved union ldr_str_instr_class {} to decode.h. This is the header 
included
 by io.c and decode.c (where the union is referred). (Suggested by Jan)
  5. Indentation and typo fixes (Suggested by Jan)

Changes suggested but could not be considered due to reasons :-
  1. Using accessor macros instead of bitfields for "ldr_str_instr_class". 
(Andre)
 Reason - I could not find a simple way to represent 9 bit signed 
integer
 (ie imm9) without using bitfields. If I use accessor macros, then I 
need
 to manually calculate two's complement to obtain the value when signed
 bit is present.

  2. I/D cache coherency (Andre)
 Reason :- I could not see any instruction to flush the I cache.
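(An aside on point 1 above: the two's-complement step is small enough that the accessor approach remains feasible. A hypothetical decoder for imm9, assumed here to occupy bits [20:12] of the A64 post-index opcode, could look like this sketch:)

```c
#include <stdint.h>

/*
 * Hypothetical accessor-style decode of the 9-bit signed immediate
 * (imm9, assumed at bits [20:12] of an A64 post-index load/store
 * opcode) without using a C bitfield.  The manual two's-complement
 * step is a single subtraction.
 */
static int32_t decode_imm9(uint32_t opcode)
{
    int32_t imm = (opcode >> 12) & 0x1ff;   /* raw 9-bit field */

    if ( imm & 0x100 )                      /* sign bit set */
        imm -= 0x200;                       /* two's complement */

    return imm;
}
```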

First, please try to avoid the term "flush", because it is somewhat
overloaded. The architecture speaks of "clean" and "invalidate", which
are more precise.
Assuming you mean "clean" here: conceptually there is no such thing for
the I cache, because it's always clean. The I$ will only be read from
the CPU side - from the instruction fetcher - there is nothing written
back through it. Every store goes through the data path - always.
That is the problem that I tried to sketch for you previously: you don't
have a guarantee that the instruction you read from memory is the same
that the CPU executed. The guest could have changed the instruction
after the I$ fetched that. So the CPU will execute (and trap) on
instruction X, but you will read Y. I leave it up to your imagination
if that could be exploited.

I see what you mean.

Referring to the Armv8 ARM DDI 0487G.b, Page D1-2476, it says that (for instr/data
aborts) the faulting virtual address and IPA are saved in FAR_ELx and
HPFAR_EL2 respectively. But, I do not see that the faulting instruction is
saved in any special register. Is there something I am missing?

No, indeed there is no such thing. You get the address, but not the
faulting instruction. It would indeed be nice to have from a software
developer's point of view, but the 

Re: [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map

2022-01-24 Thread Jan Beulich
On 23.09.2021 14:02, Wei Chen wrote:
> We will reuse nodes_cover_memory for Arm to check its bootmem
> info. So we introduce two arch helpers to get the memory map's
> entry count and a specified entry's range:
> arch_get_memory_bank_number
> arch_get_memory_bank_range

I'm sorry, but personally I see no way for you to introduce the term
"memory bank" into x86 code.

> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -378,6 +378,24 @@ unsigned int arch_have_default_dmazone(void)
>  return ( num_online_nodes() > 1 ) ? 1 : 0;
>  }
>  
> +uint32_t __init arch_meminfo_get_nr_bank(void)

unsigned int (also elsewhere)

> +{
> + return e820.nr_map;
> +}
> +
> +int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
> + paddr_t *start, paddr_t *end)
> +{
> + if (e820.map[bank].type != E820_RAM || !start || !end) {

I see no reason for the checking of start and end.

> + return -1;
> + }

No need for braces here.

> + *start = e820.map[bank].addr;
> + *end = e820.map[bank].addr + e820.map[bank].size;
> +
> + return 0;
> +}
> +
>  static void dump_numa(unsigned char key)
>  {
>  s_time_t now = NOW();

Judging by just the two pieces of patch context you're inserting
a Linux-style code fragment in the middle of two Xen-style ones.

Various other comments given for earlier patches apply here as well.
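For illustration, folding the review comments above into the helpers (unsigned int, no redundant start/end checks, no stray braces) might look like the following sketch; the e820 map and paddr_t are mocked up here so the fragment stands alone, and do not match the real Xen definitions.

```c
#include <stdint.h>

/* Stand-ins for Xen types and the real e820 map, for illustration only. */
typedef uint64_t paddr_t;
#define E820_RAM 1

struct e820entry { uint64_t addr; uint64_t size; uint32_t type; };
static struct { unsigned int nr_map; struct e820entry map[2]; } e820 = {
    .nr_map = 2,
    .map = {
        { 0x0,     0x9fc00, E820_RAM },
        { 0xf0000, 0x10000, 2 /* reserved */ },
    },
};

static unsigned int arch_meminfo_get_nr_bank(void)
{
    return e820.nr_map;
}

/* Return 0 and the [start, end) range if the entry is RAM, -1 otherwise. */
static int arch_meminfo_get_ram_bank_range(unsigned int bank,
                                           paddr_t *start, paddr_t *end)
{
    if ( e820.map[bank].type != E820_RAM )
        return -1;

    *start = e820.map[bank].addr;
    *end = e820.map[bank].addr + e820.map[bank].size;

    return 0;
}
```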

Jan




Re: [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init

2022-01-24 Thread Jan Beulich
On 23.09.2021 14:02, Wei Chen wrote:
> There is some code in acpi_numa_memory_affinity_init to update the node
> memory range and the node_memblk_range array. This code is not
> ACPI specific; it can be shared by other NUMA implementations, such as
> a device-tree-based NUMA implementation.
> 
> So in this patch, we abstract this memory range and block handling
> code into a new function. This avoids exporting static variables
> like node_memblk_range. And the PXM in the neutral code's print messages
> has been replaced by NODE, as PXM is ACPI specific.
> 
> Signed-off-by: Wei Chen 

SRAT is an ACPI concept, which I assume has no meaning with DT. Hence
any generically usable logic here wants, I think, separating out into
a file which is not SRAT-specific (peeking ahead, specifically not a
file named "numa_srat.c"). This may in turn require some more though
regarding the proper split between the stuff remaining in srat.c and
the stuff becoming kind of library code. In particular this may mean
moving some of the static variables as well, and with them perhaps
some further functions (while I did peek ahead, I didn't look closely
at the later patch doing the actual movement). And it is then hard to
see why the separation needs to happen in two steps - you could move
the generically usable code to a new file right away.

> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -104,6 +104,14 @@ nodeid_t setup_node(unsigned pxm)
>   return node;
>  }
>  
> +bool __init numa_memblks_available(void)
> +{
> + if (num_node_memblks < NR_NODE_MEMBLKS)
> + return true;
> +
> + return false;
> +}

Please can you avoid expressing things in more complex than necessary
ways? Here I don't see why it can't just be

bool __init numa_memblks_available(void)
{
return num_node_memblks < NR_NODE_MEMBLKS;
}

> @@ -301,69 +309,35 @@ static bool __init is_node_memory_continuous(nodeid_t 
> nid,
>   return true;
>  }
>  
> -/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
> -void __init
> -acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> +/* Neutral NUMA memory affinity init function for ACPI and DT */
> +int __init numa_update_node_memblks(nodeid_t node,
> + paddr_t start, paddr_t size, bool hotplug)

Indentation.

>  {
> - paddr_t start, end;
> - unsigned pxm;
> - nodeid_t node;
> + paddr_t end = start + size;
>   int i;
>  
> - if (srat_disabled())
> - return;
> - if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
> - bad_srat();
> - return;
> - }
> - if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> - return;
> -
> - start = ma->base_address;
> - end = start + ma->length;
> - /* Supplement the heuristics in l1tf_calculations(). */
> - l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
> -
> - if (num_node_memblks >= NR_NODE_MEMBLKS)
> - {
> - dprintk(XENLOG_WARNING,
> -"Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> - bad_srat();
> - return;
> - }
> -
> - pxm = ma->proximity_domain;
> - if (srat_rev < 2)
> - pxm &= 0xff;
> - node = setup_node(pxm);
> - if (node == NUMA_NO_NODE) {
> - bad_srat();
> - return;
> - }
> - /* It is fine to add this area to the nodes data it will be used later*/
> + /* It is fine to add this area to the nodes data it will be used later 
> */
>   i = conflicting_memblks(start, end);
>   if (i < 0)
>   /* everything fine */;
>   else if (memblk_nodeid[i] == node) {
> - bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
> - !test_bit(i, memblk_hotplug);
> + bool mismatch = !hotplug != !test_bit(i, memblk_hotplug);
>  
> - printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with 
> itself (%"PRIpaddr"-%"PRIpaddr")\n",
> -mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
> + printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with 
> itself (%"PRIpaddr"-%"PRIpaddr")\n",

Nit: Unlike PXM, which is an acronym, "node" doesn't want to be all upper
case.

Also did you check that the node <-> PXM association is known to a reader
of a log at this point in time?

> +mismatch ? KERN_ERR : KERN_WARNING, node, start, end,
>  node_memblk_range[i].start, node_memblk_range[i].end);
>   if (mismatch) {
> - bad_srat();
> - return;
> + return -1;
>   }
>   } else {
>   printk(KERN_ERR
> -"SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with 
> PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
> -pxm, start, end, node_to_pxm(memblk_nodeid[i]),
> +

Re: [PATCH v2] xen-mapcache: Avoid entry->lock overflow

2022-01-24 Thread Stefano Stabellini
On Mon, 24 Jan 2022, Ross Lagerwall wrote:
> In some cases, a particular mapcache entry may be mapped 256 times
> causing the lock field to wrap to 0. For example, this may happen when
> using emulated NVME and the guest submits a large scatter-gather write.
> At this point, the entry may be remapped, causing QEMU to write the wrong
> data or crash (since remap is not atomic).
> 
> Avoid this overflow by increasing the lock field to a uint32_t and also
> detect it and abort rather than continuing regardless.
> 
> Signed-off-by: Ross Lagerwall 

Reviewed-by: Stefano Stabellini 


> ---
> Changes in v2: Change type to uint32_t since there is a hole there
> anyway. The struct size remains at 48 bytes on x86_64.
> 
>  hw/i386/xen/xen-mapcache.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
> index bd47c3d672..f2ef977963 100644
> --- a/hw/i386/xen/xen-mapcache.c
> +++ b/hw/i386/xen/xen-mapcache.c
> @@ -52,7 +52,7 @@ typedef struct MapCacheEntry {
>  hwaddr paddr_index;
>  uint8_t *vaddr_base;
>  unsigned long *valid_mapping;
> -uint8_t lock;
> +uint32_t lock;
>  #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
>  uint8_t flags;
>  hwaddr size;
> @@ -355,6 +355,12 @@ tryagain:
>  if (lock) {
>  MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev));
>  entry->lock++;
> +if (entry->lock == 0) {
> +fprintf(stderr,
> +"mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n",
> +entry->paddr_index, entry->vaddr_base);
> +abort();
> +}
>  reventry->dma = dma;
>  reventry->vaddr_req = mapcache->last_entry->vaddr_base + 
> address_offset;
>  reventry->paddr_index = mapcache->last_entry->paddr_index;
> -- 
> 2.27.0
> 
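A minimal demonstration of the overflow being fixed (hypothetical helper, not QEMU code): with the old uint8_t field, the 256th nested map of the same entry wraps the count back to zero, so a still-mapped entry looks unlocked.

```c
#include <stdint.h>

/*
 * Count `maps` nested mappings with the old (8-bit) and new (32-bit)
 * lock fields and return the resulting counts through out-parameters.
 */
static void count_locks(unsigned int maps, uint8_t *old_lock,
                        uint32_t *new_lock)
{
    uint8_t l8 = 0;
    uint32_t l32 = 0;

    for (unsigned int i = 0; i < maps; i++) {
        l8++;   /* wraps to 0 on the 256th increment */
        l32++;  /* widened field: does not wrap */
    }

    *old_lock = l8;
    *new_lock = l32;
}
```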



Re: [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug

2022-01-24 Thread Jan Beulich
On 23.09.2021 14:02, Wei Chen wrote:
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -391,8 +391,8 @@ acpi_numa_memory_affinity_init(const struct 
> acpi_srat_mem_affinity *ma)
>   memblk_nodeid[num_node_memblks] = node;
>   if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>   __set_bit(num_node_memblks, memblk_hotplug);
> - if (end > mem_hotplug)
> - mem_hotplug = end;
> + if (end > mem_hotplug_boundary())
> + mem_hotplug_update_boundary(end);

Can the if() please be folded into mem_hotplug_update_boundary(),
eliminating (at least for the purpose here) the need for the
separate mem_hotplug_boundary()? As said on the previous patch,
I think the two want folding.
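A sketch of the fold being suggested (the paddr_t and variable are stand-ins, not the actual Xen header): the comparison moves inside the update helper, so this call site no longer needs mem_hotplug_boundary().

```c
#include <stdint.h>

typedef uint64_t paddr_t;          /* stand-in for Xen's paddr_t */
static paddr_t mem_hotplug;        /* stand-in for the real variable */

/*
 * Only ever move the boundary upwards: callers pass every candidate
 * end address and the helper keeps the maximum.
 */
static inline void mem_hotplug_update_boundary(paddr_t end)
{
    if ( end > mem_hotplug )
        mem_hotplug = end;
}
```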

Jan




Re: [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end

2022-01-24 Thread Jan Beulich
On 23.09.2021 14:02, Wei Chen wrote:
> x86 provides a mem_hotplug variable to maintain the memory hotplug
> end address. This variable can be accessed outside of mm.c. We want
> some code from mm.c to be reusable by other architectures without
> memory hotplug ability. So in this patch, we introduce these two
> helpers to replace direct mem_hotplug accesses. This will give the
> ability to stub these two APIs.
> 
> Signed-off-by: Wei Chen 
> ---
>  xen/include/asm-x86/mm.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
> index cb90527499..af2fc4b0cd 100644
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -475,6 +475,16 @@ static inline int get_page_and_type(struct page_info 
> *page,
>  
>  extern paddr_t mem_hotplug;
>  
> +static inline void mem_hotplug_update_boundary(paddr_t end)
> +{
> +mem_hotplug = end;
> +}
> +
> +static inline paddr_t mem_hotplug_boundary(void)
> +{
> +return mem_hotplug;
> +}
> +
>  
> /**
>   * With shadow pagetables, the different kinds of address start
>   * to get get confusing.

Imo for this to make sense you want to also use the new functions
right away in the place(s) where the direct access(es) get(s) in
your way.

Jan




[PATCH v2] x86/pvh: fix population of the low 1MB for dom0

2022-01-24 Thread Roger Pau Monne
RMRRs are set up ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.

Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.

Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory')
Signed-off-by: Roger Pau Monné 
Reviewed-by: Jan Beulich 
---
Changes since v1:
 - Fix indentation.
 - Expand existing assert.
---
 xen/arch/x86/hvm/dom0_build.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b00e45885c..63dceb2116 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -466,11 +466,16 @@ static int __init pvh_populate_p2m(struct domain *d)
 for ( i = rc = 0; i < MB1_PAGES; ++i )
 {
 p2m_type_t p2mt;
+mfn_t mfn = get_gfn_query(d, i, );
 
-if ( mfn_eq(get_gfn_query(d, i, ), INVALID_MFN) )
+if ( mfn_eq(mfn, INVALID_MFN) )
 rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
 else
-ASSERT(p2mt == p2m_ram_rw);
+/*
+ * If the p2m entry is already set it must belong to a RMRR and
+ * already be identity mapped, or be a RAM region.
+ */
+ASSERT(p2mt == p2m_ram_rw || mfn_eq(mfn, _mfn(i)));
 put_gfn(d, i);
 if ( rc )
 {
-- 
2.34.1




Re: [PATCH] x86/pvh: fix population of the low 1MB for dom0

2022-01-24 Thread Roger Pau Monné
On Mon, Jan 24, 2022 at 05:01:04PM +0100, Jan Beulich wrote:
> On 24.01.2022 16:23, Roger Pau Monne wrote:
> > RMRRs are setup ahead of populating the p2m and hence the ASSERT when
> > populating the low 1MB needs to be relaxed when it finds an existing
> > entry: it's either RAM or a RMRR resulting from the IOMMU setup.
> > 
> > Rework the logic a bit and introduce a local mfn variable in order to
> > assert that if the gfn is populated and not RAM it is an identity map.
> > 
> > Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 
> > memory')
> > Signed-off-by: Roger Pau Monné 
> 
> Reviewed-by: Jan Beulich 
> albeit ...
> 
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d)
> >  for ( i = rc = 0; i < MB1_PAGES; ++i )
> >  {
> >  p2m_type_t p2mt;
> > +mfn_t mfn;
> >  
> > -if ( mfn_eq(get_gfn_query(d, i, ), INVALID_MFN) )
> > +mfn = get_gfn_query(d, i, );
> 
> ... preferably with this becoming the initializer of the variable then
> and ...
> 
> > +if ( mfn_eq(mfn, INVALID_MFN) )
> >  rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
> > -else
> > -ASSERT(p2mt == p2m_ram_rw);
> > +else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) )
> > +/*
> > + * If the p2m entry is already set it must belong to a 
> > RMRR and
> > + * already be identity mapped, or be a RAM region.
> > + */
> > +ASSERT_UNREACHABLE();
> 
> ... (not just preferably) indentation reduced by a level here. (I wonder
> why you didn't simply extend the existing ASSERT() - ASSERT_UNREACHABLE()
> is nice in certain cases, but the expression it logs is entirely unhelpful
> for disambiguating the reason without looking at the source.)

Indeed, that's better. Sorry about the indentation, not sure what my
editor has done here.

Thanks, Roger.



Re: [PATCH] x86/pvh: print dom0 memory map

2022-01-24 Thread Jan Beulich
On 24.01.2022 16:51, Roger Pau Monne wrote:
> I find it useful for debugging certain issues to have the memory map
> dom0 is using, so print it when using `dom0=verbose` on the command
> line.
> 
> Signed-off-by: Roger Pau Monné 

Reviewed-by: Jan Beulich 
with one further request:

> --- a/xen/arch/x86/e820.c
> +++ b/xen/arch/x86/e820.c
> @@ -88,7 +88,7 @@ static void __init add_memory_region(unsigned long long 
> start,
>  e820.nr_map++;
>  }
>  
> -static void __init print_e820_memory_map(struct e820entry *map, unsigned int 
> entries)
> +void __init print_e820_memory_map(struct e820entry *map, unsigned int 
> entries)

While making this accessible from the outside, could you please
make "map" pointer-to-const?

Jan




[PATCH] libxl: force netback to wait for hotplug execution before connecting

2022-01-24 Thread Roger Pau Monne
By writing an empty "hotplug-status" xenstore node in the backend path
libxl can force Linux netback to wait for hotplug script execution
before proceeding to the 'connected' state.

This is required so that netback doesn't skip state 2 (InitWait) and
thus blocks libxl waiting for such state in order to launch the
hotplug script (see libxl__wait_device_connection).

Reported-by: James Dingwall 
Signed-off-by: Roger Pau Monné 
Tested-by: James Dingwall 
---
Cc: Wei Liu 
Cc: Paul Durrant 
---
 tools/libs/light/libxl_nic.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/libs/light/libxl_nic.c b/tools/libs/light/libxl_nic.c
index 0b45469dca..0b9e70c9d1 100644
--- a/tools/libs/light/libxl_nic.c
+++ b/tools/libs/light/libxl_nic.c
@@ -248,6 +248,13 @@ static int libxl__set_xenstore_nic(libxl__gc *gc, uint32_t domid,
 flexarray_append(ro_front, "mtu");
 flexarray_append(ro_front, GCSPRINTF("%u", nic->mtu));
 
+/*
+ * Force backend to wait for hotplug script execution before switching to
+ * connected state.
+ */
+flexarray_append(back, "hotplug-status");
+flexarray_append(back, "");
+
 return 0;
 }
 
-- 
2.34.1




Re: [PATCH] x86/pvh: fix population of the low 1MB for dom0

2022-01-24 Thread Jan Beulich
On 24.01.2022 16:23, Roger Pau Monne wrote:
> RMRRs are set up ahead of populating the p2m and hence the ASSERT when
> populating the low 1MB needs to be relaxed when it finds an existing
> entry: it's either RAM or a RMRR resulting from the IOMMU setup.
> 
> Rework the logic a bit and introduce a local mfn variable in order to
> assert that if the gfn is populated and not RAM it is an identity map.
> 
> Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory')
> Signed-off-by: Roger Pau Monné 

Reviewed-by: Jan Beulich 
albeit ...

> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d)
>  for ( i = rc = 0; i < MB1_PAGES; ++i )
>  {
>  p2m_type_t p2mt;
> +mfn_t mfn;
>  
> -if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) )
> +mfn = get_gfn_query(d, i, &p2mt);

... preferably with this becoming the initializer of the variable then
and ...

> +if ( mfn_eq(mfn, INVALID_MFN) )
>  rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
> -else
> -ASSERT(p2mt == p2m_ram_rw);
> +else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) )
> +/*
> + * If the p2m entry is already set it must belong to a RMRR and
> + * already be identity mapped, or be a RAM region.
> + */
> +ASSERT_UNREACHABLE();

... (not just preferably) indentation reduced by a level here. (I wonder
why you didn't simply extend the existing ASSERT() - ASSERT_UNREACHABLE()
is nice in certain cases, but the expression it logs is entirely unhelpful
for disambiguating the reason without looking at the source.)

Jan




[PATCH] x86/pvh: print dom0 memory map

2022-01-24 Thread Roger Pau Monne
I find it useful for debugging certain issues to have the memory map
dom0 is using, so print it when using `dom0=verbose` on the command
line.

Signed-off-by: Roger Pau Monné 
---
 xen/arch/x86/e820.c | 2 +-
 xen/arch/x86/hvm/dom0_build.c   | 6 ++
 xen/arch/x86/include/asm/e820.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index aa602773bb..82e9ac46a0 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -88,7 +88,7 @@ static void __init add_memory_region(unsigned long long start,
 e820.nr_map++;
 }
 
-static void __init print_e820_memory_map(struct e820entry *map, unsigned int entries)
+void __init print_e820_memory_map(struct e820entry *map, unsigned int entries)
 {
 unsigned int i;
 
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 739bb8adb6..93ebe4f404 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -1271,6 +1271,12 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
 return rc;
 }
 
+if ( opt_dom0_verbose )
+{
+printk("Dom%u memory map:\n", d->domain_id);
+print_e820_memory_map(d->arch.e820, d->arch.nr_e820);
+}
+
 printk("WARNING: PVH is an experimental mode with limited functionality\n");
 return 0;
 }
diff --git a/xen/arch/x86/include/asm/e820.h b/xen/arch/x86/include/asm/e820.h
index 9d8f1ba960..7fcfde3b66 100644
--- a/xen/arch/x86/include/asm/e820.h
+++ b/xen/arch/x86/include/asm/e820.h
@@ -32,6 +32,7 @@ extern int e820_change_range_type(
 extern int e820_add_range(
 struct e820map *, uint64_t s, uint64_t e, uint32_t type);
 extern unsigned long init_e820(const char *, struct e820map *);
+extern void print_e820_memory_map(struct e820entry *map, unsigned int entries);
 extern struct e820map e820;
 extern struct e820map e820_raw;
 
-- 
2.34.1




[PATCH] x86/pvh: fix population of the low 1MB for dom0

2022-01-24 Thread Roger Pau Monne
RMRRs are set up ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.

Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.

Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory')
Signed-off-by: Roger Pau Monné 
---
 xen/arch/x86/hvm/dom0_build.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b00e45885c..739bb8adb6 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d)
 for ( i = rc = 0; i < MB1_PAGES; ++i )
 {
 p2m_type_t p2mt;
+mfn_t mfn;
 
-if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) )
+mfn = get_gfn_query(d, i, &p2mt);
+if ( mfn_eq(mfn, INVALID_MFN) )
 rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
-else
-ASSERT(p2mt == p2m_ram_rw);
+else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) )
+/*
+ * If the p2m entry is already set it must belong to a RMRR and
+ * already be identity mapped, or be a RAM region.
+ */
+ASSERT_UNREACHABLE();
 put_gfn(d, i);
 if ( rc )
 {
-- 
2.34.1




Re: [PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons

2022-01-24 Thread Jan Beulich
On 20.01.2022 16:52, Roger Pau Monné wrote:
> On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote:
>> From: Artem Bityutskiy 
>>
>> Enable local interrupts before requesting C1 on the last two generations
>> of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
>> This decreases average C1 interrupt latency by about 5-10%, as measured
>> with the 'wult' tool.
>>
>> The '->enter()' function of the driver enters C-states with local
>> interrupts disabled by executing the 'monitor' and 'mwait' pair of
>> instructions. If an interrupt happens, the CPU exits the C-state and
>> continues executing instructions after 'mwait'. It does not jump to
>> the interrupt handler, because local interrupts are disabled. The
>> cpuidle subsystem enables interrupts a bit later, after doing some
>> housekeeping.
>>
>> With this patch, we enable local interrupts before requesting C1. In
>> this case, if the CPU wakes up because of an interrupt, it will jump
>> to the interrupt handler right away. The cpuidle housekeeping will be
>> done after the pending interrupt(s) are handled.
>>
>> Enabling interrupts before entering a C-state has measurable impact
>> for faster C-states, like C1. Deeper, but slower C-states like C6 do
>> not really benefit from this sort of change, because their latency is
>> a lot higher comparing to the delay added by cpuidle housekeeping.
>>
>> This change was also tested with cyclictest and dbench. In case of Ice
>> Lake, the average cyclictest latency decreased by 5.1%, and the average
>> 'dbench' throughput increased by about 0.8%. Both tests were run for 4
>> hours with only C1 enabled (all other idle states, including 'POLL',
>> were disabled). CPU frequency was pinned to HFM, and uncore frequency
>> was pinned to the maximum value. The other platforms had similar
>> single-digit percentage improvements.
>>
>> It is worth noting that this patch affects 'cpuidle' statistics a tiny
>> bit.  Before this patch, C1 residency did not include the interrupt
>> handling time, but with this patch, it will include it. This is similar
>> to what happens in case of the 'POLL' state, which also runs with
>> interrupts enabled.
>>
>> Suggested-by: Len Brown 
>> Signed-off-by: Artem Bityutskiy 
>> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3]
>>
>> We don't have a pointer into cpuidle_state_table[] readily available.
>> To compensate, make use of the new flag only appearing for C1 and hence
>> only in the first table entry.
> 
> We could likely use the 8 padding bits in acpi_processor_cx between
> entry_method and method to store a flags field?

When looking there initially, I thought it wouldn't be good to add
a field there. But looking again, I now don't see why I thought what
I thought. (We actually have an 8-bit padding slot there and a
32-bit one.)

> It would seem more resilient, as I guess at some point we could
> enable interrupts for further states?

C1E maybe; I don't think anything beyond, for having higher wakeup
latency anyway.

>> Unlike Linux we want to disable IRQs again after MWAITing, as
>> subsequently invoked functions assume so.
> 
> I'm also wondering whether there could be interrupts that rely on some
> of the housekeeping that's done when returning from mwait. I guess
> it's unlikely for an interrupt handler to have anything to do with the
> TSC not having been restored.

Actually this is a good point you make: We don't want to enable
IRQs when cstate_restore_tsc() is not a no-op, or else we might
confuse the time rendezvous. (I thought that I would remember
TSC to stop only in deeper C-states, but maybe I'm mixing this up
with the APIC timer.)

>> Signed-off-by: Jan Beulich 
> 
> Acked-by: Roger Pau Monné 

Thanks, but as per above - not just yet.

Jan




Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Andre Przywara
On Mon, 24 Jan 2022 12:07:42 +
Ayan Kumar Halder  wrote:

Hi Ayan,

> Many thanks for your feedback. I have one clarification :-
> 
> On 22/01/2022 01:30, Andre Przywara wrote:
> > On Thu, 20 Jan 2022 21:55:27 +
> > Ayan Kumar Halder  wrote:
> >
> > Hi,
> >  
> >> At the moment, Xen is only handling data abort with valid syndrome (i.e.
> >> ISV=0). Unfortunately, this doesn't cover all the instructions a domain
> >> could use to access MMIO regions.
> >>
> >> For instance, a baremetal OS can use any of the following instructions, 
> >> where
> >> x1 contains the address of the MMIO region:
> >>
> >> 1.  ldr x2,[x1],#4  
> > That looks dodgy, since it misaligns the pointer afterwards. MMIO
> > access typically go to device memory, which must be naturally aligned.
> > Just don't give a bad example here and change that to a multiple of 8.
> >  
> >> 2.  ldr w2,[x1],#-4  
> > (this one is fine, btw, because it's a 32-bit read)
> >  
> >> 3.  ldr x2,[x1],#-8
> >> 4.  ldr w2,[x1],#4
> >> 5.  ldrh w2,[x1],#8
> >> 6.  ldrb w2,[x1],#16  
> > More naturally I'd use the data size of the postindex value ...
> > ldr  x2 ... #-8
> > ldr  w2 ... #4
> > ldrh w2 ... #2
> > ldrb w2 ... #1
> >  
> >> 7.  str x2,[x1],#4  
> > This is again problematic, because x1 is not 8-byte aligned anymore
> > after that.
> >  
> >> 8.  str w2,[x1],#-4
> >> 9.  strh w2,[x1],#8
> >> 10. strb w2,[x1],#16
> >>
> >> In the following two instructions, sp contains the address of the MMIO 
> >> region:-  
> > Really? I don't think you should give people funny ideas, just mention
> > that the Rn register could theoretically be the stack pointer.
> >  
> >> 11. ldrb w2,[sp],#16
> >> 12. ldrb wzr,   [sp],#16
> >>
> >> In order to handle post-indexing store/load instructions (like those 
> >> mentioned
> >> above), Xen will need to fetch and decode the instruction.
> >>
> >> This patch only cover post-index store/load instructions from AArch64 mode.
> >> For now, this is left unimplemented for trap from AArch32 mode.
> >>
> >> Signed-off-by: Ayan Kumar Halder 
> >> ---
> >>
> >> Changelog :-
> >> v2 - 1. Updated the rn register after reading from it. (Pointed by Julien,
> >>  Stefano)
> >>   2. Used a union to represent the instruction opcode (Suggested by Bertrand)
> >>   3. Fixed coding style issues (Pointed by Julien)
> >>   4. In the previous patch, I was updating dabt->sign based on the 
> >> signedness
> >>  of imm9. This was incorrect. As mentioned in ARMv8 ARM  DDI 
> >> 0487G.b,
> >>  Page 3221, SSE indicates the signedness of the data item loaded. 
> >> In our
> >>  case, the data item loaded is always unsigned.
> >>
> >> v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants).
> >> Thus, I have removed the check for "instr->code.opc == 0" 
> >> (Suggested by
> >> Andre)
> >>  2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, 
> >> Andre)
> >>  3. Added restriction for "rt != rn" (Suggested by Andre)
> >>  4. Moved union ldr_str_instr_class {} to decode.h. This is the header 
> >> included
> >> by io.c and decode.c (where the union is referred). (Suggested by 
> >> Jan)
> >>  5. Indentation and typo fixes (Suggested by Jan)
> >>
> >> Changes suggested but could not be considered due to reasons :-
> >>  1. Using accessor macros instead of bitfields for 
> >> "ldr_str_instr_class". (Andre)
> >> Reason - I could not find a simple way to represent 9 bit signed 
> >> integer
> >> (ie imm9) without using bitfields. If I use accessor macros, then 
> >> I need
> >> to manually calculate two's complement to obtain the value when 
> >> signed
> >> bit is present.
> >>
> >>  2. I/D cache coherency (Andre)
> >> Reason :- I could not see any instruction to flush the I cache.  
> > First, please try to avoid the term "flush", because it is somewhat
> > overloaded. The architecture speaks of "clean" and "invalidate", which
> > are more precise.
> > Assuming you mean "clean" here: conceptually there is no such thing for
> > the I cache, because it's always clean. The I$ will only be read from
> > the CPU side - from the instruction fetcher - there is nothing written
> > back through it. Every store goes through the data path - always.
> > That is the problem that I tried to sketch you previously: you don't
> > have a guarantee that the instruction you read from memory is the same
> > that the CPU executed. The guest could have changed the instruction
> > after the I$ fetched that. So the CPU will execute (and trap) on
> > instruction X, but you will read Y. I leave it up to your imagination
> > if that could be exploited.  
> 
> I see what you mean.
> 
> Refer Armv8 Arm DDI 0487G.b Page D1-2476, 

Re: [PATCH 2/3] x86/vmsi: add support for extended destination ID in address field

2022-01-24 Thread Jan Beulich
On 20.01.2022 16:23, Roger Pau Monne wrote:
> Both QEMU/KVM and HyperV support using bits 11:5 from the MSI address
> field in order to store the high part of the target APIC ID. This
> allows expanding the maximum APIC ID usable without interrupt
> remapping support from 255 to 32768.
> 
> Note the interface used by QEMU for emulated devices (via the
> XEN_DMOP_inject_msi hypercall) already passes both the address and
> data fields into Xen for processing, so there's no need for any change
> to QEMU there.
> 
> However for PCI passthrough devices QEMU uses the
> XEN_DOMCTL_bind_pt_irq hypercall which does need an addition to the
> gflags field in order to pass the high bits of the APIC destination
> ID.
> 
> Introduce a new CPUID flag to signal the support for the feature. The
> introduced flag covers both the support for extended ID for the
> IO-APIC RTE and the MSI address registers. Such flag is currently only
> exposed when the domain is using vPCI (ie: a PVH dom0).

Because of also covering the IO-APIC side, I think the CPUID aspect of
this really wants splitting into a 3rd patch. That way the MSI and
IO-APIC parts could in principle go in independently, and only the
CPUID one needs to remain at the tail.

> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -66,7 +66,7 @@ static void vmsi_inj_irq(
>  
>  int vmsi_deliver(
>  struct domain *d, int vector,
> -uint8_t dest, uint8_t dest_mode,
> +unsigned int dest, unsigned int dest_mode,

If you change the type of dest_mode, then to "bool" please - see its
only call site.

> @@ -123,7 +125,8 @@ void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *pirq_dpci)
>  }
>  
>  /* Return value, -1 : multi-dests, non-negative value: dest_vcpu_id */
> -int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode)
> +int hvm_girq_dest_2_vcpu_id(struct domain *d, unsigned int dest,
> +unsigned int dest_mode)

Same here then.

> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -54,6 +54,7 @@
>  #define MSI_ADDR_DEST_ID_SHIFT   12
>  #define   MSI_ADDR_DEST_ID_MASK  0x00ff000
>  #define  MSI_ADDR_DEST_ID(dest)  (((dest) << MSI_ADDR_DEST_ID_SHIFT) & MSI_ADDR_DEST_ID_MASK)
> +#define   MSI_ADDR_EXT_DEST_ID_MASK  0xfe0

Especially the immediately preceding macro now becomes kind of stale.

> --- a/xen/drivers/passthrough/x86/hvm.c
> +++ b/xen/drivers/passthrough/x86/hvm.c
> @@ -269,7 +269,7 @@ int pt_irq_create_bind(
>  {
>  case PT_IRQ_TYPE_MSI:
>  {
> -uint8_t dest, delivery_mode;
> +unsigned int dest, delivery_mode;
>  bool dest_mode;

If you touch delivery_mode's type, wouldn't that better become bool?

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -588,6 +588,7 @@ struct xen_domctl_bind_pt_irq {
>  #define XEN_DOMCTL_VMSI_X86_DELIV_MASK   0x007000
>  #define XEN_DOMCTL_VMSI_X86_TRIG_MASK0x008000
>  #define XEN_DOMCTL_VMSI_X86_UNMASKED 0x01
> +#define XEN_DOMCTL_VMSI_X86_EXT_DEST_ID_MASK 0xfe

I'm not convinced it is a good idea to limit the overall destination
ID width to 15 bits here - at the interface level we could as well
permit more bits right away; the implementation would reject too high
a value, of course. Not only with this I further wonder whether the
field shouldn't be unsplit while extending it. You won't get away
without bumping XEN_DOMCTL_INTERFACE_VERSION anyway (unless it was
bumped already for 4.17) since afaics the unused bits of this field
previously weren't checked for being zero. We could easily have 8
bits vector, 16 bits flags, and 32 bits destination ID in the struct.
And there would then still be 8 unused bits (which from now on we
ought to check for being zero).

Jan




Re: possible kernel/libxl race with xl network-attach

2022-01-24 Thread James Dingwall
On Mon, Jan 24, 2022 at 10:07:54AM +0100, Roger Pau Monné wrote:
> On Fri, Jan 21, 2022 at 03:05:07PM +, James Dingwall wrote:
> > On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> > > On Fri, Jan 21, 2022 at 01:34:54PM +, James Dingwall wrote:
> > > > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > > > On Thu, Jan 13, 2022 at 11:19:46AM +, James Dingwall wrote:
> > > > > > 
> > > > > > I have been trying to debug a problem where a vif with the backend
> > > > > > in a
> > > > > > driver domain is added to dom0.  Intermittently the hotplug script 
> > > > > > is
> > > > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > > > enabling some debug for the driver domain kernel and libxl I have
> > > > > > these
> > > > > > messages:
> > > > > > 
> > > > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > > > 
> > > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > > > Successfully created xenvif
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Initialising
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Connected
> > > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > > > 
> > > > > > xl devd (Xen 4.14.3):
> > > > > > 
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > > > token=2/1: deregister slotnum=2
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:1039:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1092:device_backend_callback: calling
> > > > > > device_backend_cleanup
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > > > with path /local/domain/2/backend/vif/0/0
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569280: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1470:device_complete: 

Re: [PATCH 1/3] xen/vioapic: add support for the extended destination ID field

2022-01-24 Thread Jan Beulich
On 20.01.2022 16:23, Roger Pau Monne wrote:
> Such field uses bits 55:48, but only bits 55:49 are used here, because
> bit 48 signals that an RTE entry is in remappable format, which is not
> supported by the vIO-APIC.

Neither here nor in the cover letter you point at a formal specification
of this mode of operation. What I'm aware of are vague indications of
this mode's existence in some of Intel's chipset data sheets. Yet that
leaves open, for example, whether indeed bit 48 cannot be used here.

> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -412,7 +412,8 @@ static void ioapic_inj_irq(
>  
>  static void vioapic_deliver(struct hvm_vioapic *vioapic, unsigned int pin)
>  {
> -uint16_t dest = vioapic->redirtbl[pin].fields.dest_id;
> +uint16_t dest = vioapic->redirtbl[pin].fields.dest_id |
> +(vioapic->redirtbl[pin].fields.ext_dest_id << 8);

What if an existing guest has been writing non-zero in these bits? Can
you really use them here without any further indication by the guest?

Jan




[xen-unstable test] 167802: tolerable FAIL

2022-01-24 Thread osstest service owner
flight 167802 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167802/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 167796
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 167796
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 167796
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 167796
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 167796
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 167796
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 167796
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 167796
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 167796
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 167796
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 167796
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 167796
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass

version targeted for testing:
 xen  fe9be76d880b1d43b9dca471f45af3fd380ecb00
baseline version:
 xen  

Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions

2022-01-24 Thread Ayan Kumar Halder

Hi Andre,

Many thanks for your feedback. I have one clarification :-

On 22/01/2022 01:30, Andre Przywara wrote:

On Thu, 20 Jan 2022 21:55:27 +
Ayan Kumar Halder  wrote:

Hi,


At the moment, Xen is only handling data abort with valid syndrome (i.e.
ISV=0). Unfortunately, this doesn't cover all the instructions a domain
could use to access MMIO regions.

For instance, a baremetal OS can use any of the following instructions, where
x1 contains the address of the MMIO region:

1.  ldr x2,[x1],#4

That looks dodgy, since it misaligns the pointer afterwards. MMIO
access typically go to device memory, which must be naturally aligned.
Just don't give a bad example here and change that to a multiple of 8.


2.  ldr w2,[x1],#-4

(this one is fine, btw, because it's a 32-bit read)


3.  ldr x2,[x1],#-8
4.  ldr w2,[x1],#4
5.  ldrh w2,[x1],#8
6.  ldrb w2,[x1],#16

More naturally I'd use the data size of the postindex value ...
ldr  x2 ... #-8
ldr  w2 ... #4
ldrh w2 ... #2
ldrb w2 ... #1


7.  str x2,[x1],#4

This is again problematic, because x1 is not 8-byte aligned anymore
after that.


8.  str w2,[x1],#-4
9.  strh w2,[x1],#8
10. strb w2,[x1],#16

In the following two instructions, sp contains the address of the MMIO region:-

Really? I don't think you should give people funny ideas, just mention
that the Rn register could theoretically be the stack pointer.


11. ldrb w2,[sp],#16
12. ldrb wzr,   [sp],#16

In order to handle post-indexing store/load instructions (like those mentioned
above), Xen will need to fetch and decode the instruction.

This patch only cover post-index store/load instructions from AArch64 mode.
For now, this is left unimplemented for trap from AArch32 mode.

Signed-off-by: Ayan Kumar Halder 
---

Changelog :-
v2 - 1. Updated the rn register after reading from it. (Pointed by Julien,
 Stefano)
  2. Used a union to represent the instruction opcode (Suggested by Bertrand)
  3. Fixed coding style issues (Pointed by Julien)
  4. In the previous patch, I was updating dabt->sign based on the 
signedness
 of imm9. This was incorrect. As mentioned in ARMv8 ARM  DDI 0487G.b,
 Page 3221, SSE indicates the signedness of the data item loaded. In our
 case, the data item loaded is always unsigned.

v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants).
Thus, I have removed the check for "instr->code.opc == 0" (Suggested by
Andre)
 2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre)
 3. Added restriction for "rt != rn" (Suggested by Andre)
 4. Moved union ldr_str_instr_class {} to decode.h. This is the header 
included
by io.c and decode.c (where the union is referred). (Suggested by Jan)
 5. Indentation and typo fixes (Suggested by Jan)

Changes suggested but could not be considered due to reasons :-
 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". 
(Andre)
Reason - I could not find a simple way to represent 9 bit signed integer
(ie imm9) without using bitfields. If I use accessor macros, then I need
to manually calculate two's complement to obtain the value when signed
bit is present.

 2. I/D cache coherency (Andre)
Reason :- I could not see any instruction to flush the I cache.

First, please try to avoid the term "flush", because it is somewhat
overloaded. The architecture speaks of "clean" and "invalidate", which
are more precise.
Assuming you mean "clean" here: conceptually there is no such thing for
the I cache, because it's always clean. The I$ will only be read from
the CPU side - from the instruction fetcher - there is nothing written
back through it. Every store goes through the data path - always.
That is the problem that I tried to sketch for you previously: you don't
have a guarantee that the instruction you read from memory is the same
one that the CPU executed. The guest could have changed the instruction
after the I$ fetched it. So the CPU will execute (and trap on)
instruction X, but you will read Y. I leave it up to your imagination
whether that could be exploited.


I see what you mean.

Referring to Armv8 ARM DDI 0487G.b, page D1-2476: it says that (for an
instruction/data abort) the faulting virtual address and IPA are saved in
FAR_ELx and HPFAR_EL2 respectively. But I do not see the faulting
instruction saved in any special register. Is there something I am missing?


Else, :( this is a limitation of the architecture (imo). A hypervisor
may be interested in seeing which instruction caused the abort when ISV = 0.


Also, if an instruction is being modified by the guest (after it has
been loaded into the I cache), and the guest does not invalidate the I
cache + ISB, then this is malicious behavior by the guest. Is my
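For reference, the synchronization sequence the Arm ARM requires a guest to execute after modifying an instruction is roughly the following (the use of x0 to hold the VA of the modified instruction is just an assumption for the example):

```
dc  cvau, x0    // clean the D-cache line holding the new instruction to PoU
dsb ish         // wait for the clean to complete
ic  ivau, x0    // invalidate the I-cache line by VA to PoU
dsb ish         // wait for the invalidate to complete
isb             // flush this PE's pipeline and refetch instructions
```

A guest that skips this sequence gets architecturally unpredictable (but containable) behavior, which is why the hypervisor cannot assume the bytes it reads match the trapped instruction.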

[xen-unstable-smoke test] 167805: tolerable all pass - PUSHED

2022-01-24 Thread osstest service owner
flight 167805 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167805/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  ec32910f4f871dce0f0e32dfb36f218fa1a2e869
baseline version:
 xen  fe9be76d880b1d43b9dca471f45af3fd380ecb00

Last test of basis   167789  2022-01-21 20:02:56 Z2 days
Testing same since   167805  2022-01-24 08:01:41 Z0 days1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Artem Bityutskiy 
  Jan Beulich 
  Rafael J. Wysocki 
  Roger Pau Monné 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   fe9be76d88..ec32910f4f  ec32910f4f871dce0f0e32dfb36f218fa1a2e869 -> smoke



[libvirt test] 167804: regressions - FAIL

2022-01-24 Thread osstest service owner
flight 167804 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167804/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 151777
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 151777
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 151777
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a

version targeted for testing:
 libvirt  eee062d7a263a2c26c375812aeee6736bac53dd0
baseline version:
 libvirt  2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  563 days
Failing since151818  2020-07-11 04:18:52 Z  562 days  544 attempts
Testing same since   167791  2022-01-22 04:18:47 Z2 days3 attempts


People who touched revisions under test:
Adolfo Jayme Barrientos 
  Aleksandr Alekseev 
  Aleksei Zakharov 
  Andika Triwidada 
  Andrea Bolognani 
  Ani Sinha 
  Balázs Meskó 
  Barrett Schonefeld 
  Bastian Germann 
  Bastien Orivel 
  BiaoXiang Ye 
  Bihong Yu 
  Binfeng Wu 
  Bjoern Walk 
  Boris Fiuczynski 
  Brian Turek 
  Bruno Haible 
  Chris Mayo 
  Christian Borntraeger 
  Christian Ehrhardt 
  Christian Kirbach 
  Christian Schoenebeck 
  Cole Robinson 
  Collin Walling 
  Cornelia Huck 
  Cédric Bosdonnat 
  Côme Borsoi 
  Daniel Henrique Barboza 
  Daniel Letai 
  Daniel P. Berrange 
  Daniel P. Berrangé 
  Didik Supriadi 
  dinglimin 
  Divya Garg 
  Dmitrii Shcherbakov 
  Dmytro Linkin 
  Eiichi Tsukata 
  Eric Farman 
  Erik Skultety 
  Fabian Affolter 
  Fabian Freyer 
  Fabiano Fidêncio 
  Fangge Jin 
  Farhan Ali 
  Fedora Weblate Translation 
  Franck Ridel 
  Gavi Teitz 
  gongwei 
  Guoyi Tu
  Göran Uddeborg 
  Halil Pasic 
  Han Han 
  Hao Wang 
  Hela Basa 
  Helmut Grohne 
  Hiroki Narukawa 
  Hyman Huang(黄勇) 
  Ian Wienand 
  Ioanna Alifieraki 
  Ivan Teterevkov 
  Jakob Meng 
  Jamie Strandboge 
  Jamie Strandboge 
  Jan Kuparinen 
  jason lee 
  Jean-Baptiste Holcroft 
  Jia Zhou 
  Jianan Gao 
  Jim Fehlig 
  Jin Yan 
  Jinsheng Zhang 
  Jiri Denemark 
  Joachim Falk 
  John Ferlan 
  Jonathan Watt 
  Jonathon Jongsma 
  Julio Faracco 
  Justin Gatzen 
  Ján Tomko 
  Kashyap Chamarthy 
  Kevin Locke 
  Koichi Murase 
  Kristina Hanicova 
  Laine Stump 
  Laszlo Ersek 
  Lee Yarwood 
  Lei Yang 
  Liao Pingfang 
  Lin Ma 
  Lin Ma 
  Lin Ma 
  Liu Yiding 
  Luke Yue 
  Luyao Zhong 
  Marc Hartmayer 
  Marc-André Lureau 
  Marek Marczykowski-Górecki 
  Markus Schade 
  Martin Kletzander 
  Masayoshi Mizuma 
  Matej Cepl 
  Matt Coleman 
  Matt Coleman 
  Mauro Matteo Cascella 
  Meina Li 
  Michal Privoznik 
  Michał Smyk 
  Milo Casagrande 
  Moshe Levi 
  Muha Aliss 
  Nathan 
  Neal Gompa 
  Nick Chevsky 
  Nick Shyrokovskiy 
  Nickys Music Group 
  Nico Pache 
  Nikolay Shirokovskiy 
  Olaf Hering 
  Olesya Gerasimenko 
  Or Ozeri 
  Orion Poplawski 
  Pany 
  Patrick Magauran 
  Paulo de Rezende Pinatti 
  Pavel Hrdina 
  Peng Liang 
  Peter Krempa 
  Pino Toscano 
  Pino Toscano 
  Piotr Drąg 
  Prathamesh Chavan 
  Praveen K Paladugu 
  Richard W.M. Jones 
  Ricky Tigg 
  Robin Lee 
  Rohit Kumar 
  Roman Bogorodskiy 
  Roman Bolshakov 
  Ryan Gahagan 
  Ryan Schmidt 
  Sam Hartman 
  Scott Shambarger 
  Sebastian Mitterle 
  SeongHyun Jo 
  Shalini Chellathurai Saroja 
  Shaojun Yang 
  Shi Lei 
  simmon 
  Simon Chopin 
  Simon Gaiser 
  Simon Rowe 
  Stefan Bader 
  Stefan Berger 
  Stefan Berger 
  

[PATCH v2] xen-mapcache: Avoid entry->lock overflow

2022-01-24 Thread Ross Lagerwall
In some cases, a particular mapcache entry may be mapped 256 times,
causing the lock field to wrap to 0. For example, this may happen when
using emulated NVMe and the guest submits a large scatter-gather write.
At this point, the entry may be remapped, causing QEMU to write the wrong
data or crash (since remapping is not atomic).

Avoid this overflow by widening the lock field to a uint32_t, and also
detect the condition and abort rather than continuing regardless.
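The wrap itself is just modular arithmetic on the old uint8_t counter; a minimal sketch (hypothetical helper, not QEMU code) of why the 256th nested mapping makes the entry look unlocked:

```c
#include <assert.h>
#include <stdint.h>

/* Bump an 8-bit lock counter n times, mimicking n nested map calls
 * each incrementing entry->lock (hypothetical helper, not QEMU code). */
static uint8_t lock_after_maps(unsigned int n)
{
    uint8_t lock = 0;
    while (n--)
        lock++;             /* wraps silently: 255 + 1 == 0 */
    return lock;
}
```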

Signed-off-by: Ross Lagerwall 
---
Changes in v2: Change type to uint32_t since there is a hole there
anyway. The struct size remains at 48 bytes on x86_64.

 hw/i386/xen/xen-mapcache.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index bd47c3d672..f2ef977963 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -52,7 +52,7 @@ typedef struct MapCacheEntry {
 hwaddr paddr_index;
 uint8_t *vaddr_base;
 unsigned long *valid_mapping;
-uint8_t lock;
+uint32_t lock;
 #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
 uint8_t flags;
 hwaddr size;
@@ -355,6 +355,12 @@ tryagain:
 if (lock) {
 MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev));
 entry->lock++;
+if (entry->lock == 0) {
+fprintf(stderr,
+"mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n",
+entry->paddr_index, entry->vaddr_base);
+abort();
+}
 reventry->dma = dma;
 reventry->vaddr_req = mapcache->last_entry->vaddr_base + 
address_offset;
 reventry->paddr_index = mapcache->last_entry->paddr_index;
-- 
2.27.0




[linux-linus test] 167801: tolerable FAIL - PUSHED

2022-01-24 Thread osstest service owner
flight 167801 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167801/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 167684
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 167684
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 167684
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 167684
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 167684
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 167684
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 167684
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 167684
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linuxdd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
baseline version:
 linux455e73a07f6e288b0061dfcf4fcf54fa9fe06458

Last test of basis   167684  2022-01-13 07:47:42 Z   11 days
Failing since167693  2022-01-13 22:41:04 Z   10 days   20 attempts
Testing same since   167801  2022-01-23 22:40:37 Z0 days1 attempts


936 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 

[PATCH 17/19] block: pass a block_device and opf to bio_alloc

2022-01-24 Thread Christoph Hellwig
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c |  5 +
 block/fops.c|  4 +---
 drivers/block/drbd/drbd_receiver.c  | 10 --
 drivers/block/rnbd/rnbd-srv.c   |  5 ++---
 drivers/block/xen-blkback/blkback.c | 11 +--
 drivers/block/zram/zram_drv.c   | 11 ---
 drivers/md/dm-log-writes.c  | 21 -
 drivers/md/dm-thin.c|  9 -
 drivers/md/dm-zoned-metadata.c  | 15 ++-
 drivers/nvdimm/nd_virtio.c  |  6 +++---
 drivers/nvme/target/io-cmd-bdev.c   | 12 ++--
 drivers/nvme/target/passthru.c  |  5 +++--
 drivers/nvme/target/zns.c   |  6 +++---
 drivers/scsi/ufs/ufshpb.c   |  4 ++--
 drivers/target/target_core_iblock.c |  5 ++---
 fs/btrfs/disk-io.c  |  6 +++---
 fs/buffer.c | 14 ++
 fs/crypto/bio.c | 13 +++--
 fs/direct-io.c  |  5 +
 fs/erofs/zdata.c|  5 ++---
 fs/ext4/page-io.c   |  3 +--
 fs/ext4/readpage.c  |  8 
 fs/gfs2/lops.c  |  8 +++-
 fs/gfs2/meta_io.c   |  4 +---
 fs/gfs2/ops_fstype.c|  4 +---
 fs/hfsplus/wrapper.c|  4 +---
 fs/iomap/buffered-io.c  | 16 
 fs/iomap/direct-io.c|  8 ++--
 fs/jfs/jfs_logmgr.c | 11 ++-
 fs/jfs/jfs_metapage.c   |  9 +++--
 fs/mpage.c  |  7 +++
 fs/nfs/blocklayout/blocklayout.c|  4 +---
 fs/nilfs2/segbuf.c  |  4 ++--
 fs/ntfs3/fsntfs.c   |  8 ++--
 fs/ocfs2/cluster/heartbeat.c|  4 +---
 fs/squashfs/block.c | 11 ++-
 fs/xfs/xfs_bio_io.c | 10 --
 fs/xfs/xfs_buf.c|  4 +---
 fs/zonefs/super.c   |  5 ++---
 include/linux/bio.h |  5 +++--
 kernel/power/swap.c |  5 ++---
 mm/page_io.c| 10 --
 42 files changed, 130 insertions(+), 194 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6c3efb0fd12b1..b73c9babd5835 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -347,10 +347,7 @@ EXPORT_SYMBOL(bio_chain);
 struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
unsigned int nr_pages, unsigned int opf, gfp_t gfp)
 {
-   struct bio *new = bio_alloc(gfp, nr_pages);
-
-   bio_set_dev(new, bdev);
-   new->bi_opf = opf;
+   struct bio *new = bio_alloc(bdev, nr_pages, opf, gfp);
 
if (bio) {
bio_chain(bio, new);
diff --git a/block/fops.c b/block/fops.c
index 3a62b8b912750..c683596847731 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -256,9 +256,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
}
	atomic_inc(&dio->ref);
submit_bio(bio);
-   bio = bio_alloc(GFP_KERNEL, nr_pages);
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = opf;
+   bio = bio_alloc(bdev, nr_pages, opf, GFP_KERNEL);
}
 
	blk_finish_plug(&plug);
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index fb59b263deeef..04e3ec12d8b49 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1279,7 +1279,8 @@ static void one_flush_endio(struct bio *bio)
 
 static void submit_one_flush(struct drbd_device *device, struct 
issue_flush_context *ctx)
 {
-   struct bio *bio = bio_alloc(GFP_NOIO, 0);
+   struct bio *bio = bio_alloc(device->ldev->backing_bdev, 0,
+   REQ_OP_FLUSH | REQ_PREFLUSH, GFP_NOIO);
struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
 
if (!octx) {
@@ -1297,10 +1298,8 @@ static void submit_one_flush(struct drbd_device *device, 
struct issue_flush_cont
 
octx->device = device;
octx->ctx = ctx;
-   bio_set_dev(bio, device->ldev->backing_bdev);
bio->bi_private = octx;
bio->bi_end_io = one_flush_endio;
-   bio->bi_opf = REQ_OP_FLUSH | REQ_PREFLUSH;
 
device->flush_jif = jiffies;
	set_bit(FLUSH_PENDING, &device->flags);
@@ -1685,11 +1684,10 @@ int drbd_submit_peer_request(struct drbd_device *device,
 * generated bio, but a bio allocated on behalf of the peer.
 */
 next_bio:
-   bio = 

[PATCH 18/19] block: pass a block_device and opf to bio_init

2022-01-24 Thread Christoph Hellwig
Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment.  A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c   | 27 +--
 block/blk-flush.c |  4 +---
 block/blk-zoned.c |  5 +
 block/fops.c  | 18 +-
 drivers/block/floppy.c|  4 +---
 drivers/block/zram/zram_drv.c |  5 ++---
 drivers/md/bcache/io.c|  3 ++-
 drivers/md/bcache/journal.c   |  4 +---
 drivers/md/bcache/movinggc.c  |  4 ++--
 drivers/md/bcache/request.c   |  2 +-
 drivers/md/bcache/super.c |  8 +++-
 drivers/md/bcache/writeback.c |  4 ++--
 drivers/md/dm.c   |  5 ++---
 drivers/md/md-multipath.c |  2 +-
 drivers/md/md.c   |  8 +++-
 drivers/md/raid5-cache.c  |  2 +-
 drivers/md/raid5-ppl.c|  2 +-
 drivers/md/raid5.c|  4 ++--
 drivers/nvme/target/io-cmd-bdev.c | 10 --
 drivers/nvme/target/passthru.c|  4 ++--
 drivers/nvme/target/zns.c |  4 ++--
 fs/iomap/buffered-io.c|  4 +---
 fs/xfs/xfs_bio_io.c   |  4 +---
 fs/xfs/xfs_log.c  | 14 +++---
 fs/zonefs/super.c |  4 +---
 include/linux/bio.h   |  4 ++--
 26 files changed, 68 insertions(+), 91 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b73c9babd5835..b2133d86e885e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -249,12 +249,12 @@ static void bio_free(struct bio *bio)
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
  */
-void bio_init(struct bio *bio, struct bio_vec *table,
- unsigned short max_vecs)
+void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec 
*table,
+ unsigned short max_vecs, unsigned int opf)
 {
bio->bi_next = NULL;
-   bio->bi_bdev = NULL;
-   bio->bi_opf = 0;
+   bio->bi_bdev = bdev;
+   bio->bi_opf = opf;
bio->bi_flags = 0;
bio->bi_ioprio = 0;
bio->bi_write_hint = 0;
@@ -268,6 +268,8 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 #ifdef CONFIG_BLK_CGROUP
bio->bi_blkg = NULL;
bio->bi_issue.value = 0;
+   if (bdev)
+   bio_associate_blkg(bio);
 #ifdef CONFIG_BLK_CGROUP_IOCOST
bio->bi_iocost_cost = 0;
 #endif
@@ -504,17 +506,14 @@ struct bio *bio_alloc_bioset(struct block_device *bdev, 
unsigned short nr_vecs,
if (unlikely(!bvl))
goto err_free;
 
-   bio_init(bio, bvl, nr_vecs);
+   bio_init(bio, bdev, bvl, nr_vecs, opf);
} else if (nr_vecs) {
-   bio_init(bio, bio->bi_inline_vecs, BIO_INLINE_VECS);
+   bio_init(bio, bdev, bio->bi_inline_vecs, BIO_INLINE_VECS, opf);
} else {
-   bio_init(bio, NULL, 0);
+   bio_init(bio, bdev, NULL, 0, opf);
}
 
bio->bi_pool = bs;
-   if (bdev)
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = opf;
return bio;
 
 err_free:
@@ -542,7 +541,8 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned short 
nr_iovecs)
bio = kmalloc(struct_size(bio, bi_inline_vecs, nr_iovecs), gfp_mask);
if (unlikely(!bio))
return NULL;
-   bio_init(bio, nr_iovecs ? bio->bi_inline_vecs : NULL, nr_iovecs);
+   bio_init(bio, NULL, nr_iovecs ? bio->bi_inline_vecs : NULL, nr_iovecs,
+0);
bio->bi_pool = NULL;
return bio;
 }
@@ -1756,9 +1756,8 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, struct 
block_device *bdev,
cache->free_list = bio->bi_next;
cache->nr--;
put_cpu();
-   bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = opf;
+   bio_init(bio, bdev, nr_vecs ? bio->bi_inline_vecs : NULL,
+nr_vecs, opf);
bio->bi_pool = bs;
bio_set_flag(bio, BIO_PERCPU_CACHE);
return bio;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e4df894189ced..c689687248706 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -460,9 +460,7 @@ int blkdev_issue_flush(struct block_device *bdev)
 {
struct bio bio;
 
-   bio_init(&bio, NULL, 0);
-   bio_set_dev(&bio, bdev);
-   bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+   bio_init(&bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
	return submit_bio_wait(&bio);
 }
 EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 

[PATCH 15/19] block: pass a block_device and opf to bio_alloc_bioset

2022-01-24 Thread Christoph Hellwig
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assignment.  NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c | 30 +
 block/bounce.c  |  6 ++
 drivers/block/drbd/drbd_actlog.c|  5 ++---
 drivers/block/drbd/drbd_bitmap.c|  7 +++
 drivers/md/bcache/request.c | 12 +---
 drivers/md/dm-crypt.c   |  5 ++---
 drivers/md/dm-io.c  |  5 ++---
 drivers/md/dm-writecache.c  |  7 ---
 drivers/md/dm.c |  5 +++--
 drivers/md/md.c | 16 +++
 drivers/md/raid1.c  |  3 ++-
 drivers/md/raid10.c |  6 ++
 drivers/md/raid5-cache.c|  8 
 drivers/md/raid5-ppl.c  | 11 +--
 drivers/target/target_core_iblock.c |  6 ++
 fs/btrfs/extent_io.c|  2 +-
 fs/f2fs/data.c  |  7 +++
 fs/iomap/buffered-io.c  |  6 +++---
 include/linux/bio.h |  7 ---
 19 files changed, 75 insertions(+), 79 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a0166f29a05c3..9afc0c2aca6e4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -417,8 +417,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 
 /**
  * bio_alloc_bioset - allocate a bio for I/O
+ * @bdev:  block device to allocate the bio for (can be %NULL)
+ * @nr_vecs:   number of bvecs to pre-allocate
+ * @opf:   operation and flags for bio
  * @gfp_mask:   the GFP_* mask given to the slab allocator
- * @nr_iovecs: number of iovecs to pre-allocate
  * @bs:the bio_set to allocate from.
  *
  * Allocate a bio from the mempools in @bs.
@@ -447,15 +449,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  *
  * Returns: Pointer to new bio on success, NULL on failure.
  */
-struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned short nr_iovecs,
+struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs,
+unsigned int opf, gfp_t gfp_mask,
 struct bio_set *bs)
 {
gfp_t saved_gfp = gfp_mask;
struct bio *bio;
void *p;
 
-   /* should not use nobvec bioset for nr_iovecs > 0 */
-   if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) && nr_iovecs > 0))
+   /* should not use nobvec bioset for nr_vecs > 0 */
+   if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) && nr_vecs > 0))
return NULL;
 
/*
@@ -492,26 +495,29 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned 
short nr_iovecs,
return NULL;
 
bio = p + bs->front_pad;
-   if (nr_iovecs > BIO_INLINE_VECS) {
+   if (nr_vecs > BIO_INLINE_VECS) {
struct bio_vec *bvl = NULL;
 
-   bvl = bvec_alloc(&bs->bvec_pool, &nr_iovecs, gfp_mask);
+   bvl = bvec_alloc(&bs->bvec_pool, &nr_vecs, gfp_mask);
if (!bvl && gfp_mask != saved_gfp) {
punt_bios_to_rescuer(bs);
gfp_mask = saved_gfp;
-   bvl = bvec_alloc(&bs->bvec_pool, &nr_iovecs, gfp_mask);
+   bvl = bvec_alloc(&bs->bvec_pool, &nr_vecs, gfp_mask);
}
if (unlikely(!bvl))
goto err_free;
 
-   bio_init(bio, bvl, nr_iovecs);
-   } else if (nr_iovecs) {
+   bio_init(bio, bvl, nr_vecs);
+   } else if (nr_vecs) {
bio_init(bio, bio->bi_inline_vecs, BIO_INLINE_VECS);
} else {
bio_init(bio, NULL, 0);
}
 
bio->bi_pool = bs;
+   if (bdev)
+   bio_set_dev(bio, bdev);
+   bio->bi_opf = opf;
return bio;
 
 err_free:
@@ -767,7 +773,7 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, 
struct bio_set *bs)
 {
struct bio *b;
 
-   b = bio_alloc_bioset(gfp_mask, 0, bs);
+   b = bio_alloc_bioset(NULL, 0, 0, gfp_mask, bs);
if (!b)
return NULL;
 
@@ -1743,7 +1749,7 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned 
short nr_vecs,
struct bio *bio;
 
if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE) || nr_vecs > BIO_INLINE_VECS)
-   return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
+   return bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
 
cache = per_cpu_ptr(bs->cache, get_cpu());
if (cache->free_list) {
@@ -1757,7 +1763,7 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned 
short nr_vecs,
return bio;
}

[PATCH 16/19] block: pass a block_device and opf to bio_alloc_kiocb

2022-01-24 Thread Christoph Hellwig
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_kiocb to optimize the assignment.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c | 12 
 block/fops.c| 17 -
 include/linux/bio.h |  4 ++--
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9afc0c2aca6e4..6c3efb0fd12b1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1731,7 +1731,9 @@ EXPORT_SYMBOL(bioset_init_from_src);
 /**
  * bio_alloc_kiocb - Allocate a bio from bio_set based on kiocb
  * @kiocb: kiocb describing the IO
+ * @bdev:  block device to allocate the bio for (can be %NULL)
  * @nr_vecs:   number of iovecs to pre-allocate
+ * @opf:   operation and flags for bio
  * @bs:bio_set to allocate from
  *
  * Description:
@@ -1742,14 +1744,14 @@ EXPORT_SYMBOL(bioset_init_from_src);
  *MUST be done from process context, not hard/soft IRQ.
  *
  */
-struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
-   struct bio_set *bs)
+struct bio *bio_alloc_kiocb(struct kiocb *kiocb, struct block_device *bdev,
+   unsigned short nr_vecs, unsigned int opf, struct bio_set *bs)
 {
struct bio_alloc_cache *cache;
struct bio *bio;
 
if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE) || nr_vecs > BIO_INLINE_VECS)
-   return bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
+   return bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, bs);
 
cache = per_cpu_ptr(bs->cache, get_cpu());
if (cache->free_list) {
@@ -1758,12 +1760,14 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, 
unsigned short nr_vecs,
cache->nr--;
put_cpu();
bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
+   bio_set_dev(bio, bdev);
+   bio->bi_opf = opf;
bio->bi_pool = bs;
bio_set_flag(bio, BIO_PERCPU_CACHE);
return bio;
}
put_cpu();
-   bio = bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
+   bio = bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, bs);
bio_set_flag(bio, BIO_PERCPU_CACHE);
return bio;
 }
diff --git a/block/fops.c b/block/fops.c
index 26bf15c770d21..3a62b8b912750 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -190,6 +190,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
struct blkdev_dio *dio;
struct bio *bio;
bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+   unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
loff_t pos = iocb->ki_pos;
int ret = 0;
 
@@ -197,7 +198,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
(bdev_logical_block_size(bdev) - 1))
return -EINVAL;
 
-   bio = bio_alloc_kiocb(iocb, nr_pages, _dio_pool);
+   bio = bio_alloc_kiocb(iocb, bdev, nr_pages, opf, _dio_pool);
 
dio = container_of(bio, struct blkdev_dio, bio);
	atomic_set(&dio->ref, 1);
@@ -223,7 +224,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
	blk_start_plug(&plug);
 
for (;;) {
-   bio_set_dev(bio, bdev);
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio->bi_write_hint = iocb->ki_hint;
bio->bi_private = dio;
@@ -238,11 +238,9 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
}
 
if (is_read) {
-   bio->bi_opf = REQ_OP_READ;
if (dio->flags & DIO_SHOULD_DIRTY)
bio_set_pages_dirty(bio);
} else {
-   bio->bi_opf = dio_bio_write_op(iocb);
task_io_account_write(bio->bi_iter.bi_size);
}
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -259,6 +257,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
	atomic_inc(&dio->ref);
submit_bio(bio);
bio = bio_alloc(GFP_KERNEL, nr_pages);
+   bio_set_dev(bio, bdev);
+   bio->bi_opf = opf;
}
 
	blk_finish_plug(&plug);
@@ -311,6 +311,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
unsigned int nr_pages)
 {
struct block_device *bdev = iocb->ki_filp->private_data;
+   bool is_read = iov_iter_rw(iter) == READ;
+   unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
struct blkdev_dio *dio;
struct bio *bio;
loff_t pos = iocb->ki_pos;
@@ -320,11 +322,10 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb 
*iocb,
(bdev_logical_block_size(bdev) - 1))
return -EINVAL;
 
- 

[PATCH 19/19] block: pass a block_device and opf to bio_reset

2022-01-24 Thread Christoph Hellwig
Pass the block_device that we plan to use this bio for and the
operation to bio_reset to optimize the assignment.  A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c |  6 +-
 drivers/block/pktcdvd.c |  8 ++--
 drivers/md/bcache/journal.c | 12 
 drivers/md/bcache/request.c |  4 ++--
 drivers/md/raid1.c  |  5 ++---
 drivers/md/raid10.c |  8 +++-
 drivers/md/raid5-cache.c|  9 +++--
 drivers/md/raid5.c  |  8 
 fs/btrfs/disk-io.c  |  4 +---
 fs/crypto/bio.c |  8 ++--
 include/linux/bio.h |  9 +
 11 files changed, 29 insertions(+), 52 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b2133d86e885e..03cefe81950f2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -295,6 +295,8 @@ EXPORT_SYMBOL(bio_init);
 /**
  * bio_reset - reinitialize a bio
  * @bio:   bio to reset
+ * @bdev:  block device to use the bio for
+ * @opf:   operation and flags for bio
  *
  * Description:
  *   After calling bio_reset(), @bio will be in the same state as a freshly
@@ -302,11 +304,13 @@ EXPORT_SYMBOL(bio_init);
  *   preserved are the ones that are initialized by bio_alloc_bioset(). See
  *   comment in struct bio.
  */
-void bio_reset(struct bio *bio)
+void bio_reset(struct bio *bio, struct block_device *bdev, unsigned int opf)
 {
bio_uninit(bio);
memset(bio, 0, BIO_RESET_BYTES);
atomic_set(&bio->__bi_remaining, 1);
+   bio->bi_bdev = bdev;
+   bio->bi_opf = opf;
 }
 EXPORT_SYMBOL(bio_reset);
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2b6b70a39e760..3aa5954429462 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1020,9 +1020,8 @@ static void pkt_gather_data(struct pktcdvd_device *pd, 
struct packet_data *pkt)
continue;
 
bio = pkt->r_bios[f];
-   bio_reset(bio);
+   bio_reset(bio, pd->bdev, REQ_OP_READ);
bio->bi_iter.bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
-   bio_set_dev(bio, pd->bdev);
bio->bi_end_io = pkt_end_io_read;
bio->bi_private = pkt;
 
@@ -1034,7 +1033,6 @@ static void pkt_gather_data(struct pktcdvd_device *pd, 
struct packet_data *pkt)
BUG();
 
atomic_inc(&pkt->io_wait);
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
pkt_queue_bio(pd, bio);
frames_read++;
}
@@ -1235,9 +1233,8 @@ static void pkt_start_write(struct pktcdvd_device *pd, 
struct packet_data *pkt)
 {
int f;
 
-   bio_reset(pkt->w_bio);
+   bio_reset(pkt->w_bio, pd->bdev, REQ_OP_WRITE);
pkt->w_bio->bi_iter.bi_sector = pkt->sector;
-   bio_set_dev(pkt->w_bio, pd->bdev);
pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
pkt->w_bio->bi_private = pkt;
 
@@ -1270,7 +1267,6 @@ static void pkt_start_write(struct pktcdvd_device *pd, 
struct packet_data *pkt)
 
/* Start the write request */
atomic_set(&pkt->io_wait, 1);
-   bio_set_op_attrs(pkt->w_bio, REQ_OP_WRITE, 0);
pkt_queue_bio(pd, pkt->w_bio);
 }
 
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6d26c5b06e2b6..7c2ca52ca3e43 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -53,14 +53,12 @@ static int journal_read_bucket(struct cache *ca, struct 
list_head *list,
 reread:left = ca->sb.bucket_size - offset;
len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS);
 
-   bio_reset(bio);
+   bio_reset(bio, ca->bdev, REQ_OP_READ);
bio->bi_iter.bi_sector  = bucket + offset;
-   bio_set_dev(bio, ca->bdev);
bio->bi_iter.bi_size= len << 9;
 
bio->bi_end_io  = journal_read_endio;
bio->bi_private = &cl;
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
bch_bio_map(bio, data);
 
closure_bio_submit(ca->set, bio, &cl);
@@ -771,16 +769,14 @@ static void journal_write_unlocked(struct closure *cl)
 
atomic_long_add(sectors, &ca->meta_sectors_written);
 
-   bio_reset(bio);
+   bio_reset(bio, ca->bdev, REQ_OP_WRITE | 
+ REQ_SYNC | REQ_META | REQ_PREFLUSH | REQ_FUA);
+   bch_bio_map(bio, w->data);
bio->bi_iter.bi_sector  = PTR_OFFSET(k, i);
-   bio_set_dev(bio, ca->bdev);
bio->bi_iter.bi_size = sectors << 9;
 
bio->bi_end_io  = journal_write_endio;
bio->bi_private = w;
-   bio_set_op_attrs(bio, REQ_OP_WRITE,
-REQ_SYNC|REQ_META|REQ_PREFLUSH|REQ_FUA);
- 

[PATCH 14/19] block: pass a block_device and opf to blk_next_bio

2022-01-24 Thread Christoph Hellwig
From: Chaitanya Kulkarni 

All callers need to set the block_device and operation, so lift that into
the common code.

Signed-off-by: Chaitanya Kulkarni 
Signed-off-by: Christoph Hellwig 
---
 block/bio.c   |  6 +-
 block/blk-lib.c   | 19 +--
 block/blk-zoned.c |  9 +++--
 block/blk.h   |  2 --
 drivers/nvme/target/zns.c |  6 +++---
 include/linux/bio.h   |  3 ++-
 6 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1536579ed490a..a0166f29a05c3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -344,10 +344,14 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
-struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
+struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
+   unsigned int nr_pages, unsigned int opf, gfp_t gfp)
 {
struct bio *new = bio_alloc(gfp, nr_pages);
 
+   bio_set_dev(new, bdev);
+   new->bi_opf = opf;
+
if (bio) {
bio_chain(bio, new);
submit_bio(bio);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9245b300ef73e..1b8ced45e4e55 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -82,11 +82,8 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
-   bio = blk_next_bio(bio, 0, gfp_mask);
+   bio = blk_next_bio(bio, bdev, 0, op, gfp_mask);
bio->bi_iter.bi_sector = sector;
-   bio_set_dev(bio, bdev);
-   bio_set_op_attrs(bio, op, 0);
-
bio->bi_iter.bi_size = req_sects << 9;
sector += req_sects;
nr_sects -= req_sects;
@@ -176,14 +173,12 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
max_write_same_sectors = bio_allowed_max_sectors(q);
 
while (nr_sects) {
-   bio = blk_next_bio(bio, 1, gfp_mask);
+   bio = blk_next_bio(bio, bdev, 1, REQ_OP_WRITE_SAME, gfp_mask);
bio->bi_iter.bi_sector = sector;
-   bio_set_dev(bio, bdev);
bio->bi_vcnt = 1;
bio->bi_io_vec->bv_page = page;
bio->bi_io_vec->bv_offset = 0;
bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
-   bio_set_op_attrs(bio, REQ_OP_WRITE_SAME, 0);
 
if (nr_sects > max_write_same_sectors) {
bio->bi_iter.bi_size = max_write_same_sectors << 9;
@@ -252,10 +247,8 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
return -EOPNOTSUPP;
 
while (nr_sects) {
-   bio = blk_next_bio(bio, 0, gfp_mask);
+   bio = blk_next_bio(bio, bdev, 0, REQ_OP_WRITE_ZEROES, gfp_mask);
bio->bi_iter.bi_sector = sector;
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = REQ_OP_WRITE_ZEROES;
if (flags & BLKDEV_ZERO_NOUNMAP)
bio->bi_opf |= REQ_NOUNMAP;
 
@@ -303,11 +296,9 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
return -EPERM;
 
while (nr_sects != 0) {
-   bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
-  gfp_mask);
+   bio = blk_next_bio(bio, bdev, __blkdev_sectors_to_bio_pages(nr_sects),
+  REQ_OP_WRITE, gfp_mask);
bio->bi_iter.bi_sector = sector;
-   bio_set_dev(bio, bdev);
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 
while (nr_sects != 0) {
sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 774ecc598bee2..5ab755d792c81 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -215,9 +215,8 @@ static int blkdev_zone_reset_all_emulated(struct block_device *bdev,
continue;
}
 
-   bio = blk_next_bio(bio, 0, gfp_mask);
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = REQ_OP_ZONE_RESET | REQ_SYNC;
+   bio = blk_next_bio(bio, bdev, 0, REQ_OP_ZONE_RESET | REQ_SYNC,
+  gfp_mask);
bio->bi_iter.bi_sector = sector;
sector += zone_sectors;
 
@@ -306,9 +305,7 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
}
 
while (sector < end_sector) {
-   bio = blk_next_bio(bio, 0, gfp_mask);
-   bio_set_dev(bio, bdev);
-   bio->bi_opf = op | REQ_SYNC;
+   bio = blk_next_bio(bio, bdev, 0, op | REQ_SYNC, gfp_mask);
bio->bi_iter.bi_sector = sector;
sector += zone_sectors;
 
diff --git a/block/blk.h b/block/blk.h
index 

[PATCH 13/19] block: move blk_next_bio to bio.c

2022-01-24 Thread Christoph Hellwig
Keep blk_next_bio next to the core bio infrastructure.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Chaitanya Kulkarni 
---
 block/bio.c | 13 +
 block/blk-lib.c | 13 -
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4312a8085396b..1536579ed490a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -344,6 +344,19 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
+struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
+{
+   struct bio *new = bio_alloc(gfp, nr_pages);
+
+   if (bio) {
+   bio_chain(bio, new);
+   submit_bio(bio);
+   }
+
+   return new;
+}
+EXPORT_SYMBOL_GPL(blk_next_bio);
+
 static void bio_alloc_rescue(struct work_struct *work)
 {
struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9f09beadcbe30..9245b300ef73e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,19 +10,6 @@
 
 #include "blk.h"
 
-struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
-{
-   struct bio *new = bio_alloc(gfp, nr_pages);
-
-   if (bio) {
-   bio_chain(bio, new);
-   submit_bio(bio);
-   }
-
-   return new;
-}
-EXPORT_SYMBOL_GPL(blk_next_bio);
-
 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, int flags,
struct bio **biop)
-- 
2.30.2




[PATCH 12/19] xen-blkback: bio_alloc can't fail if it is allowed to sleep

2022-01-24 Thread Christoph Hellwig
Remove handling of NULL returns from sleeping bio_alloc calls given that
those can't fail.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/xen-blkback/blkback.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 14e452896d04c..6bb2ad7692065 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1327,9 +1327,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring 
*ring,
 seg[i].nsec << 9,
 seg[i].offset) == 0)) {
bio = bio_alloc(GFP_KERNEL, bio_max_segs(nseg - i));
-   if (unlikely(bio == NULL))
-   goto fail_put_bio;
-
biolist[nbio++] = bio;
bio_set_dev(bio, preq.bdev);
bio->bi_private = pending_req;
@@ -1346,9 +1343,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring 
*ring,
BUG_ON(operation_flags != REQ_PREFLUSH);
 
bio = bio_alloc(GFP_KERNEL, 0);
-   if (unlikely(bio == NULL))
-   goto fail_put_bio;
-
biolist[nbio++] = bio;
bio_set_dev(bio, preq.bdev);
bio->bi_private = pending_req;
@@ -1381,14 +1375,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring 
*ring,
free_req(ring, pending_req);
msleep(1); /* back off a bit */
return -EIO;
-
- fail_put_bio:
-   for (i = 0; i < nbio; i++)
-   bio_put(biolist[i]);
-   atomic_set(&pending_req->pendcnt, 1);
-   __end_block_io_op(pending_req, BLK_STS_RESOURCE);
-   msleep(1); /* back off a bit */
-   return -EIO;
 }
 
 
-- 
2.30.2




[PATCH 06/19] dm-crypt: remove clone_init

2022-01-24 Thread Christoph Hellwig
Just open code it next to the bio allocations, which saves a few lines
of code, prepares for future changes, and allows removing the duplicate
bi_opf assignment for the bio_clone_fast case in kcryptd_io_read.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-crypt.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 20abe3486aba1..3c5ecd35d3483 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -234,7 +234,7 @@ static volatile unsigned long dm_crypt_pages_per_client;
 #define DM_CRYPT_MEMORY_PERCENT2
 #define DM_CRYPT_MIN_PAGES_PER_CLIENT  (BIO_MAX_VECS * 16)
 
-static void clone_init(struct dm_crypt_io *, struct bio *);
+static void crypt_endio(struct bio *clone);
 static void kcryptd_queue_crypt(struct dm_crypt_io *io);
 static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
 struct scatterlist *sg);
@@ -1673,7 +1673,10 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io 
*io, unsigned size)
mutex_lock(&cc->bio_alloc_lock);
 
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, &cc->bs);
-   clone_init(io, clone);
+   clone->bi_private = io;
+   clone->bi_end_io = crypt_endio;
+   bio_set_dev(clone, cc->dev->bdev);
+   clone->bi_opf = io->base_bio->bi_opf;
 
remaining_size = size;
 
@@ -1826,16 +1829,6 @@ static void crypt_endio(struct bio *clone)
crypt_dec_pending(io);
 }
 
-static void clone_init(struct dm_crypt_io *io, struct bio *clone)
-{
-   struct crypt_config *cc = io->cc;
-
-   clone->bi_private = io;
-   clone->bi_end_io  = crypt_endio;
-   bio_set_dev(clone, cc->dev->bdev);
-   clone->bi_opf = io->base_bio->bi_opf;
-}
-
 static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
 {
struct crypt_config *cc = io->cc;
@@ -1850,10 +1843,12 @@ static int kcryptd_io_read(struct dm_crypt_io *io, 
gfp_t gfp)
clone = bio_clone_fast(io->base_bio, gfp, &cc->bs);
if (!clone)
return 1;
+   clone->bi_private = io;
+   clone->bi_end_io = crypt_endio;
+   bio_set_dev(clone, cc->dev->bdev);
 
crypt_inc_pending(io);
 
-   clone_init(io, clone);
clone->bi_iter.bi_sector = cc->start + io->sector;
 
if (dm_crypt_integrity_io_alloc(io, clone)) {
-- 
2.30.2




[PATCH 11/19] rnbd-srv: remove struct rnbd_dev_blk_io

2022-01-24 Thread Christoph Hellwig
Only the priv field of rnbd_dev_blk_io is used, so store that value in
bio->bi_private directly and remove the entire bio_set overhead.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rnbd/rnbd-srv-dev.c |  4 +---
 drivers/block/rnbd/rnbd-srv-dev.h | 13 ++---
 drivers/block/rnbd/rnbd-srv.c | 27 ---
 drivers/block/rnbd/rnbd-srv.h |  1 -
 4 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/drivers/block/rnbd/rnbd-srv-dev.c 
b/drivers/block/rnbd/rnbd-srv-dev.c
index 98d3e591a0885..c5d0a03911659 100644
--- a/drivers/block/rnbd/rnbd-srv-dev.c
+++ b/drivers/block/rnbd/rnbd-srv-dev.c
@@ -12,8 +12,7 @@
 #include "rnbd-srv-dev.h"
 #include "rnbd-log.h"
 
-struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
-  struct bio_set *bs)
+struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags)
 {
struct rnbd_dev *dev;
int ret;
@@ -30,7 +29,6 @@ struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t 
flags,
 
dev->blk_open_flags = flags;
bdevname(dev->bdev, dev->name);
-   dev->ibd_bio_set = bs;
 
return dev;
 
diff --git a/drivers/block/rnbd/rnbd-srv-dev.h 
b/drivers/block/rnbd/rnbd-srv-dev.h
index 1a14ece0be726..2c3df02b5e8ec 100644
--- a/drivers/block/rnbd/rnbd-srv-dev.h
+++ b/drivers/block/rnbd/rnbd-srv-dev.h
@@ -14,25 +14,16 @@
 
 struct rnbd_dev {
struct block_device *bdev;
-   struct bio_set  *ibd_bio_set;
fmode_t blk_open_flags;
charname[BDEVNAME_SIZE];
 };
 
-struct rnbd_dev_blk_io {
-   struct rnbd_dev *dev;
-   void *priv;
-   /* have to be last member for front_pad usage of bioset_init */
-   struct bio  bio;
-};
-
 /**
  * rnbd_dev_open() - Open a device
+ * @path:  path to open
  * @flags: open flags
- * @bs:bio_set to use during block io,
  */
-struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
-  struct bio_set *bs);
+struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags);
 
 /**
  * rnbd_dev_close() - Close a device
diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
index 6d228af1dcc35..ff9b389976078 100644
--- a/drivers/block/rnbd/rnbd-srv.c
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -116,9 +116,7 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session 
*srv_sess)
 
 static void rnbd_dev_bi_end_io(struct bio *bio)
 {
-   struct rnbd_dev_blk_io *io = bio->bi_private;
-
-   rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status));
+   rnbd_endio(bio->bi_private, blk_status_to_errno(bio->bi_status));
bio_put(bio);
 }
 
@@ -131,7 +129,6 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
struct rnbd_srv_sess_dev *sess_dev;
u32 dev_id;
int err;
-   struct rnbd_dev_blk_io *io;
struct bio *bio;
short prio;
 
@@ -152,7 +149,7 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
priv->sess_dev = sess_dev;
priv->id = id;
 
-   bio = bio_alloc_bioset(GFP_KERNEL, 1, sess_dev->rnbd_dev->ibd_bio_set);
+   bio = bio_alloc(GFP_KERNEL, 1);
if (bio_add_page(bio, virt_to_page(data), datalen,
offset_in_page(data)) != datalen) {
rnbd_srv_err(sess_dev, "Failed to map data to bio\n");
@@ -160,12 +157,8 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
goto bio_put;
}
 
-   io = container_of(bio, struct rnbd_dev_blk_io, bio);
-   io->dev = sess_dev->rnbd_dev;
-   io->priv = priv;
-
bio->bi_end_io = rnbd_dev_bi_end_io;
-   bio->bi_private = io;
+   bio->bi_private = priv;
bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw));
bio->bi_iter.bi_sector = le64_to_cpu(msg->sector);
bio->bi_iter.bi_size = le32_to_cpu(msg->bi_size);
@@ -260,7 +253,6 @@ static void destroy_sess(struct rnbd_srv_session *srv_sess)
 
 out:
xa_destroy(&srv_sess->index_idr);
-   bioset_exit(&srv_sess->sess_bio_set);
 
pr_info("RTRS Session %s disconnected\n", srv_sess->sessname);
 
@@ -289,16 +281,6 @@ static int create_sess(struct rtrs_srv_sess *rtrs)
return -ENOMEM;
 
srv_sess->queue_depth = rtrs_srv_get_queue_depth(rtrs);
-   err = bioset_init(&srv_sess->sess_bio_set, srv_sess->queue_depth,
- offsetof(struct rnbd_dev_blk_io, bio),
- BIOSET_NEED_BVECS);
-   if (err) {
-   pr_err("Allocating srv_session for path %s failed\n",
-  pathname);
-   kfree(srv_sess);
-   return err;
-   }
-
xa_init_flags(&srv_sess->index_idr, XA_FLAGS_ALLOC);
INIT_LIST_HEAD(&srv_sess->sess_dev_list);
mutex_init(&srv_sess->lock);
@@ -747,8 +729,7 @@ static int process_msg_open(struct rnbd_srv_session 
*srv_sess,
goto 

[PATCH 10/19] rnbd-srv: simplify bio mapping in process_rdma

2022-01-24 Thread Christoph Hellwig
The memory mapped in process_rdma is contiguous, so there is no need
to loop over bio_add_page.  Remove rnbd_bio_map_kern and just open code
the bio allocation and mapping in the caller.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Jack Wang 
Tested-by: Jack Wang 
---
 drivers/block/rnbd/rnbd-srv-dev.c | 57 ---
 drivers/block/rnbd/rnbd-srv-dev.h |  5 ---
 drivers/block/rnbd/rnbd-srv.c | 23 +
 3 files changed, 16 insertions(+), 69 deletions(-)

diff --git a/drivers/block/rnbd/rnbd-srv-dev.c 
b/drivers/block/rnbd/rnbd-srv-dev.c
index b241a099aeae2..98d3e591a0885 100644
--- a/drivers/block/rnbd/rnbd-srv-dev.c
+++ b/drivers/block/rnbd/rnbd-srv-dev.c
@@ -44,60 +44,3 @@ void rnbd_dev_close(struct rnbd_dev *dev)
blkdev_put(dev->bdev, dev->blk_open_flags);
kfree(dev);
 }
-
-void rnbd_dev_bi_end_io(struct bio *bio)
-{
-   struct rnbd_dev_blk_io *io = bio->bi_private;
-
-   rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status));
-   bio_put(bio);
-}
-
-/**
- * rnbd_bio_map_kern   -   map kernel address into bio
- * @data: pointer to buffer to map
- * @bs: bio_set to use.
- * @len: length in bytes
- * @gfp_mask: allocation flags for bio allocation
- *
- * Map the kernel address into a bio suitable for io to a block
- * device. Returns an error pointer in case of error.
- */
-struct bio *rnbd_bio_map_kern(void *data, struct bio_set *bs,
- unsigned int len, gfp_t gfp_mask)
-{
-   unsigned long kaddr = (unsigned long)data;
-   unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-   unsigned long start = kaddr >> PAGE_SHIFT;
-   const int nr_pages = end - start;
-   int offset, i;
-   struct bio *bio;
-
-   bio = bio_alloc_bioset(gfp_mask, nr_pages, bs);
-   if (!bio)
-   return ERR_PTR(-ENOMEM);
-
-   offset = offset_in_page(kaddr);
-   for (i = 0; i < nr_pages; i++) {
-   unsigned int bytes = PAGE_SIZE - offset;
-
-   if (len <= 0)
-   break;
-
-   if (bytes > len)
-   bytes = len;
-
-   if (bio_add_page(bio, virt_to_page(data), bytes,
-   offset) < bytes) {
-   /* we don't support partial mappings */
-   bio_put(bio);
-   return ERR_PTR(-EINVAL);
-   }
-
-   data += bytes;
-   len -= bytes;
-   offset = 0;
-   }
-
-   return bio;
-}
diff --git a/drivers/block/rnbd/rnbd-srv-dev.h 
b/drivers/block/rnbd/rnbd-srv-dev.h
index 0eb23850afb95..1a14ece0be726 100644
--- a/drivers/block/rnbd/rnbd-srv-dev.h
+++ b/drivers/block/rnbd/rnbd-srv-dev.h
@@ -41,11 +41,6 @@ void rnbd_dev_close(struct rnbd_dev *dev);
 
 void rnbd_endio(void *priv, int error);
 
-void rnbd_dev_bi_end_io(struct bio *bio);
-
-struct bio *rnbd_bio_map_kern(void *data, struct bio_set *bs,
- unsigned int len, gfp_t gfp_mask);
-
 static inline int rnbd_dev_get_max_segs(const struct rnbd_dev *dev)
 {
return queue_max_segments(bdev_get_queue(dev->bdev));
diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
index 1ee808fc600cf..6d228af1dcc35 100644
--- a/drivers/block/rnbd/rnbd-srv.c
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -114,6 +114,14 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess)
return sess_dev;
 }
 
+static void rnbd_dev_bi_end_io(struct bio *bio)
+{
+   struct rnbd_dev_blk_io *io = bio->bi_private;
+
+   rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status));
+   bio_put(bio);
+}
+
 static int process_rdma(struct rnbd_srv_session *srv_sess,
struct rtrs_srv_op *id, void *data, u32 datalen,
const void *usr, size_t usrlen)
@@ -144,12 +152,12 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
priv->sess_dev = sess_dev;
priv->id = id;
 
-   /* Generate bio with pages pointing to the rdma buffer */
-   bio = rnbd_bio_map_kern(data, sess_dev->rnbd_dev->ibd_bio_set, datalen, GFP_KERNEL);
-   if (IS_ERR(bio)) {
-   err = PTR_ERR(bio);
-   rnbd_srv_err(sess_dev, "Failed to generate bio, err: %d\n", err);
-   goto sess_dev_put;
+   bio = bio_alloc_bioset(GFP_KERNEL, 1, sess_dev->rnbd_dev->ibd_bio_set);
+   if (bio_add_page(bio, virt_to_page(data), datalen,
+   offset_in_page(data)) != datalen) {
+   rnbd_srv_err(sess_dev, "Failed to map data to bio\n");
+   err = -EINVAL;
+   goto bio_put;
}
 
io = container_of(bio, struct rnbd_dev_blk_io, bio);
@@ -170,7 +178,8 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
 
return 0;
 
-sess_dev_put:
+bio_put:
+   bio_put(bio);
rnbd_put_sess_dev(sess_dev);
 err:
 

[PATCH 09/19] drbd: bio_alloc can't fail if it is allowed to sleep

2022-01-24 Thread Christoph Hellwig
Remove handling of NULL returns from sleeping bio_alloc calls given that
those can't fail.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_receiver.c | 22 --
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 6df2539e215ba..fb59b263deeef 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1281,14 +1281,13 @@ static void submit_one_flush(struct drbd_device *device, struct issue_flush_context *ctx)
 {
struct bio *bio = bio_alloc(GFP_NOIO, 0);
struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
-   if (!bio || !octx) {
-   drbd_warn(device, "Could not allocate a bio, CANNOT ISSUE FLUSH\n");
+
+   if (!octx) {
+   drbd_warn(device, "Could not allocate a octx, CANNOT ISSUE FLUSH\n");
/* FIXME: what else can I do now?  disconnecting or detaching
* really does not help to improve the state of the world, either.
 */
-   kfree(octx);
-   if (bio)
-   bio_put(bio);
+   bio_put(bio);
 
ctx->error = -ENOMEM;
put_ldev(device);
@@ -1646,7 +1645,6 @@ int drbd_submit_peer_request(struct drbd_device *device,
unsigned data_size = peer_req->i.size;
unsigned n_bios = 0;
unsigned nr_pages = (data_size + PAGE_SIZE -1) >> PAGE_SHIFT;
-   int err = -ENOMEM;
 
/* TRIM/DISCARD: for now, always use the helper function
 * blkdev_issue_zeroout(..., discard=true).
@@ -1688,10 +1686,6 @@ int drbd_submit_peer_request(struct drbd_device *device,
 */
 next_bio:
bio = bio_alloc(GFP_NOIO, nr_pages);
-   if (!bio) {
-   drbd_err(device, "submit_ee: Allocation of a bio failed (nr_pages=%u)\n", nr_pages);
-   goto fail;
-   }
/* > peer_req->i.sector, unless this is the first bio */
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, device->ldev->backing_bdev);
@@ -1726,14 +1720,6 @@ int drbd_submit_peer_request(struct drbd_device *device,
drbd_submit_bio_noacct(device, fault_type, bio);
} while (bios);
return 0;
-
-fail:
-   while (bios) {
-   bio = bios;
-   bios = bios->bi_next;
-   bio_put(bio);
-   }
-   return err;
 }
 
 static void drbd_remove_epoch_entry_interval(struct drbd_device *device,
-- 
2.30.2




[PATCH 08/19] dm-thin: use blkdev_issue_flush instead of open coding it

2022-01-24 Thread Christoph Hellwig
Use blkdev_issue_flush, which uses an on-stack bio instead of an
opencoded version with a bio embedded into struct pool.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-thin.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 76a9c2e9aeeea..411a3f56ed90c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -282,8 +282,6 @@ struct pool {
struct dm_bio_prison_cell **cell_sort_array;
 
mempool_t mapping_pool;
-
-   struct bio flush_bio;
 };
 
 static void metadata_operation_failed(struct pool *pool, const char *op, int 
r);
@@ -2906,7 +2904,6 @@ static void __pool_destroy(struct pool *pool)
if (pool->next_mapping)
mempool_free(pool->next_mapping, &pool->mapping_pool);
mempool_exit(&pool->mapping_pool);
-   bio_uninit(&pool->flush_bio);
dm_deferred_set_destroy(pool->shared_read_ds);
dm_deferred_set_destroy(pool->all_io_ds);
kfree(pool);
@@ -2987,7 +2984,6 @@ static struct pool *pool_create(struct mapped_device 
*pool_md,
pool->low_water_triggered = false;
pool->suspended = true;
pool->out_of_data_space = false;
-   bio_init(&pool->flush_bio, NULL, 0);
 
pool->shared_read_ds = dm_deferred_set_create();
if (!pool->shared_read_ds) {
@@ -3194,13 +3190,8 @@ static void metadata_low_callback(void *context)
 static int metadata_pre_commit_callback(void *context)
 {
struct pool *pool = context;
-   struct bio *flush_bio = &pool->flush_bio;
-
-   bio_reset(flush_bio);
-   bio_set_dev(flush_bio, pool->data_dev);
-   flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
 
-   return submit_bio_wait(flush_bio);
+   return blkdev_issue_flush(pool->data_dev);
 }
 
 static sector_t get_dev_size(struct block_device *bdev)
-- 
2.30.2




[PATCH 07/19] dm-snap: use blkdev_issue_flush instead of open coding it

2022-01-24 Thread Christoph Hellwig
Use blkdev_issue_flush, which uses an on-stack bio instead of an
opencoded version with a bio embedded into struct dm_snapshot.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-snap.c | 21 +
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index dcf34c6b05ad3..0d336b5ec5714 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -141,11 +141,6 @@ struct dm_snapshot {
 * for them to be committed.
 */
struct bio_list bios_queued_during_merge;
-
-   /*
-* Flush data after merge.
-*/
-   struct bio flush_bio;
 };
 
 /*
@@ -1127,17 +1122,6 @@ static void snapshot_merge_next_chunks(struct 
dm_snapshot *s)
 
 static void error_bios(struct bio *bio);
 
-static int flush_data(struct dm_snapshot *s)
-{
-   struct bio *flush_bio = &s->flush_bio;
-
-   bio_reset(flush_bio);
-   bio_set_dev(flush_bio, s->origin->bdev);
-   flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-
-   return submit_bio_wait(flush_bio);
-}
-
 static void merge_callback(int read_err, unsigned long write_err, void 
*context)
 {
struct dm_snapshot *s = context;
@@ -1151,7 +1135,7 @@ static void merge_callback(int read_err, unsigned long 
write_err, void *context)
goto shut;
}
 
-   if (flush_data(s) < 0) {
+   if (blkdev_issue_flush(s->origin->bdev) < 0) {
DMERR("Flush after merge failed: shutting down merge");
goto shut;
}
@@ -1340,7 +1324,6 @@ static int snapshot_ctr(struct dm_target *ti, unsigned 
int argc, char **argv)
s->first_merging_chunk = 0;
s->num_merging_chunks = 0;
bio_list_init(&s->bios_queued_during_merge);
-   bio_init(&s->flush_bio, NULL, 0);
 
/* Allocate hash table for COW data */
if (init_hash_tables(s)) {
@@ -1528,8 +1511,6 @@ static void snapshot_dtr(struct dm_target *ti)
 
dm_exception_store_destroy(s->store);
 
-   bio_uninit(&s->flush_bio);
-
dm_put_device(ti, s->cow);
 
dm_put_device(ti, s->origin);
-- 
2.30.2




[PATCH 05/19] dm: bio_alloc can't fail if it is allowed to sleep

2022-01-24 Thread Christoph Hellwig
Remove handling of NULL returns from sleeping bio_alloc calls given that
those can't fail.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-crypt.c  |  5 +
 drivers/md/dm-log-writes.c | 18 --
 drivers/md/dm-thin.c   | 25 +
 drivers/md/dm-zoned-metadata.c | 11 ---
 drivers/md/dm.c|  2 --
 5 files changed, 10 insertions(+), 51 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index d4ae31558826a..20abe3486aba1 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1673,9 +1673,6 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io 
*io, unsigned size)
mutex_lock(&cc->bio_alloc_lock);
 
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, &cc->bs);
-   if (!clone)
-   goto out;
-
clone_init(io, clone);
 
remaining_size = size;
@@ -1702,7 +1699,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io 
*io, unsigned size)
bio_put(clone);
clone = NULL;
}
-out:
+
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
mutex_unlock(&cc->bio_alloc_lock);
 
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 139b09b06eda9..25f5e8d2d417b 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -218,10 +218,6 @@ static int write_metadata(struct log_writes_c *lc, void 
*entry,
size_t ret;
 
bio = bio_alloc(GFP_KERNEL, 1);
-   if (!bio) {
-   DMERR("Couldn't alloc log bio");
-   goto error;
-   }
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, lc->logdev->bdev);
@@ -276,11 +272,6 @@ static int write_inline_data(struct log_writes_c *lc, void 
*entry,
atomic_inc(&lc->io_blocks);
 
bio = bio_alloc(GFP_KERNEL, bio_pages);
-   if (!bio) {
-   DMERR("Couldn't alloc inline data bio");
-   goto error;
-   }
-
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, lc->logdev->bdev);
@@ -322,7 +313,6 @@ static int write_inline_data(struct log_writes_c *lc, void 
*entry,
 error_bio:
bio_free_pages(bio);
bio_put(bio);
-error:
put_io_block(lc);
return -1;
 }
@@ -364,10 +354,6 @@ static int log_one_block(struct log_writes_c *lc,
 
atomic_inc(&lc->io_blocks);
bio = bio_alloc(GFP_KERNEL, bio_max_segs(block->vec_cnt));
-   if (!bio) {
-   DMERR("Couldn't alloc log bio");
-   goto error;
-   }
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, lc->logdev->bdev);
@@ -387,10 +373,6 @@ static int log_one_block(struct log_writes_c *lc,
submit_bio(bio);
bio = bio_alloc(GFP_KERNEL,
bio_max_segs(block->vec_cnt - i));
-   if (!bio) {
-   DMERR("Couldn't alloc log bio");
-   goto error;
-   }
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, lc->logdev->bdev);
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index ec119d2422d5d..76a9c2e9aeeea 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1180,24 +1180,17 @@ static void process_prepared_discard_passdown_pt1(struct dm_thin_new_mapping *m)
}
 
discard_parent = bio_alloc(GFP_NOIO, 1);
-   if (!discard_parent) {
-   DMWARN("%s: unable to allocate top level discard bio for passdown. Skipping passdown.",
-  dm_device_name(tc->pool->pool_md));
-   queue_passdown_pt2(m);
+   discard_parent->bi_end_io = passdown_endio;
+   discard_parent->bi_private = m;
 
-   } else {
-   discard_parent->bi_end_io = passdown_endio;
-   discard_parent->bi_private = m;
-
-   if (m->maybe_shared)
-   passdown_double_checking_shared_status(m, discard_parent);
-   else {
-   struct discard_op op;
+   if (m->maybe_shared)
+   passdown_double_checking_shared_status(m, discard_parent);
+   else {
+   struct discard_op op;
 
-   begin_discard(&op, tc, discard_parent);
-   r = issue_discard(&op, m->data_block, data_end);
-   end_discard(&op, r);
-   }
+   begin_discard(&op, tc, discard_parent);
+   r = issue_discard(&op, m->data_block, data_end);
+   end_discard(&op, r);
}
 }
 
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index ee4626d085574..5718b83cc7182 

[PATCH 04/19] ntfs3: remove ntfs_alloc_bio

2022-01-24 Thread Christoph Hellwig
bio_alloc will never fail if it is allowed to sleep, so there is no
need for this loop.  Also remove the __GFP_HIGH specifier as it doesn't
make sense here given that we'll always fall back to the mempool anyway.

Signed-off-by: Christoph Hellwig 
---
 fs/ntfs3/fsntfs.c | 23 ++-
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/fs/ntfs3/fsntfs.c b/fs/ntfs3/fsntfs.c
index 4de9acb169689..4a255e21ecf5f 100644
--- a/fs/ntfs3/fsntfs.c
+++ b/fs/ntfs3/fsntfs.c
@@ -1443,17 +1443,6 @@ int ntfs_write_bh(struct ntfs_sb_info *sbi, struct 
NTFS_RECORD_HEADER *rhdr,
return err;
 }
 
-static inline struct bio *ntfs_alloc_bio(u32 nr_vecs)
-{
-   struct bio *bio = bio_alloc(GFP_NOFS | __GFP_HIGH, nr_vecs);
-
-   if (!bio && (current->flags & PF_MEMALLOC)) {
-   while (!bio && (nr_vecs /= 2))
-   bio = bio_alloc(GFP_NOFS | __GFP_HIGH, nr_vecs);
-   }
-   return bio;
-}
-
 /*
  * ntfs_bio_pages - Read/write pages from/to disk.
  */
@@ -1496,11 +1485,7 @@ int ntfs_bio_pages(struct ntfs_sb_info *sbi, const struct runs_tree *run,
lbo = ((u64)lcn << cluster_bits) + off;
len = ((u64)clen << cluster_bits) - off;
 new_bio:
-   new = ntfs_alloc_bio(nr_pages - page_idx);
-   if (!new) {
-   err = -ENOMEM;
-   goto out;
-   }
+   new = bio_alloc(GFP_NOFS, nr_pages - page_idx);
if (bio) {
bio_chain(bio, new);
submit_bio(bio);
@@ -1599,11 +1584,7 @@ int ntfs_bio_fill_1(struct ntfs_sb_info *sbi, const struct runs_tree *run)
lbo = (u64)lcn << cluster_bits;
len = (u64)clen << cluster_bits;
 new_bio:
-   new = ntfs_alloc_bio(BIO_MAX_VECS);
-   if (!new) {
-   err = -ENOMEM;
-   break;
-   }
+   new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
if (bio) {
bio_chain(bio, new);
submit_bio(bio);
-- 
2.30.2




[PATCH 03/19] nfs/blocklayout: remove bl_alloc_init_bio

2022-01-24 Thread Christoph Hellwig
bio_alloc will never fail when it can sleep.  Remove the now simple
bl_alloc_init_bio helper and open code it in the only caller.

Signed-off-by: Christoph Hellwig 
---
 fs/nfs/blocklayout/blocklayout.c | 26 +-
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index fe860c5387476..38e063af7e98a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -115,23 +115,6 @@ bl_submit_bio(struct bio *bio)
return NULL;
 }
 
-static struct bio *bl_alloc_init_bio(unsigned int npg,
-   struct block_device *bdev, sector_t disk_sector,
-   bio_end_io_t end_io, struct parallel_io *par)
-{
-   struct bio *bio;
-
-   npg = bio_max_segs(npg);
-   bio = bio_alloc(GFP_NOIO, npg);
-   if (bio) {
-   bio->bi_iter.bi_sector = disk_sector;
-   bio_set_dev(bio, bdev);
-   bio->bi_end_io = end_io;
-   bio->bi_private = par;
-   }
-   return bio;
-}
-
 static bool offset_in_map(u64 offset, struct pnfs_block_dev_map *map)
 {
return offset >= map->start && offset < map->start + map->len;
@@ -171,10 +154,11 @@ do_add_page_to_bio(struct bio *bio, int npg, int rw, sector_t isect,
 
 retry:
if (!bio) {
-   bio = bl_alloc_init_bio(npg, map->bdev,
-   disk_addr >> SECTOR_SHIFT, end_io, par);
-   if (!bio)
-   return ERR_PTR(-ENOMEM);
+   bio = bio_alloc(GFP_NOIO, bio_max_segs(npg));
+   bio->bi_iter.bi_sector = disk_addr >> SECTOR_SHIFT;
+   bio_set_dev(bio, map->bdev);
+   bio->bi_end_io = end_io;
+   bio->bi_private = par;
bio_set_op_attrs(bio, rw, 0);
}
if (bio_add_page(bio, page, *len, offset) < *len) {
-- 
2.30.2




[PATCH 02/19] nilfs2: remove nilfs_alloc_seg_bio

2022-01-24 Thread Christoph Hellwig
bio_alloc will never fail when it can sleep.  Remove the now simple
nilfs_alloc_seg_bio helper and open code it in the only caller.

Signed-off-by: Christoph Hellwig 
---
 fs/nilfs2/segbuf.c | 31 ---
 1 file changed, 4 insertions(+), 27 deletions(-)

diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 43287b0d3e9b6..53b7c6d21cdd8 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -371,29 +371,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf,
return err;
 }
 
-/**
- * nilfs_alloc_seg_bio - allocate a new bio for writing log
- * @nilfs: nilfs object
- * @start: start block number of the bio
- * @nr_vecs: request size of page vector.
- *
- * Return Value: On success, pointer to the struct bio is returned.
- * On error, NULL is returned.
- */
-static struct bio *nilfs_alloc_seg_bio(struct the_nilfs *nilfs, sector_t start,
-  int nr_vecs)
-{
-   struct bio *bio;
-
-   bio = bio_alloc(GFP_NOIO, nr_vecs);
-   if (likely(bio)) {
-   bio_set_dev(bio, nilfs->ns_bdev);
-   bio->bi_iter.bi_sector =
-   start << (nilfs->ns_blocksize_bits - 9);
-   }
-   return bio;
-}
-
 static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf,
   struct nilfs_write_info *wi)
 {
@@ -414,10 +391,10 @@ static int nilfs_segbuf_submit_bh(struct nilfs_segment_buffer *segbuf,
BUG_ON(wi->nr_vecs <= 0);
  repeat:
if (!wi->bio) {
-   wi->bio = nilfs_alloc_seg_bio(wi->nilfs, wi->blocknr + wi->end,
- wi->nr_vecs);
-   if (unlikely(!wi->bio))
-   return -ENOMEM;
+   wi->bio = bio_alloc(GFP_NOIO, wi->nr_vecs);
+   bio_set_dev(wi->bio, wi->nilfs->ns_bdev);
+   wi->bio->bi_iter.bi_sector = (wi->blocknr + wi->end) <<
+   (wi->nilfs->ns_blocksize_bits - 9);
}
 
len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh));
-- 
2.30.2




improve the bio allocation interface v2

2022-01-24 Thread Christoph Hellwig
Hi Jens,

this series is posted early because it has wide-ranging changes and could
use some early ACKs before -rc1.

It changes the interface to the bio allocators to always pass a block_device
and the operation, which is information needed for every bio submitted through
bio_submit.  This means the fields can be directly initialized in bio_init
instead of first being zeroed and thus should help to micro-optimize even
better than the __bio_set_dev that Pavel proposed while also cleaning up code.

I have a follow on series to also deal with the bio cloning interfaces that need
even more love, and additional cleanups for the callers which might be material
for the next merge window.

Changes since v1:
 - fix bio_add_page return value handling in rnbd-srv
 - fix bio cleanup in rnbd-srv
 - fix a few commit message typos
 - fix a bisection hazard in rnbd
 - fix an initialization order issue in bio_alloc_bioset/bio_init

Diffstat:
 block/bio.c |   73 
 block/blk-flush.c   |4 -
 block/blk-lib.c |   32 ++-
 block/blk-zoned.c   |   14 +-
 block/blk.h |2 
 block/bounce.c  |6 --
 block/fops.c|   35 +++--
 drivers/block/drbd/drbd_actlog.c|5 --
 drivers/block/drbd/drbd_bitmap.c|7 +--
 drivers/block/drbd/drbd_receiver.c  |   32 +++
 drivers/block/floppy.c  |4 -
 drivers/block/pktcdvd.c |8 ---
 drivers/block/rnbd/rnbd-srv-dev.c   |   61 --
 drivers/block/rnbd/rnbd-srv-dev.h   |   18 
 drivers/block/rnbd/rnbd-srv.c   |   45 --
 drivers/block/rnbd/rnbd-srv.h   |1 
 drivers/block/xen-blkback/blkback.c |   25 ++--
 drivers/block/zram/zram_drv.c   |   16 ++-
 drivers/md/bcache/io.c  |3 -
 drivers/md/bcache/journal.c |   16 ++-
 drivers/md/bcache/movinggc.c|4 -
 drivers/md/bcache/request.c |   18 +++-
 drivers/md/bcache/super.c   |8 +--
 drivers/md/bcache/writeback.c   |4 -
 drivers/md/dm-crypt.c   |   27 -
 drivers/md/dm-io.c  |5 --
 drivers/md/dm-log-writes.c  |   39 +++
 drivers/md/dm-snap.c|   21 --
 drivers/md/dm-thin.c|   41 +---
 drivers/md/dm-writecache.c  |7 +--
 drivers/md/dm-zoned-metadata.c  |   26 ++--
 drivers/md/dm.c |   12 ++---
 drivers/md/md-multipath.c   |2 
 drivers/md/md.c |   24 +--
 drivers/md/raid1.c  |8 +--
 drivers/md/raid10.c |   14 ++
 drivers/md/raid5-cache.c|   19 +++--
 drivers/md/raid5-ppl.c  |   13 ++
 drivers/md/raid5.c  |   12 ++---
 drivers/nvdimm/nd_virtio.c  |6 +-
 drivers/nvme/target/io-cmd-bdev.c   |   18 +++-
 drivers/nvme/target/passthru.c  |7 +--
 drivers/nvme/target/zns.c   |   14 +++---
 drivers/scsi/ufs/ufshpb.c   |4 -
 drivers/target/target_core_iblock.c |   11 +
 fs/btrfs/disk-io.c  |   10 +---
 fs/btrfs/extent_io.c|2 
 fs/buffer.c |   14 ++
 fs/crypto/bio.c |   13 ++
 fs/direct-io.c  |5 --
 fs/erofs/zdata.c|5 --
 fs/ext4/page-io.c   |3 -
 fs/ext4/readpage.c  |8 +--
 fs/f2fs/data.c  |7 +--
 fs/gfs2/lops.c  |8 +--
 fs/gfs2/meta_io.c   |4 -
 fs/gfs2/ops_fstype.c|4 -
 fs/hfsplus/wrapper.c|4 -
 fs/iomap/buffered-io.c  |   26 +---
 fs/iomap/direct-io.c|8 ---
 fs/jfs/jfs_logmgr.c |   11 -
 fs/jfs/jfs_metapage.c   |9 +---
 fs/mpage.c  |   34 ++--
 fs/nfs/blocklayout/blocklayout.c|   26 +---
 fs/nilfs2/segbuf.c  |   31 +--
 fs/ntfs3/fsntfs.c   |   27 -
 fs/ocfs2/cluster/heartbeat.c|4 -
 fs/squashfs/block.c |   11 ++---
 fs/xfs/xfs_bio_io.c |   14 ++
 fs/xfs/xfs_buf.c|4 -
 fs/xfs/xfs_log.c|   14 +++---
 fs/zonefs/super.c   |9 +---
 include/linux/bio.h |   30 ++
 kernel/power/swap.c |5 --
 mm/page_io.c|   10 +---
 75 files changed, 372 insertions(+), 759 deletions(-)



[PATCH 01/19] fs: remove mpage_alloc

2022-01-24 Thread Christoph Hellwig
open code mpage_alloc in its two callers and simplify the results
because of the context:

 - __mpage_writepage always passes GFP_NOFS and can thus always sleep and
will never get a NULL return from bio_alloc at all.
 - do_mpage_readpage can only get a non-sleeping context for readahead
   which never sets PF_MEMALLOC and thus doesn't need the retry loop
   either.

Both cases will never have __GFP_HIGH set.

Signed-off-by: Christoph Hellwig 
---
 fs/mpage.c | 35 ++-
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 87f5cfef6caa7..06e95d777e940 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -66,29 +66,6 @@ static struct bio *mpage_bio_submit(int op, int op_flags, struct bio *bio)
return NULL;
 }
 
-static struct bio *
-mpage_alloc(struct block_device *bdev,
-   sector_t first_sector, int nr_vecs,
-   gfp_t gfp_flags)
-{
-   struct bio *bio;
-
-   /* Restrict the given (page cache) mask for slab allocations */
-   gfp_flags &= GFP_KERNEL;
-   bio = bio_alloc(gfp_flags, nr_vecs);
-
-   if (bio == NULL && (current->flags & PF_MEMALLOC)) {
-   while (!bio && (nr_vecs /= 2))
-   bio = bio_alloc(gfp_flags, nr_vecs);
-   }
-
-   if (bio) {
-   bio_set_dev(bio, bdev);
-   bio->bi_iter.bi_sector = first_sector;
-   }
-   return bio;
-}
-
 /*
  * support function for mpage_readahead.  The fs supplied get_block might
  * return an up to date buffer.  This is used to map that buffer into
@@ -296,10 +273,11 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
page))
goto out;
}
-   args->bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
-   bio_max_segs(args->nr_pages), gfp);
+   args->bio = bio_alloc(gfp, bio_max_segs(args->nr_pages));
if (args->bio == NULL)
goto confused;
+   bio_set_dev(args->bio, bdev);
+   args->bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
}
 
length = first_hole << blkbits;
@@ -608,10 +586,9 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
page, wbc))
goto out;
}
-   bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
-   BIO_MAX_VECS, GFP_NOFS|__GFP_HIGH);
-   if (bio == NULL)
-   goto confused;
+   bio = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
+   bio_set_dev(bio, bdev);
+   bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
 
wbc_init_bio(wbc, bio);
bio->bi_write_hint = inode->i_write_hint;
-- 
2.30.2




Re: possible kernel/libxl race with xl network-attach

2022-01-24 Thread Roger Pau Monné
On Fri, Jan 21, 2022 at 03:05:07PM +, James Dingwall wrote:
> On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> > On Fri, Jan 21, 2022 at 01:34:54PM +, James Dingwall wrote:
> > > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > > On Thu, Jan 13, 2022 at 11:19:46AM +, James Dingwall wrote:
> > > > > 
> > > > > I have been trying to debug a problem where a vif with the backend
> > > > > in a
> > > > > driver domain is added to dom0.  Intermittently the hotplug script is
> > > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > > enabling some debug for the driver domain kernel and libxl I have
> > > > > these
> > > > > messages:
> > > > > 
> > > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > > 
> > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > > Successfully created xenvif
> > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > /local/domain/0/device/vif/0 -> Initialising
> > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > /local/domain/0/device/vif/0 -> Connected
> > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > > 
> > > > > xl devd (Xen 4.14.3):
> > > > > 
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > > nested ao, parent 0x5633ac567f90
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > state 4
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > > nested ao, parent 0x5633ac567f90
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > state 4
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > > token=2/1: deregister slotnum=2
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:1039:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_device.c:1092:device_backend_callback: calling
> > > > > device_backend_cleanup
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > > with path /local/domain/2/backend/vif/0/0
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569280: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_device.c:1470:device_complete: device
> > > > > /local/domain/2/backend/vif/0/0 add failed
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > > > 
> > > > > the xenstore content for the backend:
> > > > > 
> 

Re: Debian Bug#1004269: Linker segfault while building src:xen

2022-01-24 Thread Jan Beulich
On 23.01.2022 22:52, Hans van Kranenburg wrote:
> (To both the Debian bug # and xen-devel list, reply-all is fine)
> Hi Xen people,
> 
> I just filed a bug at Debian on the binutils package, because since the 
> latest binutils package update (Debian 2.37.50.20220106-2), Xen (both 
> 4.14 and 4.16) fails to build with a segfault at the following point:
> 
> x86_64-linux-gnu-ld -mi386pep --subsystem=10 
> --image-base=0x82d04000 --stack=0,0 --heap=0,0 
> --section-alignment=0x20 --file-alignment=0x20 
> --major-image-version=4 --minor-image-version=16 --major-os-version=2 
> --minor-os-version=0 --major-subsystem-version=2 
> --minor-subsystem-version=0 --no-insert-timestamp --build-id=sha1 -T 
> efi.lds -N prelink.o 
> /builds/xen-team/debian-xen/debian/output/source_dir/xen/common/symbols-dummy.o
>  
> -b pe-x86-64 efi/buildid.o -o 
> /builds/xen-team/debian-xen/debian/output/source_dir/xen/.xen.efi.0x82d04000.0
>  
> && :
> Segmentation fault (core dumped)
> 
> Full message and links to build logs etc are in the initial bug message, 
> to be seen at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004269
> 
> We (Debian Xen Team) are awaiting response, but I thought to also let 
> you know already.
> 
> * Does the above error 'ring a bell'?
> * Can you maybe also reproduce this in a development environment with 
> very latest binutils?

I've tried with a 1.5 weeks old snapshot I had lying about; no
"success". But your and my builds are surely quite different anyway.

> * Maybe someone has a useful comment for the Debian binutils maintainer 
> about what's happening in this step of the build?

It's xen.efi that's being linked; not sure what else to say.

> * Any suggestions about what we can do to help figure this out?

I'm pretty certain this needs debugging from the binutils side, so I
guess you will want to report it there (even more so with 2.38 around
the corner). I guess it's actually debugging ld or bisecting which
provide the best chance of figuring out what's going on.

Jan

> * We'll try to help debug, but will surely appreciate upstream help if 
> things get too technical. It's simply the case that I did not have to 
> look into a very similar issue before, so it's new.
> 
> Thanks!
> Hans
> 




[PATCH v2 4/4] x86/APIC: make connections between seemingly arbitrary numbers

2022-01-24 Thread Jan Beulich
Making adjustments to arbitrarily chosen values shouldn't require
auditing the code for possible derived numbers - such a change should
be doable in a single place, having an effect on all code depending on
that choice.

For one make the TDCR write actually use APIC_DIVISOR. With the
necessary mask constant introduced, also use that in vLAPIC code. While
introducing the constant, drop APIC_TDR_DIV_TMBASE: The bit has been
undefined in halfway recent SDM and PM versions.

And then introduce a constant tying together the scale used when
converting nanoseconds to bus clocks.

No functional change intended.

Signed-off-by: Jan Beulich 
---
I thought we have a generic "glue" macro, but I couldn't find one. Hence
I'm (ab)using _AC().
---
v2: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1078,8 +1078,8 @@ static void __setup_APIC_LVTT(unsigned i
 lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
 apic_write(APIC_LVTT, lvtt_value);
 
-tmp_value = apic_read(APIC_TDCR);
-apic_write(APIC_TDCR, tmp_value | APIC_TDR_DIV_1);
+tmp_value = apic_read(APIC_TDCR) & ~APIC_TDR_DIV_MASK;
+apic_write(APIC_TDCR, tmp_value | _AC(APIC_TDR_DIV_, APIC_DIVISOR));
 
 apic_write(APIC_TMICT, clocks / APIC_DIVISOR);
 }
@@ -1196,6 +1196,8 @@ static void __init check_deadline_errata
  * APIC irq that way.
  */
 
+#define BUS_SCALE_SHIFT 18
+
 static void __init calibrate_APIC_clock(void)
 {
 unsigned long bus_freq; /* KAF: pointer-size avoids compile warns. */
@@ -1249,8 +1251,8 @@ static void __init calibrate_APIC_clock(
 /* set up multipliers for accurate timer code */
 bus_cycle  = 1000000000000UL / bus_freq; /* in pico seconds */
 bus_cycle += (1000000000000UL % bus_freq) * 2 > bus_freq;
-bus_scale  = (1000*262144)/bus_cycle;
-bus_scale += ((1000 * 262144) % bus_cycle) * 2 > bus_cycle;
+bus_scale  = (1000 << BUS_SCALE_SHIFT) / bus_cycle;
+bus_scale += ((1000 << BUS_SCALE_SHIFT) % bus_cycle) * 2 > bus_cycle;
 
 apic_printk(APIC_VERBOSE, "..... bus_scale = %#x\n", bus_scale);
 /* reset APIC to zero timeout value */
@@ -1337,7 +1339,8 @@ int reprogram_timer(s_time_t timeout)
 }
 
 if ( timeout && ((expire = timeout - NOW()) > 0) )
-apic_tmict = min_t(u64, (bus_scale * expire) >> 18, UINT_MAX);
+apic_tmict = min_t(uint64_t, (bus_scale * expire) >> BUS_SCALE_SHIFT,
+   UINT32_MAX);
 
 apic_write(APIC_TMICT, (unsigned long)apic_tmict);
 
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -580,7 +580,7 @@ static uint32_t vlapic_get_tmcct(const s
 static void vlapic_set_tdcr(struct vlapic *vlapic, unsigned int val)
 {
 /* Only bits 0, 1 and 3 are settable; others are MBZ. */
-val &= 0xb;
+val &= APIC_TDR_DIV_MASK;
 vlapic_set_reg(vlapic, APIC_TDCR, val);
 
 /* Update the demangled hw.timer_divisor. */
@@ -887,7 +887,7 @@ void vlapic_reg_write(struct vcpu *v, un
 {
 uint32_t current_divisor = vlapic->hw.timer_divisor;
 
-vlapic_set_tdcr(vlapic, val & 0xb);
+vlapic_set_tdcr(vlapic, val);
 
 vlapic_update_timer(vlapic, vlapic_get_reg(vlapic, APIC_LVTT), false,
 current_divisor);
@@ -1019,7 +1019,7 @@ int guest_wrmsr_x2apic(struct vcpu *v, u
 break;
 
 case APIC_TDCR:
-if ( msr_content & ~APIC_TDR_DIV_1 )
+if ( msr_content & ~APIC_TDR_DIV_MASK )
 return X86EMUL_EXCEPTION;
 break;
 
--- a/xen/arch/x86/include/asm/apicdef.h
+++ b/xen/arch/x86/include/asm/apicdef.h
@@ -106,7 +106,7 @@
 #define APIC_TMICT  0x380
 #define APIC_TMCCT  0x390
 #define APIC_TDCR   0x3E0
-#define APIC_TDR_DIV_TMBASE (1<<2)
+#define APIC_TDR_DIV_MASK   0xB
 #define APIC_TDR_DIV_1  0xB
 #define APIC_TDR_DIV_2  0x0
 #define APIC_TDR_DIV_4  0x1




[PATCH v2 3/4] x86/APIC: skip unnecessary parts of __setup_APIC_LVTT()

2022-01-24 Thread Jan Beulich
In TDT mode there's no point writing TDCR or TMICT, while outside of
that mode there's no need for the MFENCE.

No change intended to overall functioning.

Signed-off-by: Jan Beulich 
---
v2: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1059,24 +1059,25 @@ static void __setup_APIC_LVTT(unsigned i
 {
 unsigned int lvtt_value, tmp_value;
 
-/* NB. Xen uses local APIC timer in one-shot mode. */
-lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
-
 if ( tdt_enabled )
 {
-lvtt_value &= (~APIC_TIMER_MODE_MASK);
-lvtt_value |= APIC_TIMER_MODE_TSC_DEADLINE;
+lvtt_value = APIC_TIMER_MODE_TSC_DEADLINE | LOCAL_TIMER_VECTOR;
+apic_write(APIC_LVTT, lvtt_value);
+
+/*
+ * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+ * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+ * According to Intel, MFENCE can do the serialization here.
+ */
+asm volatile( "mfence" : : : "memory" );
+
+return;
 }
 
+/* NB. Xen uses local APIC timer in one-shot mode. */
+lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
 apic_write(APIC_LVTT, lvtt_value);
 
-/*
- * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
- * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
- * According to Intel, MFENCE can do the serialization here.
- */
-asm volatile( "mfence" : : : "memory" );
-
 tmp_value = apic_read(APIC_TDCR);
 apic_write(APIC_TDCR, tmp_value | APIC_TDR_DIV_1);
 




[PATCH v2 2/4] x86/APIC: calibrate against platform timer when possible

2022-01-24 Thread Jan Beulich
Use the original calibration against PIT only when the platform timer
is PIT. This implicitly excludes the "xen_guest" case from using the PIT
logic (init_pit() fails there, and as of 5e73b2594c54 ["x86/time: minor
adjustments to init_pit()"] using_pit also isn't being set too early
anymore), so the respective hack there can be dropped at the same time.
This also reduces calibration time from 100ms to 50ms, albeit this step
is being skipped as of 0731a56c7c72 ("x86/APIC: no need for timer
calibration when using TDT") anyway.

While re-indenting the PIT logic in calibrate_APIC_clock(), besides
adjusting style also switch around the 2nd TSC/TMCCT read pair, to match
the order of the 1st one, yielding more consistent deltas.

Signed-off-by: Jan Beulich 
---
Open-coding apic_read() in read_tmcct() isn't overly nice, but I wanted
to avoid x2apic_enabled being evaluated twice in close succession. (The
barrier is there just in case only anyway: While this RDMSR isn't
serializing, I'm unaware of any statement whether it can also be
executed speculatively, like RDTSC can.) An option might be to move the
function to apic.c such that it would also be used by
calibrate_APIC_clock().

Unlike the CPU frequencies enumerated in CPUID leaf 0x16 (which aren't
precise), using CPUID[0x15].ECX - if populated - may be an option to
skip calibration altogether. Iirc the value there is precise, but using
the systems I have easy access to I cannot verify this: In the sample
of three I have, none have ECX populated.

I wonder whether the secondary CPU freq measurement (used for display
purposes only) wouldn't better be dropped at this occasion.
---
v2: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1182,20 +1182,6 @@ static void __init check_deadline_errata
"please update microcode to version %#x (or later)\n", rev);
 }
 
-static void __init wait_tick_pvh(void)
-{
-u64 lapse_ns = 1000000000ULL / HZ;
-s_time_t start, curr_time;
-
-start = NOW();
-
-/* Won't wrap around */
-do {
-cpu_relax();
-curr_time = NOW();
-} while ( curr_time - start < lapse_ns );
-}
-
 /*
  * In this function we calibrate APIC bus clocks to the external
  * timer. Unfortunately we cannot use jiffies and the timer irq
@@ -1211,9 +1197,6 @@ static void __init wait_tick_pvh(void)
 
 static void __init calibrate_APIC_clock(void)
 {
-unsigned long long t1, t2;
-unsigned long tt1, tt2;
-unsigned int i;
 unsigned long bus_freq; /* KAF: pointer-size avoids compile warns. */
 unsigned int bus_cycle; /* length of one bus cycle in pico-seconds */
 #define LOOPS_FRAC 10U  /* measure for one tenth of a second */
@@ -1226,39 +1209,38 @@ static void __init calibrate_APIC_clock(
  */
 __setup_APIC_LVTT(0xffffffff);
 
-if ( !xen_guest )
+bus_freq = calibrate_apic_timer();
+if ( !bus_freq )
+{
+unsigned int i, tt1, tt2;
+unsigned long t1, t2;
+
+ASSERT(!xen_guest);
+
 /*
- * The timer chip counts down to zero. Let's wait
- * for a wraparound to start exact measurement:
- * (the current tick might have been already half done)
+ * The timer chip counts down to zero. Let's wait for a wraparound to
+ * start exact measurement (the current tick might have been already
+ * half done):
  */
 wait_8254_wraparound();
-else
-wait_tick_pvh();
 
-/*
- * We wrapped around just now. Let's start:
- */
-t1 = rdtsc_ordered();
-tt1 = apic_read(APIC_TMCCT);
+/* We wrapped around just now. Let's start: */
+t1 = rdtsc_ordered();
+tt1 = apic_read(APIC_TMCCT);
 
-/*
- * Let's wait HZ / LOOPS_FRAC ticks:
- */
-for (i = 0; i < HZ / LOOPS_FRAC; i++)
-if ( !xen_guest )
+/* Let's wait HZ / LOOPS_FRAC ticks: */
+for ( i = 0; i < HZ / LOOPS_FRAC; ++i )
 wait_8254_wraparound();
-else
-wait_tick_pvh();
 
-tt2 = apic_read(APIC_TMCCT);
-t2 = rdtsc_ordered();
+t2 = rdtsc_ordered();
+tt2 = apic_read(APIC_TMCCT);
 
-bus_freq = (tt1 - tt2) * APIC_DIVISOR * LOOPS_FRAC;
+bus_freq = (tt1 - tt2) * APIC_DIVISOR * LOOPS_FRAC;
 
-apic_printk(APIC_VERBOSE, "..... CPU clock speed is %lu.%04lu MHz.\n",
-((unsigned long)(t2 - t1) * LOOPS_FRAC) / 1000000,
-(((unsigned long)(t2 - t1) * LOOPS_FRAC) / 100) % 10000);
+apic_printk(APIC_VERBOSE, "..... CPU clock speed is %lu.%04lu MHz.\n",
+((t2 - t1) * LOOPS_FRAC) / 1000000,
+(((t2 - t1) * LOOPS_FRAC) / 100) % 10000);
+}
 
 apic_printk(APIC_VERBOSE, "..... host bus clock speed is %ld.%04ld MHz.\n",
 bus_freq / 1000000, (bus_freq / 100) % 10000);
--- a/xen/arch/x86/include/asm/apic.h
+++ b/xen/arch/x86/include/asm/apic.h
@@ -192,6 +192,8 @@ extern void record_boot_APIC_mode(void);
 extern enum apic_mode 

[PATCH v2 1/4] x86/time: further improve TSC / CPU freq calibration accuracy

2022-01-24 Thread Jan Beulich
Calibration logic assumes that the platform timer (HPET or ACPI PM
timer) and the TSC are read at about the same time. This assumption may
not hold when a long latency event (e.g. SMI or NMI) occurs between the
two reads. Reduce the risk of reading uncorrelated values by doing at
least four pairs of reads, using the tuple where the delta between the
enclosing TSC reads was smallest. From the fourth iteration onwards bail
if the new TSC delta isn't better (smaller) than the best earlier one.

Signed-off-by: Jan Beulich 
---
When running virtualized, scheduling in the host would also constitute
long latency events. I wonder whether, to compensate for that, we'd want
more than 3 "base" iterations, as I would expect scheduling events to
occur more frequently than e.g. SMI (and with a higher probability of
multiple ones occurring in close succession).
---
v2: Use helper functions to fold duplicate code.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -287,9 +287,47 @@ static char *freq_string(u64 freq)
 return s;
 }
 
-static uint64_t adjust_elapsed(uint64_t elapsed, uint32_t actual,
-   uint32_t target)
+static uint32_t __init read_pt_and_tsc(uint64_t *tsc,
+   const struct platform_timesource *pts)
 {
+uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
+uint32_t best = best;
+unsigned int i;
+
+for ( i = 0; ; ++i )
+{
+uint32_t pt = pts->read_counter();
+uint64_t tsc_cur = rdtsc_ordered();
+uint64_t tsc_delta = tsc_cur - tsc_prev;
+
+if ( tsc_delta < tsc_min )
+{
+tsc_min = tsc_delta;
+*tsc = tsc_cur;
+best = pt;
+}
+else if ( i > 2 )
+break;
+
+tsc_prev = tsc_cur;
+}
+
+return best;
+}
+
+static uint64_t __init calibrate_tsc(const struct platform_timesource *pts)
+{
+uint64_t start, end, elapsed;
+uint32_t count = read_pt_and_tsc(&start, pts);
+uint32_t target = CALIBRATE_VALUE(pts->frequency), actual;
+uint32_t mask = (uint32_t)~0 >> (32 - pts->counter_bits);
+
+while ( ((pts->read_counter() - count) & mask) < target )
+continue;
+
+actual = read_pt_and_tsc(&end, pts) - count;
+elapsed = end - start;
+
 if ( likely(actual > target) )
 {
 /*
@@ -395,8 +433,7 @@ static u64 read_hpet_count(void)
 
 static int64_t __init init_hpet(struct platform_timesource *pts)
 {
-uint64_t hpet_rate, start;
-uint32_t count, target, elapsed;
+uint64_t hpet_rate;
 /*
  * Allow HPET to be setup, but report a frequency of 0 so it's not selected
  * as a timer source. This is required so it can be used in legacy
@@ -467,13 +504,7 @@ static int64_t __init init_hpet(struct p
 
 pts->frequency = hpet_rate;
 
-count = hpet_read32(HPET_COUNTER);
-start = rdtsc_ordered();
-target = CALIBRATE_VALUE(hpet_rate);
-while ( (elapsed = hpet_read32(HPET_COUNTER) - count) < target )
-continue;
-
-return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+return calibrate_tsc(pts);
 }
 
 static void resume_hpet(struct platform_timesource *pts)
@@ -508,22 +539,12 @@ static u64 read_pmtimer_count(void)
 
 static s64 __init init_pmtimer(struct platform_timesource *pts)
 {
-uint64_t start;
-uint32_t count, target, mask, elapsed;
-
 if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
 return 0;
 
 pts->counter_bits = pmtmr_width;
-mask = 0xffffffff >> (32 - pmtmr_width);
-
-count = inl(pmtmr_ioport);
-start = rdtsc_ordered();
-target = CALIBRATE_VALUE(ACPI_PM_FREQUENCY);
-while ( (elapsed = (inl(pmtmr_ioport) - count) & mask) < target )
-continue;
 
-return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+return calibrate_tsc(pts);
 }
 
 static struct platform_timesource __initdata plt_pmtimer =




[PATCH v2 0/4] x86: further improve timer freq calibration accuracy

2022-01-24 Thread Jan Beulich
... plus some tidying (or so I hope). Only the 1st patch was submitted
so far (i.e. as v1), all others are new.

1: time: further improve TSC / CPU freq calibration accuracy
2: APIC: calibrate against platform timer when possible
3: APIC: skip unnecessary parts of __setup_APIC_LVTT()
4: APIC: make connections between seemingly arbitrary numbers

Jan