[libvirt test] 167809: regressions - FAIL
flight 167809 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167809/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-arm64-libvirt                     6 libvirt-build   fail REGR. vs. 151777
 build-i386-libvirt                      6 libvirt-build   fail REGR. vs. 151777
 build-armhf-libvirt                     6 libvirt-build   fail REGR. vs. 151777
 build-amd64-libvirt                     6 libvirt-build   fail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt                1 build-check(1)  blocked n/a
 test-amd64-amd64-libvirt-pair           1 build-check(1)  blocked n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd            1 build-check(1)  blocked n/a
 test-amd64-amd64-libvirt-xsm            1 build-check(1)  blocked n/a
 test-amd64-i386-libvirt                 1 build-check(1)  blocked n/a
 test-amd64-i386-libvirt-pair            1 build-check(1)  blocked n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw             1 build-check(1)  blocked n/a
 test-amd64-i386-libvirt-xsm             1 build-check(1)  blocked n/a
 test-arm64-arm64-libvirt                1 build-check(1)  blocked n/a
 test-arm64-arm64-libvirt-qcow2          1 build-check(1)  blocked n/a
 test-arm64-arm64-libvirt-raw            1 build-check(1)  blocked n/a
 test-armhf-armhf-libvirt-raw            1 build-check(1)  blocked n/a
 test-arm64-arm64-libvirt-xsm            1 build-check(1)  blocked n/a
 test-armhf-armhf-libvirt                1 build-check(1)  blocked n/a
 test-armhf-armhf-libvirt-qcow2          1 build-check(1)  blocked n/a

version targeted for testing:
 libvirt              fa6e29f9789f16fb3262b3d4de5b90b705f603fb
baseline version:
 libvirt              2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  564 days
Failing since        151818  2020-07-11 04:18:52 Z  563 days  545 attempts
Testing same since   167809  2022-01-25 04:20:14 Z    0 days    1 attempts

People who touched revisions under test:
 Adolfo Jayme Barrientos Aleksandr Alekseev Aleksei Zakharov Andika Triwidada Andrea Bolognani Ani Sinha Balázs Meskó Barrett Schonefeld
Bastian Germann Bastien Orivel BiaoXiang Ye Bihong Yu Binfeng Wu Bjoern Walk Boris Fiuczynski Brian Turek Bruno Haible Chris Mayo Christian Borntraeger Christian Ehrhardt Christian Kirbach Christian Schoenebeck Cole Robinson Collin Walling Cornelia Huck Cédric Bosdonnat Côme Borsoi Daniel Henrique Barboza Daniel Letai Daniel P. Berrange Daniel P. Berrangé Didik Supriadi dinglimin Divya Garg Dmitrii Shcherbakov Dmytro Linkin Eiichi Tsukata Eric Farman Erik Skultety Fabian Affolter Fabian Freyer Fabiano Fidêncio Fangge Jin Farhan Ali Fedora Weblate Translation Franck Ridel Gavi Teitz gongwei Guoyi Tu Göran Uddeborg Halil Pasic Han Han Hao Wang Hela Basa Helmut Grohne Hiroki Narukawa Hyman Huang(黄勇) Ian Wienand Ioanna Alifieraki Ivan Teterevkov Jakob Meng Jamie Strandboge Jamie Strandboge Jan Kuparinen jason lee Jean-Baptiste Holcroft Jia Zhou Jianan Gao Jim Fehlig Jin Yan Jinsheng Zhang Jiri Denemark Joachim Falk John Ferlan Jonathan Watt Jonathon Jongsma Julio Faracco Justin Gatzen Ján Tomko Kashyap Chamarthy Kevin Locke Koichi Murase Kristina Hanicova Laine Stump Laszlo Ersek Lee Yarwood Lei Yang Liao Pingfang Lin Ma Lin Ma Lin Ma Liu Yiding Luke Yue Luyao Zhong Marc Hartmayer Marc-André Lureau Marek Marczykowski-Górecki Markus Schade Martin Kletzander Masayoshi Mizuma Matej Cepl Matt Coleman Matt Coleman Mauro Matteo Cascella Meina Li Michal Privoznik Michał Smyk Milo Casagrande Moshe Levi Muha Aliss Nathan Neal Gompa Nick Chevsky Nick Shyrokovskiy Nickys Music Group Nico Pache Nikolay Shirokovskiy Olaf Hering Olesya Gerasimenko Or Ozeri Orion Poplawski Pany Patrick Magauran Paulo de Rezende Pinatti Pavel Hrdina Peng Liang Peter Krempa Pino Toscano Pino Toscano Piotr Drąg Prathamesh Chavan Praveen K Paladugu Richard W.M. 
Jones Ricky Tigg Robin Lee Rohit Kumar Roman Bogorodskiy Roman Bolshakov Ryan Gahagan Ryan Schmidt Sam Hartman Scott Shambarger Sebastian Mitterle SeongHyun Jo Shalini Chellathurai Saroja Shaojun Yang shenjiatong Shi Lei simmon Simon Chopin Simon Gaiser Simon Rowe Stefan Bader Stefan Berger
Re: [XEN PATCH v2 2/5] xen: export get_free_port
On Sun, 23 Jan 2022, Julien Grall wrote:
> > diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> > index da88ad141a..5b0bcaaad4 100644
> > --- a/xen/common/event_channel.c
> > +++ b/xen/common/event_channel.c
> > @@ -232,7 +232,7 @@ int evtchn_allocate_port(struct domain *d, evtchn_port_t port)
> >      return 0;
> >  }
> >
> > -static int get_free_port(struct domain *d)
> > +int get_free_port(struct domain *d)
>
> I dislike the idea to expose get_free_port() (or whichever name we decide)
> because this can be easily misused.
>
> In fact, looking at your next patch (#3), you are misusing it, as it is meant
> to be called with d->event_lock held. I know this doesn't much matter in your
> situation because this is done at boot with no other domains running (or
> potentially any event channel allocation). However, I still think we should
> get the API right.
>
> I am also not entirely happy with open-coding the allocation in
> domain_build.c. Instead, I would prefer if we provide a new helper to
> allocate an unbound event channel. This would be similar to your v1 (I still
> need to review the patch though).

I am happy to go back to v1 and address feedback on that patch. However, I am
having difficulties with the implementation. Jan pointed out:

> > -
> > -    chn->state = ECS_UNBOUND;
>
> This cannot be pulled ahead of the XSM check (or in general anything
> potentially resulting in an error), as check_free_port() relies on
> ->state remaining ECS_FREE until it is known that the calling function
> can't fail anymore.

This makes it difficult to reuse _evtchn_alloc_unbound for the implementation
of evtchn_alloc_unbound. In fact, I couldn't find a way to do it. Instead, I
just created a new public function called "evtchn_alloc_unbound" and renamed
the existing function to "_evtchn_alloc_unbound" (this addresses Jan's
feedback that the static function should be the one starting with "_"). So
the function names are inverted compared to v1.
Please let me know if you have any better suggestions.

diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index da88ad141a..c6b7dd7fbd 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -18,6 +18,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -284,7 +285,27 @@ void evtchn_free(struct domain *d, struct evtchn *chn)
     xsm_evtchn_close_post(chn);
 }

-static int evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc)
+struct evtchn *evtchn_alloc_unbound(struct domain *d, domid_t remote_dom)
+{
+    struct evtchn *chn;
+    int port;
+
+    if ( (port = get_free_port(d)) < 0 )
+        return ERR_PTR(port);
+    chn = evtchn_from_port(d, port);
+
+    evtchn_write_lock(chn);
+
+    chn->state = ECS_UNBOUND;
+    chn->u.unbound.remote_domid = remote_dom;
+    evtchn_port_init(d, chn);
+
+    evtchn_write_unlock(chn);
+
+    return chn;
+}
+
+static int _evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc)
 {
     struct evtchn *chn;
     struct domain *d;
@@ -1195,7 +1216,7 @@ long do_event_channel_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         struct evtchn_alloc_unbound alloc_unbound;

         if ( copy_from_guest(&alloc_unbound, arg, 1) != 0 )
             return -EFAULT;
-        rc = evtchn_alloc_unbound(&alloc_unbound);
+        rc = _evtchn_alloc_unbound(&alloc_unbound);
         if ( !rc && __copy_to_guest(arg, &alloc_unbound, 1) )
             rc = -EFAULT; /* Cleaning up here would be a mess! */
         break;
diff --git a/xen/include/xen/event.h b/xen/include/xen/event.h
index 21c95e14fd..85dcf1d0c4 100644
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -68,6 +68,9 @@ int evtchn_close(struct domain *d1, int port1, bool guest);
 /* Free an event channel. */
 void evtchn_free(struct domain *d, struct evtchn *chn);

+/* Create a new event channel port */
+struct evtchn *evtchn_alloc_unbound(struct domain *d, domid_t remote_dom);
+
 /* Allocate a specific event channel port. */
 int evtchn_allocate_port(struct domain *d, unsigned int port);
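An aside on the error-handling style in the patch above: the new helper signals failure through the returned pointer itself (ERR_PTR(port)), so callers must check it with IS_ERR()/PTR_ERR() rather than comparing against NULL. A minimal standalone sketch of that idiom follows; the macros here are local re-definitions for illustration, not the actual Xen/Linux headers, and alloc_port() is a made-up toy allocator:

#include <stdio.h>
#include <stdint.h>

/* Local re-implementations of the ERR_PTR idiom, for illustration only:
 * an error code in -4095..-1 is encoded into an otherwise invalid pointer
 * value, so a single return slot carries "valid pointer OR errno". */
#define MAX_ERRNO 4095
#define ERR_PTR(x)  ((void *)(intptr_t)(x))
#define PTR_ERR(p)  ((intptr_t)(p))
#define IS_ERR(p)   ((uintptr_t)(p) >= (uintptr_t)-MAX_ERRNO)

static int the_port = 7;

/* Toy allocator: returns a real pointer on success, ERR_PTR(-errno) on failure. */
static int *alloc_port(int fail)
{
    if (fail)
        return ERR_PTR(-28);    /* e.g. the value of -ENOSPC */
    return &the_port;
}

int main(void)
{
    int *ok = alloc_port(0);
    int *bad = alloc_port(1);

    printf("ok: IS_ERR=%d value=%d\n", IS_ERR(ok), *ok);
    printf("bad: IS_ERR=%d err=%ld\n", IS_ERR(bad), (long)PTR_ERR(bad));
    return 0;
}

The payoff is the same one the patch relies on: evtchn_alloc_unbound() can hand back either the channel or the negative error from get_free_port() without needing an extra out-parameter.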
Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver
On Mon, 24 Jan 2022, Julien Grall wrote:
> On 24/01/2022 19:06, Stefano Stabellini wrote:
> > It looks like XEN_DOMCTL_host_node_by_path and
> > XEN_DOMCTL_find_host_compatible_node would also solve the problem but I
> > think that a single hypercall that retrieves the entire host DTB would
> > be easier to implement
>
> DOMCTL should only be used to handle per-domain information. If we want to,
> we could create a new sub-hypercall of either __HYPERVISOR_platform_op or
> __HYPERVISOR_sysctl_op (not sure which one).
>
> AFAICT, both are versioned.
>
> > and more robust in the long term.
> >
> > hypfs has the advantage that it would create an interface more similar
> > to the one people are already used to on Linux systems
> > (/proc/device-tree). xl/libxl would have to scan the whole hypfs tree,
> > which intuitively I think would be slower.
>
> Even if you have the binary blob, you would still have to scan the
> device-tree. That said, it is probably going to be a bit faster
> because you have fewer hypercalls.
>
> However, here this is a trade-off between memory use and speed. If you want
> speed, then you may have to transfer up to 2MB every time. So the question is:
> do we care more about speed or memory usage?
>
> > Also the feature might be harder to implement but I am not sure.
> >
> > I don't have a strong preference and this is not a stable interface (we
> > don't have to be extra paranoid about forward and backward
> > compatibility). So I am fine either way. Let's see what the others think
> > as well.
>
> My preference would be to use hypfs as this is cleaner than exposing a blob.

That's also fine by me. The hypfs implementation probably shouldn't be much
more difficult than something like
XEN_DOMCTL_host_node_by_path/XEN_DOMCTL_find_host_compatible_node.

> However, are we sure we can simply copy the content of the host Device-Tree
> to the guest Device-Tree for SCMI? For instance, I know that for device
> passthrough there are some properties that need to be altered for some
> devices. Hence why it is not present. Although, I vaguely recall having
> written a PoC; not sure if it was posted on the ML.

The SCMI node cannot be copied "as is" from host to guest. It needs a couple
of changes, but they seem feasible as they are limited to the channels exposed
to the guest. (The generic device passthrough case is a lot more difficult.)
Re: [PATCH 11/19] rnbd-srv: remove struct rnbd_dev_blk_io
On Mon, Jan 24, 2022 at 10:11 AM Christoph Hellwig wrote:
>
> Only the priv field of rnbd_dev_blk_io is used, so store the value of
> that in bio->bi_private directly and remove the entire bio_set overhead.
>
> Signed-off-by: Christoph Hellwig

Reviewed-by: Jack Wang

Thanks!

> ---
>  drivers/block/rnbd/rnbd-srv-dev.c |  4 +---
>  drivers/block/rnbd/rnbd-srv-dev.h | 13 ++---
>  drivers/block/rnbd/rnbd-srv.c     | 27 ---
>  drivers/block/rnbd/rnbd-srv.h     |  1 -
>  4 files changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/block/rnbd/rnbd-srv-dev.c b/drivers/block/rnbd/rnbd-srv-dev.c
> index 98d3e591a0885..c5d0a03911659 100644
> --- a/drivers/block/rnbd/rnbd-srv-dev.c
> +++ b/drivers/block/rnbd/rnbd-srv-dev.c
> @@ -12,8 +12,7 @@
>  #include "rnbd-srv-dev.h"
>  #include "rnbd-log.h"
>
> -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
> -                               struct bio_set *bs)
> +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags)
>  {
>         struct rnbd_dev *dev;
>         int ret;
> @@ -30,7 +29,6 @@ struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
>
>         dev->blk_open_flags = flags;
>         bdevname(dev->bdev, dev->name);
> -       dev->ibd_bio_set = bs;
>
>         return dev;
>
> diff --git a/drivers/block/rnbd/rnbd-srv-dev.h b/drivers/block/rnbd/rnbd-srv-dev.h
> index 1a14ece0be726..2c3df02b5e8ec 100644
> --- a/drivers/block/rnbd/rnbd-srv-dev.h
> +++ b/drivers/block/rnbd/rnbd-srv-dev.h
> @@ -14,25 +14,16 @@
>
>  struct rnbd_dev {
>         struct block_device *bdev;
> -       struct bio_set *ibd_bio_set;
>         fmode_t blk_open_flags;
>         char name[BDEVNAME_SIZE];
>  };
>
> -struct rnbd_dev_blk_io {
> -       struct rnbd_dev *dev;
> -       void *priv;
> -       /* have to be last member for front_pad usage of bioset_init */
> -       struct bio bio;
> -};
> -
>  /**
>   * rnbd_dev_open() - Open a device
> + * @path: path to open
>   * @flags: open flags
> - * @bs: bio_set to use during block io,
>   */
> -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
> -                               struct bio_set *bs);
> +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags);
>
>  /**
>   * rnbd_dev_close() - Close a device
> diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
> index 6d228af1dcc35..ff9b389976078 100644
> --- a/drivers/block/rnbd/rnbd-srv.c
> +++ b/drivers/block/rnbd/rnbd-srv.c
> @@ -116,9 +116,7 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess)
>
>  static void rnbd_dev_bi_end_io(struct bio *bio)
>  {
> -       struct rnbd_dev_blk_io *io = bio->bi_private;
> -
> -       rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status));
> +       rnbd_endio(bio->bi_private, blk_status_to_errno(bio->bi_status));
>         bio_put(bio);
>  }
>
> @@ -131,7 +129,6 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
>         struct rnbd_srv_sess_dev *sess_dev;
>         u32 dev_id;
>         int err;
> -       struct rnbd_dev_blk_io *io;
>         struct bio *bio;
>         short prio;
>
> @@ -152,7 +149,7 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
>         priv->sess_dev = sess_dev;
>         priv->id = id;
>
> -       bio = bio_alloc_bioset(GFP_KERNEL, 1, sess_dev->rnbd_dev->ibd_bio_set);
> +       bio = bio_alloc(GFP_KERNEL, 1);
>         if (bio_add_page(bio, virt_to_page(data), datalen,
>                          offset_in_page(data)) != datalen) {
>                 rnbd_srv_err(sess_dev, "Failed to map data to bio\n");
> @@ -160,12 +157,8 @@ static int process_rdma(struct rnbd_srv_session *srv_sess,
>                 goto bio_put;
>         }
>
> -       io = container_of(bio, struct rnbd_dev_blk_io, bio);
> -       io->dev = sess_dev->rnbd_dev;
> -       io->priv = priv;
> -
>         bio->bi_end_io = rnbd_dev_bi_end_io;
> -       bio->bi_private = io;
> +       bio->bi_private = priv;
>         bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw));
>         bio->bi_iter.bi_sector = le64_to_cpu(msg->sector);
>         bio->bi_iter.bi_size = le32_to_cpu(msg->bi_size);
> @@ -260,7 +253,6 @@ static void destroy_sess(struct rnbd_srv_session *srv_sess)
>
>  out:
>         xa_destroy(&srv_sess->index_idr);
> -       bioset_exit(&srv_sess->sess_bio_set);
>
>         pr_info("RTRS Session %s disconnected\n", srv_sess->sessname);
>
> @@ -289,16 +281,6 @@ static int create_sess(struct rtrs_srv_sess *rtrs)
>                 return -ENOMEM;
>
>         srv_sess->queue_depth = rtrs_srv_get_queue_depth(rtrs);
> -       err = bioset_init(&srv_sess->sess_bio_set, srv_sess->queue_depth,
> -                         offsetof(struct rnbd_dev_blk_io, bio),
> -                         BIOSET_NEED_BVECS);
> -       if (err) {
> -               pr_err("Allocating srv_session for path %s failed\n",
> -
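For readers less familiar with the pattern being removed: the deleted rnbd_dev_blk_io wrapper existed only so the completion handler could get back from the embedded struct bio to its per-request context via container_of() (which is also why the bio had to be the last member, for bioset front-padding). With the context pointer stored directly in bi_private, that indirection disappears. A standalone toy sketch of both schemes; the struct names mimic the patch but are local stand-ins, not kernel code:

#include <stddef.h>
#include <stdio.h>

/* Local re-definition of the kernel's container_of idiom, for illustration:
 * recover a pointer to the enclosing struct from a pointer to one member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct bio { void *bi_private; };   /* stand-in for struct bio */

struct blk_io {                     /* stand-in for rnbd_dev_blk_io */
    void *priv;
    struct bio bio;                 /* last member, as bioset front_pad requires */
};

static void end_io_wrapped(struct bio *bio)
{
    /* Old scheme: step from the embedded bio back to the wrapper. */
    struct blk_io *io = container_of(bio, struct blk_io, bio);
    printf("wrapped priv=%s\n", (char *)io->priv);
}

static void end_io_direct(struct bio *bio)
{
    /* New scheme: the context pointer lives directly in bi_private. */
    printf("direct priv=%s\n", (char *)bio->bi_private);
}

int main(void)
{
    struct blk_io io = { .priv = "ctx" };
    end_io_wrapped(&io.bio);

    struct bio b = { .bi_private = "ctx" };
    end_io_direct(&b);
    return 0;
}

Both handlers reach the same context; the patch's point is that once only priv is used, the wrapper, the bioset, and the container_of() hop are pure overhead.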
[xen-unstable test] 167806: tolerable FAIL - PUSHED
flight 167806 xen-unstable real [real]
flight 167807 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/167806/
http://logs.test-lab.xenproject.org/osstest/logs/167807/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemuu-debianhvm-amd64 20 guest-start/debianhvm.repeat fail pass in 167807-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds             18 guest-start/debian.repeat  fail like 167790
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop                 fail like 167802
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop                 fail like 167802
 test-amd64-amd64-qemuu-nested-amd    20 debian-hvm-install/l1/l2   fail like 167802
 test-amd64-i386-xl-qemut-ws16-amd64  19 guest-stop                 fail like 167802
 test-amd64-i386-xl-qemut-win7-amd64  19 guest-stop                 fail like 167802
 test-armhf-armhf-libvirt-raw         15 saverestore-support-check  fail like 167802
 test-amd64-i386-xl-qemuu-win7-amd64  19 guest-stop                 fail like 167802
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop                 fail like 167802
 test-armhf-armhf-libvirt             16 saverestore-support-check  fail like 167802
 test-amd64-i386-xl-qemuu-ws16-amd64  19 guest-stop                 fail like 167802
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop                 fail like 167802
 test-armhf-armhf-libvirt-qcow2       15 saverestore-support-check  fail like 167802
 test-arm64-arm64-xl-seattle          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle          16 saverestore-support-check  fail never pass
 test-amd64-i386-libvirt-xsm          15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm         15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt              15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt             15 migrate-support-check      fail never pass
 test-amd64-i386-xl-pvshim            14 guest-start                fail never pass
 test-arm64-arm64-xl-xsm              15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm              16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl                  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl                  16 saverestore-support-check  fail never pass
 test-arm64-arm64-libvirt-xsm         15 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm         16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit2          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit2          16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-thunderx         15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-thunderx         16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit1          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1          16 saverestore-support-check  fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-raw          14 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-vhd         14 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-raw         14 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-raw         15 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu        15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-multivcpu        16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-vhd              14 migrate-support-check      fail never pass
 test-arm64-arm64-xl-vhd              15 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck       15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck       16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-rtds             15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds             16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-vhd              14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd              15 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit1          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit1          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-arndale          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale          16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-raw         14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl                  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl                  16
Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver
Hi,

On 24/01/2022 19:06, Stefano Stabellini wrote:
> It looks like XEN_DOMCTL_host_node_by_path and
> XEN_DOMCTL_find_host_compatible_node would also solve the problem but I
> think that a single hypercall that retrieves the entire host DTB would
> be easier to implement

DOMCTL should only be used to handle per-domain information. If we want to,
we could create a new sub-hypercall of either __HYPERVISOR_platform_op or
__HYPERVISOR_sysctl_op (not sure which one).

AFAICT, both are versioned.

> and more robust in the long term.
>
> hypfs has the advantage that it would create an interface more similar
> to the one people are already used to on Linux systems
> (/proc/device-tree). xl/libxl would have to scan the whole hypfs tree,
> which intuitively I think would be slower.

Even if you have the binary blob, you would still have to scan the
device-tree. That said, it is probably going to be a bit faster because you
have fewer hypercalls.

However, here this is a trade-off between memory use and speed. If you want
speed, then you may have to transfer up to 2MB every time. So the question is:
do we care more about speed or memory usage?

> Also the feature might be harder to implement but I am not sure.
>
> I don't have a strong preference and this is not a stable interface (we
> don't have to be extra paranoid about forward and backward
> compatibility). So I am fine either way. Let's see what the others think
> as well.

My preference would be to use hypfs as this is cleaner than exposing a blob.

However, are we sure we can simply copy the content of the host Device-Tree
to the guest Device-Tree for SCMI? For instance, I know that for device
passthrough there are some properties that need to be altered for some
devices. Hence why it is not present. Although, I vaguely recall having
written a PoC; not sure if it was posted on the ML.

Cheers,

--
Julien Grall
Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver
On Mon, 24 Jan 2022, Oleksii Moisieiev wrote:
> On Fri, Jan 21, 2022 at 12:49:55PM -0800, Stefano Stabellini wrote:
> > On Fri, 21 Jan 2022, Oleksii Moisieiev wrote:
> > > On Thu, Jan 20, 2022 at 02:29:41PM -0800, Stefano Stabellini wrote:
> > > > On Thu, 20 Jan 2022, Oleksii Moisieiev wrote:
> > > > > On Wed, Jan 19, 2022 at 05:28:21PM -0800, Stefano Stabellini wrote:
> > > > > > On Wed, 19 Jan 2022, Oleksii Moisieiev wrote:
> > > > > > > On Wed, Dec 22, 2021 at 06:23:24PM -0800, Stefano Stabellini wrote:
> > > > > > > > On Wed, 22 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > On Tue, Dec 21, 2021 at 01:22:50PM -0800, Stefano Stabellini wrote:
> > > > > > > > > > On Tue, 21 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > Hi Stefano,
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 20, 2021 at 04:52:01PM -0800, Stefano Stabellini wrote:
> > > > > > > > > > > > On Mon, 20 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > Hi Stefano,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Dec 17, 2021 at 06:14:55PM -0800, Stefano Stabellini wrote:
> > > > > > > > > > > > > > On Tue, 14 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > > > This is the implementation of SCI interface, called SCMI-SMC driver,
> > > > > > > > > > > > > > > which works as the mediator between XEN Domains and Firmware (SCP, ATF etc).
> > > > > > > > > > > > > > > This allows devices from the Domains to work with clocks, resets and
> > > > > > > > > > > > > > > power-domains without access to CPG.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The following features are implemented:
> > > > > > > > > > > > > > > - request SCMI channels from ATF and pass channels to Domains;
> > > > > > > > > > > > > > > - set device permissions for Domains based on the Domain partial
> > > > > > > > > > > > > > > device-tree. Devices with permissions are able to work with clocks,
> > > > > > > > > > > > > > > resets and power-domains via SCMI;
> > > > > > > > > > > > > > > - redirect scmi messages from Domains to ATF.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Oleksii Moisieiev
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > xen/arch/arm/Kconfig | 2 +
> > > > > > > > > > > > > > > xen/arch/arm/sci/Kconfig | 10 +
> > > > > > > > > > > > > > > xen/arch/arm/sci/Makefile | 1 +
> > > > > > > > > > > > > > > xen/arch/arm/sci/scmi_smc.c | 795 ++
> > > > > > > > > > > > > > > xen/include/public/arch-arm.h | 1 +
> > > > > > > > > > > > > > > 5 files changed, 809 insertions(+)
> > > > > > > > > > > > > > > create mode 100644 xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > > create mode 100644 xen/arch/arm/sci/scmi_smc.c
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > index 186e1db389..02d96c6cfc 100644
> > > > > > > > > > > > > > > --- a/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > +++ b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > > @@ -114,6 +114,8 @@ config SCI
> > > > > > > > > > > > > > > support. It allows guests to control system resourcess via one of
> > > > > > > > > > > > > > > SCI mediators implemented in XEN.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +source "arch/arm/sci/Kconfig"
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > endmenu
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > menu "ARM errata workaround via the alternative framework"
> > > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Kconfig b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > > new file mode 100644
> > > > > > > > > > > > > > > index 00..9563067ddc
> > > > > > > > > > > > > > > --- /dev/null
> > > > > > > > > > > > > > > +++ b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > > @@ -0,0 +1,10 @@
> > > > > > > > > > > > > > > +config SCMI_SMC
> > > > > > > > > > > > > > > +	bool "Enable SCMI-SMC mediator driver"
> > > > > > > > > > > > > > > +	default n
> > > > > > > > > > > > > > > +	depends on SCI
> > > > > > > > > > > > > > > +	---help---
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +	Enables mediator in XEN to pass SCMI requests from Domains to ATF.
> > > > > > > > > > > > > > > +	This feature allows drivers from Domains to work with System
> > > > > > > > > > > > > > > +	Controllers
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
On Mon, 24 Jan 2022, Ayan Kumar Halder wrote:
> Hi Andre,
>
> Thanks for your comments.
>
> On 24/01/2022 14:36, Andre Przywara wrote:
> > On Mon, 24 Jan 2022 12:07:42 +
> > Ayan Kumar Halder wrote:
> >
> > Hi Ayan,
> >
> > > Many thanks for your feedback. I have one clarification :-
> > >
> > > On 22/01/2022 01:30, Andre Przywara wrote:
> > > > On Thu, 20 Jan 2022 21:55:27 +
> > > > Ayan Kumar Halder wrote:
> > > >
> > > > Hi,
> > > >
> > > > > At the moment, Xen is only handling data aborts with valid syndrome (i.e.
> > > > > ISV=0). Unfortunately, this doesn't cover all the instructions a domain
> > > > > could use to access MMIO regions.
> > > > >
> > > > > For instance, a baremetal OS can use any of the following instructions,
> > > > > where x1 contains the address of the MMIO region:
> > > > >
> > > > > 1. ldr x2, [x1], #4
> > > > That looks dodgy, since it misaligns the pointer afterwards. MMIO
> > > > accesses typically go to device memory, which must be naturally aligned.
> > > > Just don't give a bad example here and change that to a multiple of 8.
> > > >
> > > > > 2. ldr w2, [x1], #-4
> > > > (this one is fine, btw, because it's a 32-bit read)
> > > >
> > > > > 3. ldr x2, [x1], #-8
> > > > > 4. ldr w2, [x1], #4
> > > > > 5. ldrh w2, [x1], #8
> > > > > 6. ldrb w2, [x1], #16
> > > > More naturally I'd use the data size as the post-index value ...
> > > > ldr x2 ... #-8
> > > > ldr w2 ... #4
> > > > ldrh w2 ... #2
> > > > ldrb w2 ... #1
> > > >
> > > > > 7. str x2, [x1], #4
> > > > This is again problematic, because x1 is not 8-byte aligned anymore
> > > > after that.
> > > >
> > > > > 8. str w2, [x1], #-4
> > > > > 9. strh w2, [x1], #8
> > > > > 10. strb w2, [x1], #16
> > > > >
> > > > > In the following two instructions, sp contains the address of the MMIO
> > > > > region:-
> > > > Really? I don't think you should give people funny ideas, just mention
> > > > that the Rn register could theoretically be the stack pointer.
> > > >
> > > > > 11. ldrb w2, [sp], #16
> > > > > 12. ldrb wzr, [sp], #16
> > > > >
> > > > > In order to handle post-indexing store/load instructions (like those
> > > > > mentioned above), Xen will need to fetch and decode the instruction.
> > > > >
> > > > > This patch only covers post-index store/load instructions from AArch64
> > > > > mode. For now, this is left unimplemented for traps from AArch32 mode.
> > > > >
> > > > > Signed-off-by: Ayan Kumar Halder
> > > > > ---
> > > > >
> > > > > Changelog :-
> > > > > v2 - 1. Updated the rn register after reading from it. (Pointed by Julien, Stefano)
> > > > >      2. Used a union to represent the instruction opcode (Suggested by Bertrand)
> > > > >      3. Fixed coding style issues (Pointed by Julien)
> > > > >      4. In the previous patch, I was updating dabt->sign based on the signedness
> > > > >         of imm9. This was incorrect. As mentioned in ARMv8 ARM DDI 0487G.b,
> > > > >         Page 3221, SSE indicates the signedness of the data item loaded. In our
> > > > >         case, the data item loaded is always unsigned.
> > > > >
> > > > > v3 - 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants).
> > > > >         Thus, I have removed the check for "instr->code.opc == 0" (Suggested by Andre)
> > > > >      2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre)
> > > > >      3. Added restriction for "rt != rn" (Suggested by Andre)
> > > > >      4. Moved union ldr_str_instr_class {} to decode.h. This is the header included
> > > > >         by io.c and decode.c (where the union is referred). (Suggested by Jan)
> > > > >      5. Indentation and typo fixes (Suggested by Jan)
> > > > >
> > > > > Changes suggested but could not be considered due to reasons :-
> > > > > 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". (Andre)
> > > > >    Reason - I could not find a simple way to represent a 9-bit signed integer
> > > > >    (ie imm9) without using bitfields. If I use accessor macros, then I need
> > > > >    to manually calculate the two's complement to obtain the value when the
> > > > >    sign bit is present.
> > > > >
> > > > > 2. I/D cache coherency (Andre)
> > > > >    Reason :- I could not see any instruction to flush the I cache.
> > > > First, please try to avoid the term "flush", because it is somewhat
> > > > overloaded. The architecture speaks of "clean" and "invalidate", which
> > > > are more precise.
> > > > Assuming you mean "clean" here:
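On the imm9 point raised in the changelog: a signed bitfield is one option, but a 9-bit two's-complement field can also be sign-extended portably without bitfields or manual case analysis. A standalone sketch follows; the bit position (imm9 at bits [20:12]) matches the A64 post-index load/store encoding, while the helper names and the toy opcode word are mine, for illustration only:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Portable sign extension of a 9-bit two's-complement field (imm9),
 * without signed bitfields: XOR with the sign bit, then subtract it.
 * (x ^ 0x100) - 0x100 maps 0x1FF -> -1 and 0x100 -> -256. */
static int32_t sext_imm9(uint32_t raw)
{
    raw &= 0x1FF;                       /* keep the low 9 bits */
    return (int32_t)((raw ^ 0x100) - 0x100);
}

/* imm9 occupies bits [20:12] of an A64 post-index ldr/str opcode. */
static int32_t imm9_from_opcode(uint32_t insn)
{
    return sext_imm9((insn >> 12) & 0x1FF);
}

int main(void)
{
    assert(sext_imm9(0x004) ==    4);   /* small positive */
    assert(sext_imm9(0x1FC) ==   -4);   /* -4 in 9-bit two's complement */
    assert(sext_imm9(0x100) == -256);   /* most negative value */
    assert(sext_imm9(0x0FF) ==  255);   /* most positive value */

    /* Toy opcode word with imm9 = -8 (0x1F8) planted at bits [20:12];
     * the other bits are arbitrary, not a full instruction encoding. */
    uint32_t insn = 0xF8000000u | (0x1F8u << 12);
    printf("imm9 = %d\n", imm9_from_opcode(insn));
    return 0;
}

This avoids the implementation-defined layout of bitfields entirely, at the cost of one extractor function per field.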
Re: [RFC v1 3/5] xen/arm: introduce SCMI-SMC mediator driver
On Fri, Jan 21, 2022 at 12:49:55PM -0800, Stefano Stabellini wrote:
> On Fri, 21 Jan 2022, Oleksii Moisieiev wrote:
> > On Thu, Jan 20, 2022 at 02:29:41PM -0800, Stefano Stabellini wrote:
> > > On Thu, 20 Jan 2022, Oleksii Moisieiev wrote:
> > > > On Wed, Jan 19, 2022 at 05:28:21PM -0800, Stefano Stabellini wrote:
> > > > > On Wed, 19 Jan 2022, Oleksii Moisieiev wrote:
> > > > > > On Wed, Dec 22, 2021 at 06:23:24PM -0800, Stefano Stabellini wrote:
> > > > > > > On Wed, 22 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > On Tue, Dec 21, 2021 at 01:22:50PM -0800, Stefano Stabellini wrote:
> > > > > > > > > On Tue, 21 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > Hi Stefano,
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 20, 2021 at 04:52:01PM -0800, Stefano Stabellini wrote:
> > > > > > > > > > > On Mon, 20 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > Hi Stefano,
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Dec 17, 2021 at 06:14:55PM -0800, Stefano Stabellini wrote:
> > > > > > > > > > > > > On Tue, 14 Dec 2021, Oleksii Moisieiev wrote:
> > > > > > > > > > > > > > This is the implementation of SCI interface, called SCMI-SMC driver,
> > > > > > > > > > > > > > which works as the mediator between XEN Domains and Firmware (SCP, ATF etc).
> > > > > > > > > > > > > > This allows devices from the Domains to work with clocks, resets and
> > > > > > > > > > > > > > power-domains without access to CPG.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The following features are implemented:
> > > > > > > > > > > > > > - request SCMI channels from ATF and pass channels to Domains;
> > > > > > > > > > > > > > - set device permissions for Domains based on the Domain partial
> > > > > > > > > > > > > > device-tree. Devices with permissions are able to work with clocks,
> > > > > > > > > > > > > > resets and power-domains via SCMI;
> > > > > > > > > > > > > > - redirect scmi messages from Domains to ATF.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Oleksii Moisieiev
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > xen/arch/arm/Kconfig | 2 +
> > > > > > > > > > > > > > xen/arch/arm/sci/Kconfig | 10 +
> > > > > > > > > > > > > > xen/arch/arm/sci/Makefile | 1 +
> > > > > > > > > > > > > > xen/arch/arm/sci/scmi_smc.c | 795 ++
> > > > > > > > > > > > > > xen/include/public/arch-arm.h | 1 +
> > > > > > > > > > > > > > 5 files changed, 809 insertions(+)
> > > > > > > > > > > > > > create mode 100644 xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > create mode 100644 xen/arch/arm/sci/scmi_smc.c
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > index 186e1db389..02d96c6cfc 100644
> > > > > > > > > > > > > > --- a/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > +++ b/xen/arch/arm/Kconfig
> > > > > > > > > > > > > > @@ -114,6 +114,8 @@ config SCI
> > > > > > > > > > > > > > support. It allows guests to control system resourcess via one of
> > > > > > > > > > > > > > SCI mediators implemented in XEN.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +source "arch/arm/sci/Kconfig"
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > endmenu
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > menu "ARM errata workaround via the alternative framework"
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Kconfig b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > new file mode 100644
> > > > > > > > > > > > > > index 00..9563067ddc
> > > > > > > > > > > > > > --- /dev/null
> > > > > > > > > > > > > > +++ b/xen/arch/arm/sci/Kconfig
> > > > > > > > > > > > > > @@ -0,0 +1,10 @@
> > > > > > > > > > > > > > +config SCMI_SMC
> > > > > > > > > > > > > > +	bool "Enable SCMI-SMC mediator driver"
> > > > > > > > > > > > > > +	default n
> > > > > > > > > > > > > > +	depends on SCI
> > > > > > > > > > > > > > +	---help---
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +	Enables mediator in XEN to pass SCMI requests from Domains to ATF.
> > > > > > > > > > > > > > +	This feature allows drivers from Domains to work with System
> > > > > > > > > > > > > > +	Controllers (such as power,resets,clock etc.). SCP is used as transport
> > > > > > > > > > > > > > +	for communication.
> > > > > > > > > > > > > > diff --git a/xen/arch/arm/sci/Makefile
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
Hi, On 24/01/2022 17:27, Ayan Kumar Halder wrote: Thanks for your comments. On 24/01/2022 14:36, Andre Przywara wrote: On Mon, 24 Jan 2022 12:07:42 + Ayan Kumar Halder wrote: Hi Ayan, Many thanks for your feedback. I have one clarification :- On 22/01/2022 01:30, Andre Przywara wrote: On Thu, 20 Jan 2022 21:55:27 + Ayan Kumar Halder wrote: Hi, At the moment, Xen is only handling data abort with valid syndrome (i.e. ISV=0). Unfortunately, this doesn't cover all the instructions a domain could use to access MMIO regions. For instance, a baremetal OS can use any of the following instructions, where x1 contains the address of the MMIO region: 1. ldr x2, [x1], #4 That looks dodgy, since it misaligns the pointer afterwards. MMIO accesses typically go to device memory, which must be naturally aligned. Just don't give a bad example here and change that to a multiple of 8. 2. ldr w2, [x1], #-4 (this one is fine, btw, because it's a 32-bit read) 3. ldr x2, [x1], #-8 4. ldr w2, [x1], #4 5. ldrh w2, [x1], #8 6. ldrb w2, [x1], #16 More naturally I'd use the data size of the postindex value ... ldr x2 ... #-8 ldr w2 ... #4 ldrh w2 ... #2 ldrb w2 ... #1 7. str x2, [x1], #4 This is again problematic, because x1 is not 8-byte aligned anymore after that. 8. str w2, [x1], #-4 9. strh w2, [x1], #8 10. strb w2, [x1], #16 In the following two instructions, sp contains the address of the MMIO region:- Really? I don't think you should give people funny ideas, just mention that the Rn register could theoretically be the stack pointer. 11. ldrb w2, [sp], #16 12. ldrb wzr, [sp], #16 In order to handle post-indexing store/load instructions (like those mentioned above), Xen will need to fetch and decode the instruction. This patch only covers post-index store/load instructions from AArch64 mode. For now, this is left unimplemented for traps from AArch32 mode. Signed-off-by: Ayan Kumar Halder --- Changelog :- v2 - 1. Updated the rn register after reading from it. (Pointed by Julien, Stefano) 2. 
Used a union to represent the instruction opcode (Suggested by Bertrand) 3. Fixed coding style issues (Pointed by Julien) 4. In the previous patch, I was updating dabt->sign based on the signedness of imm9. This was incorrect. As mentioned in ARMv8 ARM DDI 0487G.b, Page 3221, SSE indicates the signedness of the data item loaded. In our case, the data item loaded is always unsigned. v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants). Thus, I have removed the check for "instr->code.opc == 0" (Suggested by Andre) 2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre) 3. Added restriction for "rt != rn" (Suggested by Andre) 4. Moved union ldr_str_instr_class {} to decode.h. This is the header included by io.c and decode.c (where the union is referred). (Suggested by Jan) 5. Indentation and typo fixes (Suggested by Jan) Changes suggested but could not be considered due to reasons :- 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". (Andre) Reason - I could not find a simple way to represent a 9-bit signed integer (ie imm9) without using bitfields. If I use accessor macros, then I need to manually calculate two's complement to obtain the value when the sign bit is set. 2. I/D cache coherency (Andre) Reason :- I could not see any instruction to flush the I cache. First, please try to avoid the term "flush", because it is somewhat overloaded. The architecture speaks of "clean" and "invalidate", which are more precise. Assuming you mean "clean" here: conceptually there is no such thing for the I cache, because it's always clean. The I$ will only be read from the CPU side - from the instruction fetcher - there is nothing written back through it. Every store goes through the data path - always. That is the problem that I tried to sketch for you previously: you don't have a guarantee that the instruction you read from memory is the same that the CPU executed. 
The guest could have changed the instruction after the I$ fetched that. So the CPU will execute (and trap) on instruction X, but you will read Y. I leave it up to your imagination if that could be exploited. I see what you mean. Referring to Armv8 ARM DDI 0487G.b, Page D1-2476, it says that (for instr/data abort) the faulting virtual address and IPA are saved in FAR_ELx and HPFAR_EL2 respectively. But, I do not see the faulting instruction being saved in any special register. Is there something I am missing ? No, indeed there is no such thing. You get the address, but not the faulting instruction. It would indeed be nice to have from a
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
Hi Andre, On 24/01/2022 14:36, Andre Przywara wrote: On Mon, 24 Jan 2022 12:07:42 + Also, if an instruction is being modified by the guest (after it has been loaded in the I cache), and if the guest does not invalidate the I cache + ISB, then this is a malicious behavior by the guest. Is my understanding correct ? I wouldn't say malicious per se, there might be legitimate reasons to do so, but in the Xen context this is mostly irrelevant, since we don't trust the guest anyway. So whether it's malicious or accidental, the hypervisor might be misled. I agree the hypervisor will be misled into executing the wrong instruction. But, in reality, I don't see how this is a massive problem as this thread seems to imply. At best the guest will shoot itself in the foot. IOW, for now, I think it is fine to assume that the guest will have invalidated the instruction cache before executing any instruction that may fault with ISV=0. This could be revisited if we have use-cases where we really need to know what the guest executed. Cheers, -- Julien Grall
Re: [PATCH v4 05/11] xen/arm: introduce direct-map for domUs
Hi, On 13/01/2022 22:53, Stefano Stabellini wrote: +kinfo->mem.nr_banks = nr_banks; + +/* + * The property 'memory' should match the amount of memory given to + * the guest. + * Currently, it is only possible to either acquire static memory or + * let Xen allocate. *Mixing* is not supported. + */ +if ( kinfo->unassigned_mem != 0 ) +{ +printk(XENLOG_ERR + "Size of \"memory\" property doesn't match up with the sum-up of \"xen,static-mem\". Unsupported configuration.\n"); This line would benefit from being broken down, but I am also OK if we leave it as is. We usually keep the message in a single line (even if it is more than 80 characters) because it helps to find the line afterwards. Looking at the message, I would drop "Unsupported configuration" because it implies that this is because some code is missing (IOW it will be supported in the future). However, this is a requirement. Cheers, -- Julien Grall
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
Hi Andre, Thanks for your comments. On 24/01/2022 14:36, Andre Przywara wrote: On Mon, 24 Jan 2022 12:07:42 + Ayan Kumar Halder wrote: Hi Ayan, Many thanks for your feedback. I have one clarification :- On 22/01/2022 01:30, Andre Przywara wrote: On Thu, 20 Jan 2022 21:55:27 + Ayan Kumar Halder wrote: Hi, At the moment, Xen is only handling data abort with valid syndrome (i.e. ISV=0). Unfortunately, this doesn't cover all the instructions a domain could use to access MMIO regions. For instance, a baremetal OS can use any of the following instructions, where x1 contains the address of the MMIO region: 1. ldr x2, [x1], #4 That looks dodgy, since it misaligns the pointer afterwards. MMIO accesses typically go to device memory, which must be naturally aligned. Just don't give a bad example here and change that to a multiple of 8. 2. ldr w2, [x1], #-4 (this one is fine, btw, because it's a 32-bit read) 3. ldr x2, [x1], #-8 4. ldr w2, [x1], #4 5. ldrh w2, [x1], #8 6. ldrb w2, [x1], #16 More naturally I'd use the data size of the postindex value ... ldr x2 ... #-8 ldr w2 ... #4 ldrh w2 ... #2 ldrb w2 ... #1 7. str x2, [x1], #4 This is again problematic, because x1 is not 8-byte aligned anymore after that. 8. str w2, [x1], #-4 9. strh w2, [x1], #8 10. strb w2, [x1], #16 In the following two instructions, sp contains the address of the MMIO region:- Really? I don't think you should give people funny ideas, just mention that the Rn register could theoretically be the stack pointer. 11. ldrb w2, [sp], #16 12. ldrb wzr, [sp], #16 In order to handle post-indexing store/load instructions (like those mentioned above), Xen will need to fetch and decode the instruction. This patch only covers post-index store/load instructions from AArch64 mode. For now, this is left unimplemented for traps from AArch32 mode. Signed-off-by: Ayan Kumar Halder --- Changelog :- v2 - 1. Updated the rn register after reading from it. (Pointed by Julien, Stefano) 2. 
Used a union to represent the instruction opcode (Suggested by Bertrand) 3. Fixed coding style issues (Pointed by Julien) 4. In the previous patch, I was updating dabt->sign based on the signedness of imm9. This was incorrect. As mentioned in ARMv8 ARM DDI 0487G.b, Page 3221, SSE indicates the signedness of the data item loaded. In our case, the data item loaded is always unsigned. v3- 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants). Thus, I have removed the check for "instr->code.opc == 0" (Suggested by Andre) 2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre) 3. Added restriction for "rt != rn" (Suggested by Andre) 4. Moved union ldr_str_instr_class {} to decode.h. This is the header included by io.c and decode.c (where the union is referred). (Suggested by Jan) 5. Indentation and typo fixes (Suggested by Jan) Changes suggested but could not be considered due to reasons :- 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". (Andre) Reason - I could not find a simple way to represent a 9-bit signed integer (ie imm9) without using bitfields. If I use accessor macros, then I need to manually calculate two's complement to obtain the value when the sign bit is set. 2. I/D cache coherency (Andre) Reason :- I could not see any instruction to flush the I cache. First, please try to avoid the term "flush", because it is somewhat overloaded. The architecture speaks of "clean" and "invalidate", which are more precise. Assuming you mean "clean" here: conceptually there is no such thing for the I cache, because it's always clean. The I$ will only be read from the CPU side - from the instruction fetcher - there is nothing written back through it. Every store goes through the data path - always. That is the problem that I tried to sketch for you previously: you don't have a guarantee that the instruction you read from memory is the same that the CPU executed. 
The guest could have changed the instruction after the I$ fetched that. So the CPU will execute (and trap) on instruction X, but you will read Y. I leave it up to your imagination if that could be exploited. I see what you mean. Referring to Armv8 ARM DDI 0487G.b, Page D1-2476, it says that (for instr/data abort) the faulting virtual address and IPA are saved in FAR_ELx and HPFAR_EL2 respectively. But, I do not see the faulting instruction being saved in any special register. Is there something I am missing ? No, indeed there is no such thing. You get the address, but not the faulting instruction. It would indeed be nice to have from a software developer's point of view, but the
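The changelog above argues that a 9-bit signed field is hard to extract without bitfields; in practice a shift pair does the sign extension. A minimal, stand-alone sketch of the accessor-macro alternative — the macro and function names are hypothetical, with the field placed at bits [20:12] as in the AArch64 post-index ldr/str encoding:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical accessor-macro alternative to the bitfield union: imm9
 * occupies bits [20:12] of an AArch64 post-index ldr/str encoding, and
 * sign extension from bit 8 is a pair of shifts rather than a manual
 * two's-complement computation.
 */
#define LDR_STR_IMM9(insn) (((insn) >> 12) & 0x1ffu)

static int32_t imm9_sext(uint32_t insn)
{
    /* Move the 9-bit field to the top of a 32-bit word, then
     * arithmetic-shift it back down to sign-extend from bit 8. */
    return (int32_t)(LDR_STR_IMM9(insn) << 23) >> 23;
}
```

The same shift-pair idiom works for any narrow signed field, so the union could be replaced by a handful of such macros if bitfield layout portability were a concern.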
Re: [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map
On 23.09.2021 14:02, Wei Chen wrote: > We will reuse nodes_cover_memory for Arm to check its bootmem > info. So we introduce two arch helpers to get memory map's > entry number and specified entry's range: > arch_get_memory_bank_number > arch_get_memory_bank_range I'm sorry, but personally I see no way for you to introduce the term "memory bank" into x86 code. > --- a/xen/arch/x86/numa.c > +++ b/xen/arch/x86/numa.c > @@ -378,6 +378,24 @@ unsigned int arch_have_default_dmazone(void) > return ( num_online_nodes() > 1 ) ? 1 : 0; > } > > +uint32_t __init arch_meminfo_get_nr_bank(void) unsigned int (also elsewhere) > +{ > + return e820.nr_map; > +} > + > +int __init arch_meminfo_get_ram_bank_range(uint32_t bank, > + paddr_t *start, paddr_t *end) > +{ > + if (e820.map[bank].type != E820_RAM || !start || !end) { I see no reason for the checking of start and end. > + return -1; > + } No need for braces here. > + *start = e820.map[bank].addr; > + *end = e820.map[bank].addr + e820.map[bank].size; > + > + return 0; > +} > + > static void dump_numa(unsigned char key) > { > s_time_t now = NOW(); Judging by just the two pieces of patch context you're inserting a Linux-style code fragment in the middle of two Xen-style ones. Various other comments given for earlier patches apply here as well. Jan
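Folding in the review comments (plain unsigned int instead of uint32_t, no redundant checks of start/end, no braces around a single statement) gives roughly the following sketch. The stand-in e820 array is only there to keep the example self-contained and compilable; Xen's real map lives in asm/e820.h:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;

#define E820_RAM 1

/* Stand-in for Xen's e820 map, so the sketch compiles on its own. */
static struct { uint64_t addr, size; uint32_t type; } e820_map[] = {
    { 0x1000, 0x2000, E820_RAM },
    { 0x3000, 0x1000, 2 /* reserved */ },
};

/* Hedged sketch of the two helpers reworked along the review. */
unsigned int arch_meminfo_get_nr_bank(void)
{
    return sizeof(e820_map) / sizeof(e820_map[0]);
}

int arch_meminfo_get_ram_bank_range(unsigned int bank,
                                    paddr_t *start, paddr_t *end)
{
    if ( e820_map[bank].type != E820_RAM )
        return -1;

    *start = e820_map[bank].addr;
    *end = e820_map[bank].addr + e820_map[bank].size;

    return 0;
}
```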
Re: [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init
On 23.09.2021 14:02, Wei Chen wrote: > There is some code in acpi_numa_memory_affinity_init to update node > memory range and update node_memblk_range array. This code is not > ACPI specific, it can be shared by other NUMA implementation, like > device tree based NUMA implementation. > > So in this patch, we abstract this memory range and blocks relative > code to a new function. This will avoid exporting static variables > like node_memblk_range. And the PXM in neutral code print messages > have been replaced by NODE, as PXM is ACPI specific. > > Signed-off-by: Wei Chen SRAT is an ACPI concept, which I assume has no meaning with DT. Hence any generically usable logic here wants, I think, separating out into a file which is not SRAT-specific (peeking ahead, specifically not a file named "numa_srat.c"). This may in turn require some more thought regarding the proper split between the stuff remaining in srat.c and the stuff becoming kind of library code. In particular this may mean moving some of the static variables as well, and with them perhaps some further functions (while I did peek ahead, I didn't look closely at the later patch doing the actual movement). And it is then hard to see why the separation needs to happen in two steps - you could move the generically usable code to a new file right away. > --- a/xen/arch/x86/srat.c > +++ b/xen/arch/x86/srat.c > @@ -104,6 +104,14 @@ nodeid_t setup_node(unsigned pxm) > return node; > } > > +bool __init numa_memblks_available(void) > +{ > + if (num_node_memblks < NR_NODE_MEMBLKS) > + return true; > + > + return false; > +} Please can you avoid expressing things in more complex than necessary ways? 
Here I don't see why it can't just be bool __init numa_memblks_available(void) { return num_node_memblks < NR_NODE_MEMBLKS; } > @@ -301,69 +309,35 @@ static bool __init is_node_memory_continuous(nodeid_t > nid, > return true; > } > > -/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */ > -void __init > -acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma) > +/* Neutral NUMA memory affinity init function for ACPI and DT */ > +int __init numa_update_node_memblks(nodeid_t node, > + paddr_t start, paddr_t size, bool hotplug) Indentation. > { > - paddr_t start, end; > - unsigned pxm; > - nodeid_t node; > + paddr_t end = start + size; > int i; > > - if (srat_disabled()) > - return; > - if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) { > - bad_srat(); > - return; > - } > - if (!(ma->flags & ACPI_SRAT_MEM_ENABLED)) > - return; > - > - start = ma->base_address; > - end = start + ma->length; > - /* Supplement the heuristics in l1tf_calculations(). */ > - l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE)); > - > - if (num_node_memblks >= NR_NODE_MEMBLKS) > - { > - dprintk(XENLOG_WARNING, > -"Too many numa entry, try bigger NR_NODE_MEMBLKS \n"); > - bad_srat(); > - return; > - } > - > - pxm = ma->proximity_domain; > - if (srat_rev < 2) > - pxm &= 0xff; > - node = setup_node(pxm); > - if (node == NUMA_NO_NODE) { > - bad_srat(); > - return; > - } > - /* It is fine to add this area to the nodes data it will be used later*/ > + /* It is fine to add this area to the nodes data it will be used later > */ > i = conflicting_memblks(start, end); > if (i < 0) > /* everything fine */; > else if (memblk_nodeid[i] == node) { > - bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) != > - !test_bit(i, memblk_hotplug); > + bool mismatch = !hotplug != !test_bit(i, memblk_hotplug); > > - printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with > itself (%"PRIpaddr"-%"PRIpaddr")\n", > -mismatch ? 
KERN_ERR : KERN_WARNING, pxm, start, end, > + printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with > itself (%"PRIpaddr"-%"PRIpaddr")\n", Nit: Unlike PXM, which is an acronym, "node" doesn't want to be all upper case. Also did you check that the node <-> PXM association is known to a reader of a log at this point in time? > +mismatch ? KERN_ERR : KERN_WARNING, node, start, end, > node_memblk_range[i].start, node_memblk_range[i].end); > if (mismatch) { > - bad_srat(); > - return; > + return -1; > } > } else { > printk(KERN_ERR > -"SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with > PXM %u (%"PRIpaddr"-%"PRIpaddr")\n", > -pxm, start, end, node_to_pxm(memblk_nodeid[i]), > +
Re: [PATCH v2] xen-mapcache: Avoid entry->lock overflow
On Mon, 24 Jan 2022, Ross Lagerwall wrote: > In some cases, a particular mapcache entry may be mapped 256 times > causing the lock field to wrap to 0. For example, this may happen when > using emulated NVME and the guest submits a large scatter-gather write. > At this point, the entry may be remapped causing QEMU to write the wrong > data or crash (since remap is not atomic). > > Avoid this overflow by increasing the lock field to a uint32_t and also > detect it and abort rather than continuing regardless. > > Signed-off-by: Ross Lagerwall Reviewed-by: Stefano Stabellini > --- > Changes in v2: Change type to uint32_t since there is a hole there > anyway. The struct size remains at 48 bytes on x86_64. > > hw/i386/xen/xen-mapcache.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c > index bd47c3d672..f2ef977963 100644 > --- a/hw/i386/xen/xen-mapcache.c > +++ b/hw/i386/xen/xen-mapcache.c > @@ -52,7 +52,7 @@ typedef struct MapCacheEntry { > hwaddr paddr_index; > uint8_t *vaddr_base; > unsigned long *valid_mapping; > -uint8_t lock; > +uint32_t lock; > #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0) > uint8_t flags; > hwaddr size; > @@ -355,6 +355,12 @@ tryagain: > if (lock) { > MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev)); > entry->lock++; > +if (entry->lock == 0) { > +fprintf(stderr, > +"mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n", > +entry->paddr_index, entry->vaddr_base); > +abort(); > +} > reventry->dma = dma; > reventry->vaddr_req = mapcache->last_entry->vaddr_base + > address_offset; > reventry->paddr_index = mapcache->last_entry->paddr_index; > -- > 2.27.0 >
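The wrap the patch guards against is easy to demonstrate in isolation. A small sketch (not QEMU code) showing why the old uint8_t counter is unsafe:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration of the overflow being fixed: with a uint8_t counter,
 * the 256th increment wraps the count to 0, so the mapcache entry
 * would look unlocked and could be remapped underneath its users.
 */
static uint8_t lock_after(unsigned int n_maps)
{
    uint8_t lock = 0;   /* the old type of MapCacheEntry.lock */

    while ( n_maps-- )
        lock++;         /* what entry->lock++ does per mapping */

    return lock;
}
```

Widening the field to uint32_t pushes the wrap far out of reach, and the added abort() turns any remaining overflow into a loud failure instead of silent corruption.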
Re: [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug
On 23.09.2021 14:02, Wei Chen wrote: > --- a/xen/arch/x86/srat.c > +++ b/xen/arch/x86/srat.c > @@ -391,8 +391,8 @@ acpi_numa_memory_affinity_init(const struct > acpi_srat_mem_affinity *ma) > memblk_nodeid[num_node_memblks] = node; > if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) { > __set_bit(num_node_memblks, memblk_hotplug); > - if (end > mem_hotplug) > - mem_hotplug = end; > + if (end > mem_hotplug_boundary()) > + mem_hotplug_update_boundary(end); Can the if() please be folded into mem_hotplug_update_boundary(), eliminating (at least for the purpose here) the need for the separate mem_hotplug_boundary()? As said on the previous patch, I think the two want folding. Jan
Re: [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end
On 23.09.2021 14:02, Wei Chen wrote: > x86 provides a mem_hotplug to maintain the end of memory hotplug > end address. This variable can be accessed out of mm.c. We want > some code out of mm.c can be reused by other architectures without > memory hotplug ability. So in this patch, we introduce these two > helpers to replace mem_hotplug direct access. This will give the > ability to stub these two API. > > Signed-off-by: Wei Chen > --- > xen/include/asm-x86/mm.h | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h > index cb90527499..af2fc4b0cd 100644 > --- a/xen/include/asm-x86/mm.h > +++ b/xen/include/asm-x86/mm.h > @@ -475,6 +475,16 @@ static inline int get_page_and_type(struct page_info > *page, > > extern paddr_t mem_hotplug; > > +static inline void mem_hotplug_update_boundary(paddr_t end) > +{ > +mem_hotplug = end; > +} > + > +static inline paddr_t mem_hotplug_boundary(void) > +{ > +return mem_hotplug; > +} > + > > /** > * With shadow pagetables, the different kinds of address start > * to get get confusing. Imo for this to make sense you want to also use the new functions right away in the place(s) where the direct access(es) get(s) in your way. Jan
[PATCH v2] x86/pvh: fix population of the low 1MB for dom0
RMRRs are setup ahead of populating the p2m and hence the ASSERT when populating the low 1MB needs to be relaxed when it finds an existing entry: it's either RAM or a RMRR resulting from the IOMMU setup. Rework the logic a bit and introduce a local mfn variable in order to assert that if the gfn is populated and not RAM it is an identity map. Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné Reviewed-by: Jan Beulich --- Changes since v1: - Fix indentation. - Expand existing assert. --- xen/arch/x86/hvm/dom0_build.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c index b00e45885c..63dceb2116 100644 --- a/xen/arch/x86/hvm/dom0_build.c +++ b/xen/arch/x86/hvm/dom0_build.c @@ -466,11 +466,16 @@ static int __init pvh_populate_p2m(struct domain *d) for ( i = rc = 0; i < MB1_PAGES; ++i ) { p2m_type_t p2mt; +mfn_t mfn = get_gfn_query(d, i, &p2mt); -if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) ) +if ( mfn_eq(mfn, INVALID_MFN) ) rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K); else -ASSERT(p2mt == p2m_ram_rw); +/* + * If the p2m entry is already set it must belong to a RMRR and + * already be identity mapped, or be a RAM region. + */ +ASSERT(p2mt == p2m_ram_rw || mfn_eq(mfn, _mfn(i))); put_gfn(d, i); if ( rc ) { -- 2.34.1
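For reference, the invariant the expanded ASSERT() encodes can be modeled as a stand-alone predicate. Types and names are simplified here (Xen's real code uses mfn_t and p2m_type_t, not raw integers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_MFN      UINT64_MAX
#define p2m_ram_rw       1   /* stand-ins for Xen's p2m_type_t values */
#define p2m_mmio_direct  2

/*
 * Model of the relaxed low-1MB check: an already-populated gfn is
 * acceptable if it is RAM, or if it is identity mapped (gfn == mfn),
 * as an RMRR established before p2m population would be.
 */
static bool low_1mb_entry_ok(uint64_t gfn, uint64_t mfn, int p2mt)
{
    if ( mfn == INVALID_MFN )
        return true;   /* unpopulated: gets an identity MMIO mapping */

    return p2mt == p2m_ram_rw || mfn == gfn;
}
```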
Re: [PATCH] x86/pvh: fix population of the low 1MB for dom0
On Mon, Jan 24, 2022 at 05:01:04PM +0100, Jan Beulich wrote: > On 24.01.2022 16:23, Roger Pau Monne wrote: > > RMRRs are setup ahead of populating the p2m and hence the ASSERT when > > populating the low 1MB needs to be relaxed when it finds an existing > > entry: it's either RAM or a RMRR resulting from the IOMMU setup. > > > > Rework the logic a bit and introduce a local mfn variable in order to > > assert that if the gfn is populated and not RAM it is an identity map. > > > > Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 > > memory') > > Signed-off-by: Roger Pau Monné > > Reviewed-by: Jan Beulich > albeit ... > > > --- a/xen/arch/x86/hvm/dom0_build.c > > +++ b/xen/arch/x86/hvm/dom0_build.c > > @@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d) > > for ( i = rc = 0; i < MB1_PAGES; ++i ) > > { > > p2m_type_t p2mt; > > +mfn_t mfn; > > > > -if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) ) > > +mfn = get_gfn_query(d, i, &p2mt); > ... preferably with this becoming the initializer of the variable then and ... > > > +if ( mfn_eq(mfn, INVALID_MFN) ) > > rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K); > > -else > > -ASSERT(p2mt == p2m_ram_rw); > > +else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) ) > > +/* > > + * If the p2m entry is already set it must belong to a > > RMRR and > > + * already be identity mapped, or be a RAM region. > > + */ > > +ASSERT_UNREACHABLE(); > ... (not just preferably) indentation reduced by a level here. (I wonder > why you didn't simply extend the existing ASSERT() - ASSERT_UNREACHABLE() > is nice in certain cases, but the expression it logs is entirely unhelpful > for disambiguating the reason without looking at the source.) Indeed, that's better. Sorry about the indentation, not sure what my editor has done here. Thanks, Roger.
Re: [PATCH] x86/pvh: print dom0 memory map
On 24.01.2022 16:51, Roger Pau Monne wrote: > I find it useful for debugging certain issues to have the memory map > dom0 is using, so print it when using `dom0=verbose` on the command > line. > > Signed-off-by: Roger Pau Monné Reviewed-by: Jan Beulich with one further request: > --- a/xen/arch/x86/e820.c > +++ b/xen/arch/x86/e820.c > @@ -88,7 +88,7 @@ static void __init add_memory_region(unsigned long long > start, > e820.nr_map++; > } > > -static void __init print_e820_memory_map(struct e820entry *map, unsigned int > entries) > +void __init print_e820_memory_map(struct e820entry *map, unsigned int > entries) While making this accessible from the outside, could you please make "map" pointer-to-const? Jan
[PATCH] libxl: force netback to wait for hotplug execution before connecting
By writing an empty "hotplug-status" xenstore node in the backend path libxl can force Linux netback to wait for hotplug script execution before proceeding to the 'connected' state. This is required so that netback doesn't skip state 2 (InitWait) and thus blocks libxl waiting for such state in order to launch the hotplug script (see libxl__wait_device_connection). Reported-by: James Dingwall Signed-off-by: Roger Pau Monné Tested-by: James Dingwall --- Cc: Wei Liu Cc: Paul Durrant --- tools/libs/light/libxl_nic.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/libs/light/libxl_nic.c b/tools/libs/light/libxl_nic.c index 0b45469dca..0b9e70c9d1 100644 --- a/tools/libs/light/libxl_nic.c +++ b/tools/libs/light/libxl_nic.c @@ -248,6 +248,13 @@ static int libxl__set_xenstore_nic(libxl__gc *gc, uint32_t domid, flexarray_append(ro_front, "mtu"); flexarray_append(ro_front, GCSPRINTF("%u", nic->mtu)); +/* + * Force backend to wait for hotplug script execution before switching to + * connected state. + */ +flexarray_append(back, "hotplug-status"); +flexarray_append(back, ""); + return 0; } -- 2.34.1
Re: [PATCH] x86/pvh: fix population of the low 1MB for dom0
On 24.01.2022 16:23, Roger Pau Monne wrote: > RMRRs are setup ahead of populating the p2m and hence the ASSERT when > populating the low 1MB needs to be relaxed when it finds an existing > entry: it's either RAM or a RMRR resulting from the IOMMU setup. > > Rework the logic a bit and introduce a local mfn variable in order to > assert that if the gfn is populated and not RAM it is an identity map. > > Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 > memory') > Signed-off-by: Roger Pau Monné Reviewed-by: Jan Beulich albeit ... > --- a/xen/arch/x86/hvm/dom0_build.c > +++ b/xen/arch/x86/hvm/dom0_build.c > @@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d) > for ( i = rc = 0; i < MB1_PAGES; ++i ) > { > p2m_type_t p2mt; > +mfn_t mfn; > > -if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) ) > +mfn = get_gfn_query(d, i, &p2mt); ... preferably with this becoming the initializer of the variable then and ... > +if ( mfn_eq(mfn, INVALID_MFN) ) > rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K); > -else > -ASSERT(p2mt == p2m_ram_rw); > +else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) ) > +/* > + * If the p2m entry is already set it must belong to a RMRR > and > + * already be identity mapped, or be a RAM region. > + */ > +ASSERT_UNREACHABLE(); ... (not just preferably) indentation reduced by a level here. (I wonder why you didn't simply extend the existing ASSERT() - ASSERT_UNREACHABLE() is nice in certain cases, but the expression it logs is entirely unhelpful for disambiguating the reason without looking at the source.) Jan
[PATCH] x86/pvh: print dom0 memory map
I find it useful for debugging certain issues to have the memory map dom0 is using, so print it when using `dom0=verbose` on the command line. Signed-off-by: Roger Pau Monné --- xen/arch/x86/e820.c | 2 +- xen/arch/x86/hvm/dom0_build.c | 6 ++ xen/arch/x86/include/asm/e820.h | 1 + 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c index aa602773bb..82e9ac46a0 100644 --- a/xen/arch/x86/e820.c +++ b/xen/arch/x86/e820.c @@ -88,7 +88,7 @@ static void __init add_memory_region(unsigned long long start, e820.nr_map++; } -static void __init print_e820_memory_map(struct e820entry *map, unsigned int entries) +void __init print_e820_memory_map(struct e820entry *map, unsigned int entries) { unsigned int i; diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c index 739bb8adb6..93ebe4f404 100644 --- a/xen/arch/x86/hvm/dom0_build.c +++ b/xen/arch/x86/hvm/dom0_build.c @@ -1271,6 +1271,12 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image, return rc; } +if ( opt_dom0_verbose ) +{ +printk("Dom%u memory map:\n", d->domain_id); +print_e820_memory_map(d->arch.e820, d->arch.nr_e820); +} + printk("WARNING: PVH is an experimental mode with limited functionality\n"); return 0; } diff --git a/xen/arch/x86/include/asm/e820.h b/xen/arch/x86/include/asm/e820.h index 9d8f1ba960..7fcfde3b66 100644 --- a/xen/arch/x86/include/asm/e820.h +++ b/xen/arch/x86/include/asm/e820.h @@ -32,6 +32,7 @@ extern int e820_change_range_type( extern int e820_add_range( struct e820map *, uint64_t s, uint64_t e, uint32_t type); extern unsigned long init_e820(const char *, struct e820map *); +extern void print_e820_memory_map(struct e820entry *map, unsigned int entries); extern struct e820map e820; extern struct e820map e820_raw; -- 2.34.1
[PATCH] x86/pvh: fix population of the low 1MB for dom0
RMRRs are set up ahead of populating the p2m and hence the ASSERT when populating the low 1MB needs to be relaxed when it finds an existing entry: it's either RAM or an RMRR resulting from the IOMMU setup.

Rework the logic a bit and introduce a local mfn variable in order to assert that if the gfn is populated and not RAM it is an identity map.

Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory')
Signed-off-by: Roger Pau Monné
---
 xen/arch/x86/hvm/dom0_build.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b00e45885c..739bb8adb6 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -466,11 +466,17 @@ static int __init pvh_populate_p2m(struct domain *d)
     for ( i = rc = 0; i < MB1_PAGES; ++i )
     {
         p2m_type_t p2mt;
+        mfn_t mfn;

-        if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) )
+        mfn = get_gfn_query(d, i, &p2mt);
+        if ( mfn_eq(mfn, INVALID_MFN) )
             rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
-        else
-            ASSERT(p2mt == p2m_ram_rw);
+        else if ( p2mt != p2m_ram_rw && !mfn_eq(mfn, _mfn(i)) )
+            /*
+             * If the p2m entry is already set it must belong to a RMRR and
+             * already be identity mapped, or be a RAM region.
+             */
+            ASSERT_UNREACHABLE();

         put_gfn(d, i);
         if ( rc )
         {
--
2.34.1
Re: [PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons
On 20.01.2022 16:52, Roger Pau Monné wrote:
> On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote:
>> From: Artem Bityutskiy
>>
>> Enable local interrupts before requesting C1 on the last two generations of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. This decreases average C1 interrupt latency by about 5-10%, as measured with the 'wult' tool.
>>
>> The '->enter()' function of the driver enters C-states with local interrupts disabled by executing the 'monitor' and 'mwait' pair of instructions. If an interrupt happens, the CPU exits the C-state and continues executing instructions after 'mwait'. It does not jump to the interrupt handler, because local interrupts are disabled. The cpuidle subsystem enables interrupts a bit later, after doing some housekeeping.
>>
>> With this patch, we enable local interrupts before requesting C1. In this case, if the CPU wakes up because of an interrupt, it will jump to the interrupt handler right away. The cpuidle housekeeping will be done after the pending interrupt(s) are handled.
>>
>> Enabling interrupts before entering a C-state has measurable impact for faster C-states, like C1. Deeper, but slower C-states like C6 do not really benefit from this sort of change, because their latency is a lot higher compared to the delay added by cpuidle housekeeping.
>>
>> This change was also tested with cyclictest and dbench. In case of Ice Lake, the average cyclictest latency decreased by 5.1%, and the average 'dbench' throughput increased by about 0.8%. Both tests were run for 4 hours with only C1 enabled (all other idle states, including 'POLL', were disabled). CPU frequency was pinned to HFM, and uncore frequency was pinned to the maximum value. The other platforms had similar single-digit percentage improvements.
>>
>> It is worth noting that this patch affects 'cpuidle' statistics a tiny bit. Before this patch, C1 residency did not include the interrupt handling time, but with this patch, it will include it. This is similar to what happens in case of the 'POLL' state, which also runs with interrupts enabled.
>>
>> Suggested-by: Len Brown
>> Signed-off-by: Artem Bityutskiy
>> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3]
>>
>> We don't have a pointer into cpuidle_state_table[] readily available. To compensate, make use of the new flag only appearing for C1 and hence only in the first table entry.
>
> We could likely use the 8 padding bits in acpi_processor_cx between entry_method and method to store a flags field?

When looking there initially, I thought it wouldn't be good to add a field there. But looking again, I now don't see why I thought what I thought. (We actually have an 8-bit padding slot there and a 32-bit one.)

> It would seem more resilient, as I guess at some point we could enable interrupts for further states?

C1E maybe; I don't think anything beyond, for having higher wakeup latency anyway.

>> Unlike Linux we want to disable IRQs again after MWAITing, as subsequently invoked functions assume so.
>
> I'm also wondering whether there could be interrupts that rely on some of the housekeeping that's done when returning from mwait. I guess it's unlikely for an interrupt handler to have anything to do with the TSC not having been restored.

Actually this is a good point you make: We don't want to enable IRQs when cstate_restore_tsc() is not a no-op, or else we might confuse the time rendezvous. (I thought that I would remember TSC to stop only in deeper C-states, but maybe I'm mixing this up with the APIC timer.)

>> Signed-off-by: Jan Beulich
>
> Acked-by: Roger Pau Monné

Thanks, but as per above - not just yet.

Jan
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
On Mon, 24 Jan 2022 12:07:42 + Ayan Kumar Halder wrote:

Hi Ayan,

> Many thanks for your feedback. I have one clarification :-
>
> On 22/01/2022 01:30, Andre Przywara wrote:
> > On Thu, 20 Jan 2022 21:55:27 + Ayan Kumar Halder wrote:
> >
> > Hi,
> >
> >> At the moment, Xen is only handling data abort with valid syndrome (i.e. ISV=0). Unfortunately, this doesn't cover all the instructions a domain could use to access MMIO regions.
> >>
> >> For instance, a baremetal OS can use any of the following instructions, where x1 contains the address of the MMIO region:
> >>
> >> 1. ldr x2,[x1],#4
> >
> > That looks dodgy, since it misaligns the pointer afterwards. MMIO accesses typically go to device memory, which must be naturally aligned. Just don't give a bad example here and change that to a multiple of 8.
> >
> >> 2. ldr w2,[x1],#-4
> >
> > (this one is fine, btw, because it's a 32-bit read)
> >
> >> 3. ldr x2,[x1],#-8
> >> 4. ldr w2,[x1],#4
> >> 5. ldrh w2,[x1],#8
> >> 6. ldrb w2,[x1],#16
> >
> > More naturally I'd use the data size of the postindex value ...
> > ldr x2 ... #-8
> > ldr w2 ... #4
> > ldrh w2 ... #2
> > ldrb w2 ... #1
> >
> >> 7. str x2,[x1],#4
> >
> > This is again problematic, because x1 is not 8-byte aligned anymore after that.
> >
> >> 8. str w2,[x1],#-4
> >> 9. strh w2,[x1],#8
> >> 10. strb w2,[x1],#16
> >>
> >> In the following two instructions, sp contains the address of the MMIO region:-
> >
> > Really? I don't think you should give people funny ideas, just mention that the Rn register could theoretically be the stack pointer.
> >
> >> 11. ldrb w2,[sp],#16
> >> 12. ldrb wzr, [sp],#16
> >>
> >> In order to handle post-indexing store/load instructions (like those mentioned above), Xen will need to fetch and decode the instruction.
> >>
> >> This patch only covers post-index store/load instructions from AArch64 mode. For now, this is left unimplemented for traps from AArch32 mode.
> >>
> >> Signed-off-by: Ayan Kumar Halder
> >> ---
> >>
> >> Changelog :-
> >> v2 - 1. Updated the rn register after reading from it. (Pointed by Julien, Stefano)
> >>      2. Used a union to represent the instruction opcode (Suggested by Bertrand)
> >>      3. Fixed coding style issues (Pointed by Julien)
> >>      4. In the previous patch, I was updating dabt->sign based on the signedness of imm9. This was incorrect. As mentioned in ARMv8 ARM DDI 0487G.b, Page 3221, SSE indicates the signedness of the data item loaded. In our case, the data item loaded is always unsigned.
> >>
> >> v3 - 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants). Thus, I have removed the check for "instr->code.opc == 0" (Suggested by Andre)
> >>      2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre)
> >>      3. Added restriction for "rt != rn" (Suggested by Andre)
> >>      4. Moved union ldr_str_instr_class {} to decode.h. This is the header included by io.c and decode.c (where the union is referred). (Suggested by Jan)
> >>      5. Indentation and typo fixes (Suggested by Jan)
> >>
> >> Changes suggested but could not be considered due to reasons :-
> >> 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". (Andre)
> >>    Reason - I could not find a simple way to represent a 9 bit signed integer (ie imm9) without using bitfields. If I use accessor macros, then I need to manually calculate two's complement to obtain the value when the sign bit is present.
> >>
> >> 2. I/D cache coherency (Andre)
> >>    Reason :- I could not see any instruction to flush the I cache.
> >
> > First, please try to avoid the term "flush", because it is somewhat overloaded. The architecture speaks of "clean" and "invalidate", which are more precise. Assuming you mean "clean" here: conceptually there is no such thing for the I cache, because it's always clean. The I$ will only be read from the CPU side - from the instruction fetcher - there is nothing written back through it. Every store goes through the data path - always.
> > That is the problem that I tried to sketch you previously: you don't have a guarantee that the instruction you read from memory is the same that the CPU executed. The guest could have changed the instruction after the I$ fetched that. So the CPU will execute (and trap) on instruction X, but you will read Y. I leave it up to your imagination if that could be exploited.
>
> I see what you mean.
>
> Refer Armv8 Arm DDI 0487G.b Page D1-2476,
Re: [PATCH 2/3] x86/vmsi: add support for extended destination ID in address field
On 20.01.2022 16:23, Roger Pau Monne wrote:
> Both QEMU/KVM and HyperV support using bits 11:5 from the MSI address field in order to store the high part of the target APIC ID. This allows expanding the maximum APIC ID usable without interrupt remapping support from 255 to 32768.
>
> Note the interface used by QEMU for emulated devices (via the XEN_DMOP_inject_msi hypercall) already passes both the address and data fields into Xen for processing, so there's no need for any change to QEMU there.
>
> However for PCI passthrough devices QEMU uses the XEN_DOMCTL_bind_pt_irq hypercall which does need an addition to the gflags field in order to pass the high bits of the APIC destination ID.
>
> Introduce a new CPUID flag to signal the support for the feature. The introduced flag covers both the support for extended ID for the IO-APIC RTE and the MSI address registers. Such flag is currently only exposed when the domain is using vPCI (ie: a PVH dom0).

Because of also covering the IO-APIC side, I think the CPUID aspect of this really wants splitting into a 3rd patch. That way the MSI and IO-APIC parts could in principle go in independently, and only the CPUID one needs to remain at the tail.

> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -66,7 +66,7 @@ static void vmsi_inj_irq(
>
>  int vmsi_deliver(
>      struct domain *d, int vector,
> -    uint8_t dest, uint8_t dest_mode,
> +    unsigned int dest, unsigned int dest_mode,

If you change the type of dest_mode, then to "bool" please - see its only call site.

> @@ -123,7 +125,8 @@ void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *pirq_dpci)
>  }
>
>  /* Return value, -1 : multi-dests, non-negative value: dest_vcpu_id */
> -int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode)
> +int hvm_girq_dest_2_vcpu_id(struct domain *d, unsigned int dest,
> +                            unsigned int dest_mode)

Same here then.

> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -54,6 +54,7 @@
>  #define MSI_ADDR_DEST_ID_SHIFT  12
>  #define MSI_ADDR_DEST_ID_MASK   0x00ff000
>  #define MSI_ADDR_DEST_ID(dest)  (((dest) << MSI_ADDR_DEST_ID_SHIFT) & MSI_ADDR_DEST_ID_MASK)
> +#define MSI_ADDR_EXT_DEST_ID_MASK 0xfe0

Especially the immediately preceding macro now becomes kind of stale.

> --- a/xen/drivers/passthrough/x86/hvm.c
> +++ b/xen/drivers/passthrough/x86/hvm.c
> @@ -269,7 +269,7 @@ int pt_irq_create_bind(
>      {
>      case PT_IRQ_TYPE_MSI:
>      {
> -        uint8_t dest, delivery_mode;
> +        unsigned int dest, delivery_mode;
>          bool dest_mode;

If you touch delivery_mode's type, wouldn't that better become bool?

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -588,6 +588,7 @@ struct xen_domctl_bind_pt_irq {
>  #define XEN_DOMCTL_VMSI_X86_DELIV_MASK   0x007000
>  #define XEN_DOMCTL_VMSI_X86_TRIG_MASK    0x008000
>  #define XEN_DOMCTL_VMSI_X86_UNMASKED     0x01
> +#define XEN_DOMCTL_VMSI_X86_EXT_DEST_ID_MASK 0xfe

I'm not convinced it is a good idea to limit the overall destination ID width to 15 bits here - at the interface level we could as well permit more bits right away; the implementation would reject too high a value, of course. Not only with this I further wonder whether the field shouldn't be unsplit while extending it. You won't get away without bumping XEN_DOMCTL_INTERFACE_VERSION anyway (unless it was bumped already for 4.17) since afaics the unused bits of this field previously weren't checked for being zero. We could easily have 8 bits vector, 16 bits flags, and 32 bits destination ID in the struct. And there would then still be 8 unused bits (which from now on we ought to check for being zero).

Jan
Re: possible kernel/libxl race with xl network-attach
On Mon, Jan 24, 2022 at 10:07:54AM +0100, Roger Pau Monné wrote: > On Fri, Jan 21, 2022 at 03:05:07PM +, James Dingwall wrote: > > On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote: > > > On Fri, Jan 21, 2022 at 01:34:54PM +, James Dingwall wrote: > > > > On 2022-01-13 16:11, Roger Pau Monné wrote: > > > > > On Thu, Jan 13, 2022 at 11:19:46AM +, James Dingwall wrote: > > > > > > > > > > > > I have been trying to debug a problem where a vif with the backend > > > > > > in a > > > > > > driver domain is added to dom0. Intermittently the hotplug script > > > > > > is > > > > > > not invoked by libxl (running as xl devd) in the driver domain. By > > > > > > enabling some debug for the driver domain kernel and libxl I have > > > > > > these > > > > > > messages: > > > > > > > > > > > > driver domain kernel (Ubuntu 5.4.0-92-generic): > > > > > > > > > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: > > > > > > Successfully created xenvif > > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: > > > > > > /local/domain/0/device/vif/0 -> Initialising > > > > > > [Thu Jan 13 01:39:31 2022] [26] 470: > > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait > > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: > > > > > > /local/domain/0/device/vif/0 -> Connected > > > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready > > > > > > [Thu Jan 13 01:39:31 2022] [26] 470: > > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected > > > > > > > > > > > > xl devd (Xen 4.14.3): > > > > > > > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 > > > > > > wpath=/local/domain/2/backend token=3/0: event > > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: > > > > > > 
nested ao, parent 0x5633ac567f90 > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 > > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event > > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:1055:devstate_callback: backend > > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting > > > > > > state 4 > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 > > > > > > wpath=/local/domain/2/backend token=3/0: event > > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: > > > > > > nested ao, parent 0x5633ac567f90 > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 > > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event > > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > > libxl_event.c:1055:devstate_callback: backend > > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting > > > > > > state 4 > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend > > > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to > > > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state) > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch > > > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state > > > > > > token=2/1: deregister slotnum=2 > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_event.c:1039:devstate_callback: backend > > > > > > 
/local/domain/2/backend/vif/0/0/state wanted state 2 timed out > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > > w=0x5633ac569180: deregister unregistered > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_device.c:1092:device_backend_callback: calling > > > > > > device_backend_cleanup > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > > w=0x5633ac569180: deregister unregistered > > > > > > 2022-01-13 01:39:51 UTC libxl: error: > > > > > > libxl_device.c:1105:device_backend_callback: unable to add device > > > > > > with path /local/domain/2/backend/vif/0/0 > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > > w=0x5633ac569280: deregister unregistered > > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > > libxl_device.c:1470:device_complete:
Re: [PATCH 1/3] xen/vioapic: add support for the extended destination ID field
On 20.01.2022 16:23, Roger Pau Monne wrote:
> Such field uses bits 55:48, but for the purposes the register will be put to, use bits 55:49 instead. Bit 48 is used to signal an RTE entry is in remappable format, which is not supported by the vIO-APIC.

Neither here nor in the cover letter do you point at a formal specification of this mode of operation. What I'm aware of are vague indications of this mode's existence in some of Intel's chipset data sheets. Yet that leaves open, for example, whether indeed bit 48 cannot be used here.

> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -412,7 +412,8 @@ static void ioapic_inj_irq(
>
>  static void vioapic_deliver(struct hvm_vioapic *vioapic, unsigned int pin)
>  {
> -    uint16_t dest = vioapic->redirtbl[pin].fields.dest_id;
> +    uint16_t dest = vioapic->redirtbl[pin].fields.dest_id |
> +                    (vioapic->redirtbl[pin].fields.ext_dest_id << 8);

What if an existing guest has been writing non-zero in these bits? Can you really use them here without any further indication by the guest?

Jan
[xen-unstable test] 167802: tolerable FAIL
flight 167802 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167802/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop                 fail like 167796
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop                 fail like 167796
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2      fail like 167796
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop                  fail like 167796
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop                  fail like 167796
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check          fail like 167796
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop                  fail like 167796
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop                 fail like 167796
 test-armhf-armhf-libvirt 16 saverestore-support-check              fail like 167796
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop                  fail like 167796
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop                 fail like 167796
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check        fail like 167796
 test-arm64-arm64-xl-seattle 15 migrate-support-check               fail never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check           fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check               fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check              fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check                   fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check                  fail never pass
 test-amd64-i386-xl-pvshim 14 guest-start                           fail never pass
 test-arm64-arm64-xl 15 migrate-support-check                       fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check                   fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check                   fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check               fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check              fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check          fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check               fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check           fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check              fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check          fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check               fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check           fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-raw 14 migrate-support-check               fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check              fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check              fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check          fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check             fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check         fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check                   fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check               fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check            fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check        fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check                  fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check              fail never pass
 test-armhf-armhf-xl-vhd 14 migrate-support-check                   fail never pass
 test-armhf-armhf-xl-vhd 15 saverestore-support-check               fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check               fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check           fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check               fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check           fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check              fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check               fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check           fail never pass
 test-armhf-armhf-xl 15 migrate-support-check                       fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check                   fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check                  fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check            fail never pass

version targeted for testing:
 xen                  fe9be76d880b1d43b9dca471f45af3fd380ecb00
baseline version:
 xen
Re: [XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions
Hi Andre,

Many thanks for your feedback. I have one clarification :-

On 22/01/2022 01:30, Andre Przywara wrote:
> On Thu, 20 Jan 2022 21:55:27 + Ayan Kumar Halder wrote:
>
> Hi,
>
>> At the moment, Xen is only handling data abort with valid syndrome (i.e. ISV=0). Unfortunately, this doesn't cover all the instructions a domain could use to access MMIO regions.
>>
>> For instance, a baremetal OS can use any of the following instructions, where x1 contains the address of the MMIO region:
>>
>> 1. ldr x2,[x1],#4
>
> That looks dodgy, since it misaligns the pointer afterwards. MMIO accesses typically go to device memory, which must be naturally aligned. Just don't give a bad example here and change that to a multiple of 8.
>
>> 2. ldr w2,[x1],#-4
>
> (this one is fine, btw, because it's a 32-bit read)
>
>> 3. ldr x2,[x1],#-8
>> 4. ldr w2,[x1],#4
>> 5. ldrh w2,[x1],#8
>> 6. ldrb w2,[x1],#16
>
> More naturally I'd use the data size of the postindex value ...
> ldr x2 ... #-8
> ldr w2 ... #4
> ldrh w2 ... #2
> ldrb w2 ... #1
>
>> 7. str x2,[x1],#4
>
> This is again problematic, because x1 is not 8-byte aligned anymore after that.
>
>> 8. str w2,[x1],#-4
>> 9. strh w2,[x1],#8
>> 10. strb w2,[x1],#16
>>
>> In the following two instructions, sp contains the address of the MMIO region:-
>
> Really? I don't think you should give people funny ideas, just mention that the Rn register could theoretically be the stack pointer.
>
>> 11. ldrb w2,[sp],#16
>> 12. ldrb wzr, [sp],#16
>>
>> In order to handle post-indexing store/load instructions (like those mentioned above), Xen will need to fetch and decode the instruction.
>>
>> This patch only covers post-index store/load instructions from AArch64 mode. For now, this is left unimplemented for traps from AArch32 mode.
>>
>> Signed-off-by: Ayan Kumar Halder
>> ---
>>
>> Changelog :-
>> v2 - 1. Updated the rn register after reading from it. (Pointed by Julien, Stefano)
>>      2. Used a union to represent the instruction opcode (Suggested by Bertrand)
>>      3. Fixed coding style issues (Pointed by Julien)
>>      4. In the previous patch, I was updating dabt->sign based on the signedness of imm9. This was incorrect. As mentioned in ARMv8 ARM DDI 0487G.b, Page 3221, SSE indicates the signedness of the data item loaded. In our case, the data item loaded is always unsigned.
>>
>> v3 - 1. Handled all the variants of ldr/str (ie 64, 32, 16, 8 bit variants). Thus, I have removed the check for "instr->code.opc == 0" (Suggested by Andre)
>>      2. Handled the scenario when rn = SP, rt = XZR (Suggested by Jan, Andre)
>>      3. Added restriction for "rt != rn" (Suggested by Andre)
>>      4. Moved union ldr_str_instr_class {} to decode.h. This is the header included by io.c and decode.c (where the union is referred). (Suggested by Jan)
>>      5. Indentation and typo fixes (Suggested by Jan)
>>
>> Changes suggested but could not be considered due to reasons :-
>> 1. Using accessor macros instead of bitfields for "ldr_str_instr_class". (Andre)
>>    Reason - I could not find a simple way to represent a 9 bit signed integer (ie imm9) without using bitfields. If I use accessor macros, then I need to manually calculate two's complement to obtain the value when the sign bit is present.
>>
>> 2. I/D cache coherency (Andre)
>>    Reason :- I could not see any instruction to flush the I cache.
>
> First, please try to avoid the term "flush", because it is somewhat overloaded. The architecture speaks of "clean" and "invalidate", which are more precise. Assuming you mean "clean" here: conceptually there is no such thing for the I cache, because it's always clean. The I$ will only be read from the CPU side - from the instruction fetcher - there is nothing written back through it. Every store goes through the data path - always.
> That is the problem that I tried to sketch you previously: you don't have a guarantee that the instruction you read from memory is the same that the CPU executed. The guest could have changed the instruction after the I$ fetched that. So the CPU will execute (and trap) on instruction X, but you will read Y. I leave it up to your imagination if that could be exploited.

I see what you mean.

Refer Armv8 Arm DDI 0487G.b Page D1-2476: it says that (for instr/data abort) the faulting virtual address and IPA are saved in FAR_ELx and HPFAR_EL2 respectively. But, I do not see the faulting instruction being saved in any special register. Is there something I am missing?

Else, :( this is a limitation of the architecture (imo). A hypervisor can be interested to see which instruction caused the abort when ISV = 0.

Also, if an instruction is being modified by the guest (after it has been loaded in the I cache), and if the guest does not invalidate the I cache + ISB, then this is malicious behavior by the guest. Is my
[xen-unstable-smoke test] 167805: tolerable all pass - PUSHED
flight 167805 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167805/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-check        fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check         fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check     fail never pass
 test-armhf-armhf-xl 15 migrate-support-check             fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check         fail never pass

version targeted for testing:
 xen                  ec32910f4f871dce0f0e32dfb36f218fa1a2e869
baseline version:
 xen                  fe9be76d880b1d43b9dca471f45af3fd380ecb00

Last test of basis   167789  2022-01-21 20:02:56 Z  2 days
Testing same since   167805  2022-01-24 08:01:41 Z  0 days  1 attempts

People who touched revisions under test:
  Andrew Cooper
  Artem Bityutskiy
  Jan Beulich
  Rafael J. Wysocki
  Roger Pau Monné

jobs:
 build-arm64-xsm                              pass
 build-amd64                                  pass
 build-armhf                                  pass
 build-amd64-libvirt                          pass
 test-armhf-armhf-xl                          pass
 test-arm64-arm64-xl-xsm                      pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64    pass
 test-amd64-amd64-libvirt                     pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   fe9be76d88..ec32910f4f  ec32910f4f871dce0f0e32dfb36f218fa1a2e869 -> smoke
[libvirt test] 167804: regressions - FAIL
flight 167804 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167804/

Regressions :-(

Tests which did not succeed and are blocking, including tests which could not be run:
 build-arm64-libvirt 6 libvirt-build     fail REGR. vs. 151777
 build-i386-libvirt 6 libvirt-build      fail REGR. vs. 151777
 build-armhf-libvirt 6 libvirt-build     fail REGR. vs. 151777
 build-amd64-libvirt 6 libvirt-build     fail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 1 build-check(1)                           blocked n/a
 test-amd64-amd64-libvirt-pair 1 build-check(1)                      blocked n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd 1 build-check(1)                       blocked n/a
 test-amd64-amd64-libvirt-xsm 1 build-check(1)                       blocked n/a
 test-amd64-i386-libvirt 1 build-check(1)                            blocked n/a
 test-amd64-i386-libvirt-pair 1 build-check(1)                       blocked n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1)  blocked n/a
 test-amd64-i386-libvirt-raw 1 build-check(1)                        blocked n/a
 test-amd64-i386-libvirt-xsm 1 build-check(1)                        blocked n/a
 test-arm64-arm64-libvirt 1 build-check(1)                           blocked n/a
 test-arm64-arm64-libvirt-qcow2 1 build-check(1)                     blocked n/a
 test-arm64-arm64-libvirt-raw 1 build-check(1)                       blocked n/a
 test-armhf-armhf-libvirt-raw 1 build-check(1)                       blocked n/a
 test-arm64-arm64-libvirt-xsm 1 build-check(1)                       blocked n/a
 test-armhf-armhf-libvirt 1 build-check(1)                           blocked n/a
 test-armhf-armhf-libvirt-qcow2 1 build-check(1)                     blocked n/a

version targeted for testing:
 libvirt              eee062d7a263a2c26c375812aeee6736bac53dd0
baseline version:
 libvirt              2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  563 days
Failing since        151818  2020-07-11 04:18:52 Z  562 days  544 attempts
Testing same since   167791  2022-01-22 04:18:47 Z  2 days    3 attempts

People who touched revisions under test:
Adolfo Jayme Barrientos Aleksandr Alekseev Aleksei Zakharov Andika Triwidada Andrea Bolognani Ani Sinha Balázs Meskó Barrett Schonefeld 
Bastian Germann Bastien Orivel BiaoXiang Ye Bihong Yu Binfeng Wu Bjoern Walk Boris Fiuczynski Brian Turek Bruno Haible Chris Mayo Christian Borntraeger Christian Ehrhardt Christian Kirbach Christian Schoenebeck Cole Robinson Collin Walling Cornelia Huck Cédric Bosdonnat Côme Borsoi Daniel Henrique Barboza Daniel Letai Daniel P. Berrange Daniel P. Berrangé Didik Supriadi dinglimin Divya Garg Dmitrii Shcherbakov Dmytro Linkin Eiichi Tsukata Eric Farman Erik Skultety Fabian Affolter Fabian Freyer Fabiano Fidêncio Fangge Jin Farhan Ali Fedora Weblate Translation Franck Ridel Gavi Teitz gongwei Guoyi Tu Göran Uddeborg Halil Pasic Han Han Hao Wang Hela Basa Helmut Grohne Hiroki Narukawa Hyman Huang(黄勇) Ian Wienand Ioanna Alifieraki Ivan Teterevkov Jakob Meng Jamie Strandboge Jamie Strandboge Jan Kuparinen jason lee Jean-Baptiste Holcroft Jia Zhou Jianan Gao Jim Fehlig Jin Yan Jinsheng Zhang Jiri Denemark Joachim Falk John Ferlan Jonathan Watt Jonathon Jongsma Julio Faracco Justin Gatzen Ján Tomko Kashyap Chamarthy Kevin Locke Koichi Murase Kristina Hanicova Laine Stump Laszlo Ersek Lee Yarwood Lei Yang Liao Pingfang Lin Ma Lin Ma Lin Ma Liu Yiding Luke Yue Luyao Zhong Marc Hartmayer Marc-André Lureau Marek Marczykowski-Górecki Markus Schade Martin Kletzander Masayoshi Mizuma Matej Cepl Matt Coleman Matt Coleman Mauro Matteo Cascella Meina Li Michal Privoznik Michał Smyk Milo Casagrande Moshe Levi Muha Aliss Nathan Neal Gompa Nick Chevsky Nick Shyrokovskiy Nickys Music Group Nico Pache Nikolay Shirokovskiy Olaf Hering Olesya Gerasimenko Or Ozeri Orion Poplawski Pany Patrick Magauran Paulo de Rezende Pinatti Pavel Hrdina Peng Liang Peter Krempa Pino Toscano Pino Toscano Piotr Drąg Prathamesh Chavan Praveen K Paladugu Richard W.M. 
Jones Ricky Tigg Robin Lee Rohit Kumar Roman Bogorodskiy Roman Bolshakov Ryan Gahagan Ryan Schmidt Sam Hartman Scott Shambarger Sebastian Mitterle SeongHyun Jo Shalini Chellathurai Saroja Shaojun Yang Shi Lei simmon Simon Chopin Simon Gaiser Simon Rowe Stefan Bader Stefan Berger Stefan Berger
[PATCH v2] xen-mapcache: Avoid entry->lock overflow
In some cases, a particular mapcache entry may be mapped 256 times,
causing the lock field to wrap to 0. For example, this may happen when
using emulated NVMe and the guest submits a large scatter-gather write.
At this point, the entry may be remapped, causing QEMU to write the
wrong data or crash (since the remap is not atomic). Avoid this overflow
by widening the lock field to a uint32_t, and also detect the overflow
and abort rather than continuing regardless.

Signed-off-by: Ross Lagerwall
---
Changes in v2: Change the type to uint32_t since there is a hole there
anyway. The struct size remains at 48 bytes on x86_64.

 hw/i386/xen/xen-mapcache.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index bd47c3d672..f2ef977963 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -52,7 +52,7 @@ typedef struct MapCacheEntry {
     hwaddr paddr_index;
     uint8_t *vaddr_base;
     unsigned long *valid_mapping;
-    uint8_t lock;
+    uint32_t lock;
 #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
     uint8_t flags;
     hwaddr size;
@@ -355,6 +355,12 @@ tryagain:
     if (lock) {
         MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev));
         entry->lock++;
+        if (entry->lock == 0) {
+            fprintf(stderr,
+                    "mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n",
+                    entry->paddr_index, entry->vaddr_base);
+            abort();
+        }
         reventry->dma = dma;
         reventry->vaddr_req = mapcache->last_entry->vaddr_base + address_offset;
         reventry->paddr_index = mapcache->last_entry->paddr_index;
--
2.27.0
[linux-linus test] 167801: tolerable FAIL - PUSHED
flight 167801 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/167801/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop                 fail like 167684
 test-armhf-armhf-libvirt             16 saverestore-support-check  fail like 167684
 test-amd64-amd64-qemuu-nested-amd    20 debian-hvm-install/l1/l2   fail like 167684
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop                 fail like 167684
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop                 fail like 167684
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop                 fail like 167684
 test-armhf-armhf-libvirt-qcow2       15 saverestore-support-check  fail like 167684
 test-armhf-armhf-libvirt-raw         15 saverestore-support-check  fail like 167684
 test-amd64-amd64-libvirt             15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm         15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle          16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl                  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl                  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit2          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit2          16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-xsm              15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm              16 saverestore-support-check  fail never pass
 test-arm64-arm64-libvirt-xsm         15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1          16 saverestore-support-check  fail never pass
 test-arm64-arm64-libvirt-xsm         16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-thunderx         15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-thunderx         16 saverestore-support-check  fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2       14 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-raw         14 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-raw         14 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-raw         15 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-vhd              14 migrate-support-check      fail never pass
 test-arm64-arm64-xl-vhd              15 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit1          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit1          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-rtds             15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds             16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt             15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck       15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck       16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-qcow2       14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-multivcpu        15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-multivcpu        16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-arndale          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl                  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl                  16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-raw         14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd              14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd              15 saverestore-support-check  fail never pass

version targeted for testing:
 linux                dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
baseline version:
 linux                455e73a07f6e288b0061dfcf4fcf54fa9fe06458

Last test of basis   167684  2022-01-13 07:47:42 Z  11 days
Failing since        167693  2022-01-13 22:41:04 Z  10 days  20 attempts
Testing same since   167801  2022-01-23 22:40:37 Z   0 days   1 attempts

936 people touched revisions under test, not listing them all

jobs:
 build-amd64-xsm      pass
 build-arm64-xsm      pass
 build-i386-xsm       pass
 build-amd64          pass
[PATCH 17/19] block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c                         |  5 +
 block/fops.c                        |  4 +---
 drivers/block/drbd/drbd_receiver.c  | 10 --
 drivers/block/rnbd/rnbd-srv.c       |  5 ++---
 drivers/block/xen-blkback/blkback.c | 11 +--
 drivers/block/zram/zram_drv.c       | 11 ---
 drivers/md/dm-log-writes.c          | 21 -
 drivers/md/dm-thin.c                |  9 -
 drivers/md/dm-zoned-metadata.c      | 15 ++-
 drivers/nvdimm/nd_virtio.c          |  6 +++---
 drivers/nvme/target/io-cmd-bdev.c   | 12 ++--
 drivers/nvme/target/passthru.c      |  5 +++--
 drivers/nvme/target/zns.c           |  6 +++---
 drivers/scsi/ufs/ufshpb.c           |  4 ++--
 drivers/target/target_core_iblock.c |  5 ++---
 fs/btrfs/disk-io.c                  |  6 +++---
 fs/buffer.c                         | 14 ++
 fs/crypto/bio.c                     | 13 +++--
 fs/direct-io.c                      |  5 +
 fs/erofs/zdata.c                    |  5 ++---
 fs/ext4/page-io.c                   |  3 +--
 fs/ext4/readpage.c                  |  8
 fs/gfs2/lops.c                      |  8 +++-
 fs/gfs2/meta_io.c                   |  4 +---
 fs/gfs2/ops_fstype.c                |  4 +---
 fs/hfsplus/wrapper.c                |  4 +---
 fs/iomap/buffered-io.c              | 16
 fs/iomap/direct-io.c                |  8 ++--
 fs/jfs/jfs_logmgr.c                 | 11 ++-
 fs/jfs/jfs_metapage.c               |  9 +++--
 fs/mpage.c                          |  7 +++
 fs/nfs/blocklayout/blocklayout.c    |  4 +---
 fs/nilfs2/segbuf.c                  |  4 ++--
 fs/ntfs3/fsntfs.c                   |  8 ++--
 fs/ocfs2/cluster/heartbeat.c        |  4 +---
 fs/squashfs/block.c                 | 11 ++-
 fs/xfs/xfs_bio_io.c                 | 10 --
 fs/xfs/xfs_buf.c                    |  4 +---
 fs/zonefs/super.c                   |  5 ++---
 include/linux/bio.h                 |  5 +++--
 kernel/power/swap.c                 |  5 ++---
 mm/page_io.c                        | 10 --
 42 files changed, 130 insertions(+), 194 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6c3efb0fd12b1..b73c9babd5835 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -347,10 +347,7 @@ EXPORT_SYMBOL(bio_chain);
 struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
 		unsigned int nr_pages, unsigned int opf, gfp_t gfp)
 {
-	struct bio *new = bio_alloc(gfp, nr_pages);
-
-	bio_set_dev(new, bdev);
-	new->bi_opf = opf;
+	struct bio *new = bio_alloc(bdev, nr_pages, opf, gfp);
 
 	if (bio) {
 		bio_chain(bio, new);
diff --git a/block/fops.c b/block/fops.c
index 3a62b8b912750..c683596847731 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -256,9 +256,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		atomic_inc(&dio->ref);
 		submit_bio(bio);
-		bio = bio_alloc(GFP_KERNEL, nr_pages);
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = opf;
+		bio = bio_alloc(bdev, nr_pages, opf, GFP_KERNEL);
 	}
 
 	blk_finish_plug(&plug);
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index fb59b263deeef..04e3ec12d8b49 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1279,7 +1279,8 @@ static void one_flush_endio(struct bio *bio)
 
 static void submit_one_flush(struct drbd_device *device, struct issue_flush_context *ctx)
 {
-	struct bio *bio = bio_alloc(GFP_NOIO, 0);
+	struct bio *bio = bio_alloc(device->ldev->backing_bdev, 0,
+				    REQ_OP_FLUSH | REQ_PREFLUSH, GFP_NOIO);
 	struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
 
 	if (!octx) {
@@ -1297,10 +1298,8 @@ static void submit_one_flush(struct drbd_device *device, struct issue_flush_cont
 	octx->device = device;
 	octx->ctx = ctx;
-	bio_set_dev(bio, device->ldev->backing_bdev);
 	bio->bi_private = octx;
 	bio->bi_end_io = one_flush_endio;
-	bio->bi_opf = REQ_OP_FLUSH | REQ_PREFLUSH;
 
 	device->flush_jif = jiffies;
 	set_bit(FLUSH_PENDING, &device->flags);
@@ -1685,11 +1684,10 @@ int drbd_submit_peer_request(struct drbd_device *device,
 	 * generated bio, but a bio allocated on behalf of the peer.
 	 */
 next_bio:
-	bio =
[PATCH 18/19] block: pass a block_device and opf to bio_init
Pass the block_device that we plan to use this bio for and the operation
to bio_init to optimize the assignment.  A NULL block_device can be
passed, both for the passthrough case on a raw request_queue and to
temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c                       | 27 +--
 block/blk-flush.c                 |  4 +---
 block/blk-zoned.c                 |  5 +
 block/fops.c                      | 18 +-
 drivers/block/floppy.c            |  4 +---
 drivers/block/zram/zram_drv.c     |  5 ++---
 drivers/md/bcache/io.c            |  3 ++-
 drivers/md/bcache/journal.c       |  4 +---
 drivers/md/bcache/movinggc.c      |  4 ++--
 drivers/md/bcache/request.c       |  2 +-
 drivers/md/bcache/super.c         |  8 +++-
 drivers/md/bcache/writeback.c     |  4 ++--
 drivers/md/dm.c                   |  5 ++---
 drivers/md/md-multipath.c         |  2 +-
 drivers/md/md.c                   |  8 +++-
 drivers/md/raid5-cache.c          |  2 +-
 drivers/md/raid5-ppl.c            |  2 +-
 drivers/md/raid5.c                |  4 ++--
 drivers/nvme/target/io-cmd-bdev.c | 10 --
 drivers/nvme/target/passthru.c    |  4 ++--
 drivers/nvme/target/zns.c         |  4 ++--
 fs/iomap/buffered-io.c            |  4 +---
 fs/xfs/xfs_bio_io.c               |  4 +---
 fs/xfs/xfs_log.c                  | 14 +++---
 fs/zonefs/super.c                 |  4 +---
 include/linux/bio.h               |  4 ++--
 26 files changed, 68 insertions(+), 91 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b73c9babd5835..b2133d86e885e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -249,12 +249,12 @@ static void bio_free(struct bio *bio)
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
  */
-void bio_init(struct bio *bio, struct bio_vec *table,
-	      unsigned short max_vecs)
+void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
+	      unsigned short max_vecs, unsigned int opf)
 {
 	bio->bi_next = NULL;
-	bio->bi_bdev = NULL;
-	bio->bi_opf = 0;
+	bio->bi_bdev = bdev;
+	bio->bi_opf = opf;
 	bio->bi_flags = 0;
 	bio->bi_ioprio = 0;
 	bio->bi_write_hint = 0;
@@ -268,6 +268,8 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 #ifdef CONFIG_BLK_CGROUP
 	bio->bi_blkg = NULL;
 	bio->bi_issue.value = 0;
+	if (bdev)
+		bio_associate_blkg(bio);
 #ifdef CONFIG_BLK_CGROUP_IOCOST
 	bio->bi_iocost_cost = 0;
 #endif
@@ -504,17 +506,14 @@ struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs,
 		if (unlikely(!bvl))
 			goto err_free;
 
-		bio_init(bio, bvl, nr_vecs);
+		bio_init(bio, bdev, bvl, nr_vecs, opf);
 	} else if (nr_vecs) {
-		bio_init(bio, bio->bi_inline_vecs, BIO_INLINE_VECS);
+		bio_init(bio, bdev, bio->bi_inline_vecs, BIO_INLINE_VECS, opf);
 	} else {
-		bio_init(bio, NULL, 0);
+		bio_init(bio, bdev, NULL, 0, opf);
 	}
 
 	bio->bi_pool = bs;
-	if (bdev)
-		bio_set_dev(bio, bdev);
-	bio->bi_opf = opf;
 	return bio;
 
 err_free:
@@ -542,7 +541,8 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned short nr_iovecs)
 	bio = kmalloc(struct_size(bio, bi_inline_vecs, nr_iovecs), gfp_mask);
 	if (unlikely(!bio))
 		return NULL;
-	bio_init(bio, nr_iovecs ? bio->bi_inline_vecs : NULL, nr_iovecs);
+	bio_init(bio, NULL, nr_iovecs ? bio->bi_inline_vecs : NULL, nr_iovecs,
+		 0);
 	bio->bi_pool = NULL;
 	return bio;
 }
@@ -1756,9 +1756,8 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, struct block_device *bdev,
 		cache->free_list = bio->bi_next;
 		cache->nr--;
 		put_cpu();
-		bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = opf;
+		bio_init(bio, bdev, nr_vecs ? bio->bi_inline_vecs : NULL,
+			 nr_vecs, opf);
 		bio->bi_pool = bs;
 		bio_set_flag(bio, BIO_PERCPU_CACHE);
 		return bio;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e4df894189ced..c689687248706 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -460,9 +460,7 @@ int blkdev_issue_flush(struct block_device *bdev)
 {
 	struct bio bio;
 
-	bio_init(&bio, NULL, 0);
-	bio_set_dev(&bio, bdev);
-	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+	bio_init(&bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
 	return submit_bio_wait(&bio);
 }
 EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index
[PATCH 15/19] block: pass a block_device and opf to bio_alloc_bioset
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assignment.  NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c                         | 30 +
 block/bounce.c                      |  6 ++
 drivers/block/drbd/drbd_actlog.c    |  5 ++---
 drivers/block/drbd/drbd_bitmap.c    |  7 +++
 drivers/md/bcache/request.c         | 12 +---
 drivers/md/dm-crypt.c               |  5 ++---
 drivers/md/dm-io.c                  |  5 ++---
 drivers/md/dm-writecache.c          |  7 ---
 drivers/md/dm.c                     |  5 +++--
 drivers/md/md.c                     | 16 +++
 drivers/md/raid1.c                  |  3 ++-
 drivers/md/raid10.c                 |  6 ++
 drivers/md/raid5-cache.c            |  8
 drivers/md/raid5-ppl.c              | 11 +--
 drivers/target/target_core_iblock.c |  6 ++
 fs/btrfs/extent_io.c                |  2 +-
 fs/f2fs/data.c                      |  7 +++
 fs/iomap/buffered-io.c              |  6 +++---
 include/linux/bio.h                 |  7 ---
 19 files changed, 75 insertions(+), 79 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a0166f29a05c3..9afc0c2aca6e4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -417,8 +417,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 
 /**
  * bio_alloc_bioset - allocate a bio for I/O
+ * @bdev:	block device to allocate the bio for (can be %NULL)
+ * @nr_vecs:	number of bvecs to pre-allocate
+ * @opf:	operation and flags for bio
  * @gfp_mask:	the GFP_* mask given to the slab allocator
- * @nr_iovecs:	number of iovecs to pre-allocate
  * @bs:		the bio_set to allocate from.
  *
  * Allocate a bio from the mempools in @bs.
@@ -447,15 +449,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  *
  * Returns: Pointer to new bio on success, NULL on failure.
  */
-struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned short nr_iovecs,
+struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs,
+			     unsigned int opf, gfp_t gfp_mask,
 			     struct bio_set *bs)
 {
 	gfp_t saved_gfp = gfp_mask;
 	struct bio *bio;
 	void *p;
 
-	/* should not use nobvec bioset for nr_iovecs > 0 */
-	if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) && nr_iovecs > 0))
+	/* should not use nobvec bioset for nr_vecs > 0 */
+	if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) && nr_vecs > 0))
 		return NULL;
 
 	/*
@@ -492,26 +495,29 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned short nr_iovecs,
 		return NULL;
 
 	bio = p + bs->front_pad;
-	if (nr_iovecs > BIO_INLINE_VECS) {
+	if (nr_vecs > BIO_INLINE_VECS) {
 		struct bio_vec *bvl = NULL;
 
-		bvl = bvec_alloc(&bs->bvec_pool, &nr_iovecs, gfp_mask);
+		bvl = bvec_alloc(&bs->bvec_pool, &nr_vecs, gfp_mask);
 		if (!bvl && gfp_mask != saved_gfp) {
 			punt_bios_to_rescuer(bs);
 			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(&bs->bvec_pool, &nr_iovecs, gfp_mask);
+			bvl = bvec_alloc(&bs->bvec_pool, &nr_vecs, gfp_mask);
 		}
 		if (unlikely(!bvl))
 			goto err_free;
 
-		bio_init(bio, bvl, nr_iovecs);
-	} else if (nr_iovecs) {
+		bio_init(bio, bvl, nr_vecs);
+	} else if (nr_vecs) {
 		bio_init(bio, bio->bi_inline_vecs, BIO_INLINE_VECS);
 	} else {
 		bio_init(bio, NULL, 0);
 	}
 
 	bio->bi_pool = bs;
+	if (bdev)
+		bio_set_dev(bio, bdev);
+	bio->bi_opf = opf;
 	return bio;
 
 err_free:
@@ -767,7 +773,7 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 {
 	struct bio *b;
 
-	b = bio_alloc_bioset(gfp_mask, 0, bs);
+	b = bio_alloc_bioset(NULL, 0, 0, gfp_mask, bs);
 	if (!b)
 		return NULL;
 
@@ -1743,7 +1749,7 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
 	struct bio *bio;
 
 	if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE) || nr_vecs > BIO_INLINE_VECS)
-		return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
+		return bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
 
 	cache = per_cpu_ptr(bs->cache, get_cpu());
 	if (cache->free_list) {
@@ -1757,7 +1763,7 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
 		return bio;
 	}
[PATCH 16/19] block: pass a block_device and opf to bio_alloc_kiocb
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_kiocb to optimize the assignment.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c         | 12
 block/fops.c        | 17 -
 include/linux/bio.h |  4 ++--
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9afc0c2aca6e4..6c3efb0fd12b1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1731,7 +1731,9 @@ EXPORT_SYMBOL(bioset_init_from_src);
 /**
  * bio_alloc_kiocb - Allocate a bio from bio_set based on kiocb
  * @kiocb:	kiocb describing the IO
+ * @bdev:	block device to allocate the bio for (can be %NULL)
  * @nr_vecs:	number of iovecs to pre-allocate
+ * @opf:	operation and flags for bio
  * @bs:		bio_set to allocate from
  *
  * Description:
@@ -1742,14 +1744,14 @@ EXPORT_SYMBOL(bioset_init_from_src);
  *	MUST be done from process context, not hard/soft IRQ.
  *
  */
-struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
-		struct bio_set *bs)
+struct bio *bio_alloc_kiocb(struct kiocb *kiocb, struct block_device *bdev,
+		unsigned short nr_vecs, unsigned int opf, struct bio_set *bs)
 {
 	struct bio_alloc_cache *cache;
 	struct bio *bio;
 
 	if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE) || nr_vecs > BIO_INLINE_VECS)
-		return bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
+		return bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, bs);
 
 	cache = per_cpu_ptr(bs->cache, get_cpu());
 	if (cache->free_list) {
@@ -1758,12 +1760,14 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
 		cache->nr--;
 		put_cpu();
 		bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = opf;
 		bio->bi_pool = bs;
 		bio_set_flag(bio, BIO_PERCPU_CACHE);
 		return bio;
 	}
 	put_cpu();
-	bio = bio_alloc_bioset(NULL, nr_vecs, 0, GFP_KERNEL, bs);
+	bio = bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, bs);
 	bio_set_flag(bio, BIO_PERCPU_CACHE);
 	return bio;
 }
diff --git a/block/fops.c b/block/fops.c
index 26bf15c770d21..3a62b8b912750 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -190,6 +190,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
@@ -197,7 +198,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	    (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
-	bio = bio_alloc_kiocb(iocb, nr_pages, &blkdev_dio_pool);
+	bio = bio_alloc_kiocb(iocb, bdev, nr_pages, opf, &blkdev_dio_pool);
 
 	dio = container_of(bio, struct blkdev_dio, bio);
 	atomic_set(&dio->ref, 1);
@@ -223,7 +224,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	blk_start_plug(&plug);
 
 	for (;;) {
-		bio_set_dev(bio, bdev);
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 		bio->bi_write_hint = iocb->ki_hint;
 		bio->bi_private = dio;
@@ -238,11 +238,9 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 
 		if (is_read) {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & DIO_SHOULD_DIRTY)
 				bio_set_pages_dirty(bio);
 		} else {
-			bio->bi_opf = dio_bio_write_op(iocb);
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 		if (iocb->ki_flags & IOCB_NOWAIT)
@@ -259,6 +257,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		atomic_inc(&dio->ref);
 		submit_bio(bio);
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = opf;
 	}
 
 	blk_finish_plug(&plug);
@@ -311,6 +311,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
+	bool is_read = iov_iter_rw(iter) == READ;
+	unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
@@ -320,11 +322,10 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	    (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
-
[PATCH 19/19] block: pass a block_device and opf to bio_reset
Pass the block_device that we plan to use this bio for and the operation
to bio_reset to optimize the assignment.  A NULL block_device can be
passed, both for the passthrough case on a raw request_queue and to
temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c                 |  6 +-
 drivers/block/pktcdvd.c     |  8 ++--
 drivers/md/bcache/journal.c | 12
 drivers/md/bcache/request.c |  4 ++--
 drivers/md/raid1.c          |  5 ++---
 drivers/md/raid10.c         |  8 +++-
 drivers/md/raid5-cache.c    |  9 +++--
 drivers/md/raid5.c          |  8
 fs/btrfs/disk-io.c          |  4 +---
 fs/crypto/bio.c             |  8 ++--
 include/linux/bio.h         |  9 +
 11 files changed, 29 insertions(+), 52 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b2133d86e885e..03cefe81950f2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -295,6 +295,8 @@ EXPORT_SYMBOL(bio_init);
 /**
  * bio_reset - reinitialize a bio
  * @bio:	bio to reset
+ * @bdev:	block device to use the bio for
+ * @opf:	operation and flags for bio
  *
  * Description:
  *   After calling bio_reset(), @bio will be in the same state as a freshly
@@ -302,11 +304,13 @@ EXPORT_SYMBOL(bio_init);
  *   preserved are the ones that are initialized by bio_alloc_bioset(). See
  *   comment in struct bio.
  */
-void bio_reset(struct bio *bio)
+void bio_reset(struct bio *bio, struct block_device *bdev, unsigned int opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
 	atomic_set(&bio->__bi_remaining, 1);
+	bio->bi_bdev = bdev;
+	bio->bi_opf = opf;
 }
 EXPORT_SYMBOL(bio_reset);
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2b6b70a39e760..3aa5954429462 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1020,9 +1020,8 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
 			continue;
 
 		bio = pkt->r_bios[f];
-		bio_reset(bio);
+		bio_reset(bio, pd->bdev, REQ_OP_READ);
 		bio->bi_iter.bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
-		bio_set_dev(bio, pd->bdev);
 		bio->bi_end_io = pkt_end_io_read;
 		bio->bi_private = pkt;
 
@@ -1034,7 +1033,6 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
 			BUG();
 
 		atomic_inc(&pkt->io_wait);
-		bio_set_op_attrs(bio, REQ_OP_READ, 0);
 		pkt_queue_bio(pd, bio);
 		frames_read++;
 	}
@@ -1235,9 +1233,8 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
 	int f;
 
-	bio_reset(pkt->w_bio);
+	bio_reset(pkt->w_bio, pd->bdev, REQ_OP_WRITE);
 	pkt->w_bio->bi_iter.bi_sector = pkt->sector;
-	bio_set_dev(pkt->w_bio, pd->bdev);
 	pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
 	pkt->w_bio->bi_private = pkt;
 
@@ -1270,7 +1267,6 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 
 	/* Start the write request */
 	atomic_set(&pkt->io_wait, 1);
-	bio_set_op_attrs(pkt->w_bio, REQ_OP_WRITE, 0);
 	pkt_queue_bio(pd, pkt->w_bio);
 }
 
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6d26c5b06e2b6..7c2ca52ca3e43 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -53,14 +53,12 @@ static int journal_read_bucket(struct cache *ca, struct list_head *list,
 reread:		left = ca->sb.bucket_size - offset;
 		len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS);
 
-		bio_reset(bio);
+		bio_reset(bio, ca->bdev, REQ_OP_READ);
 		bio->bi_iter.bi_sector	= bucket + offset;
-		bio_set_dev(bio, ca->bdev);
 		bio->bi_iter.bi_size	= len << 9;
 
 		bio->bi_end_io	= journal_read_endio;
 		bio->bi_private = &cl;
-		bio_set_op_attrs(bio, REQ_OP_READ, 0);
 		bch_bio_map(bio, data);
 
 		closure_bio_submit(ca->set, bio, &cl);
@@ -771,16 +769,14 @@ static void journal_write_unlocked(struct closure *cl)
 
 		atomic_long_add(sectors, &ca->meta_sectors_written);
 
-		bio_reset(bio);
+		bio_reset(bio, ca->bdev, REQ_OP_WRITE |
+			  REQ_SYNC | REQ_META | REQ_PREFLUSH | REQ_FUA);
+		bch_bio_map(bio, w->data);
 		bio->bi_iter.bi_sector	= PTR_OFFSET(k, i);
-		bio_set_dev(bio, ca->bdev);
 		bio->bi_iter.bi_size = sectors << 9;
 
 		bio->bi_end_io	= journal_write_endio;
 		bio->bi_private = w;
-		bio_set_op_attrs(bio, REQ_OP_WRITE,
-				 REQ_SYNC|REQ_META|REQ_PREFLUSH|REQ_FUA);
-
[PATCH 14/19] block: pass a block_device and opf to blk_next_bio
From: Chaitanya Kulkarni

All callers need to set the block_device and operation, so lift that
into the common code.

Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
---
 block/bio.c               |  6 +-
 block/blk-lib.c           | 19 +--
 block/blk-zoned.c         |  9 +++--
 block/blk.h               |  2 --
 drivers/nvme/target/zns.c |  6 +++---
 include/linux/bio.h       |  3 ++-
 6 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1536579ed490a..a0166f29a05c3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -344,10 +344,14 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
-struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
+struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev,
+		unsigned int nr_pages, unsigned int opf, gfp_t gfp)
 {
 	struct bio *new = bio_alloc(gfp, nr_pages);
 
+	bio_set_dev(new, bdev);
+	new->bi_opf = opf;
+
 	if (bio) {
 		bio_chain(bio, new);
 		submit_bio(bio);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9245b300ef73e..1b8ced45e4e55 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -82,11 +82,8 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
-		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio = blk_next_bio(bio, bdev, 0, op, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
-		bio_set_dev(bio, bdev);
-		bio_set_op_attrs(bio, op, 0);
-
 		bio->bi_iter.bi_size = req_sects << 9;
 		sector += req_sects;
 		nr_sects -= req_sects;
@@ -176,14 +173,12 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 	max_write_same_sectors = bio_allowed_max_sectors(q);
 
 	while (nr_sects) {
-		bio = blk_next_bio(bio, 1, gfp_mask);
+		bio = blk_next_bio(bio, bdev, 1, REQ_OP_WRITE_SAME, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
-		bio_set_dev(bio, bdev);
 		bio->bi_vcnt = 1;
 		bio->bi_io_vec->bv_page = page;
 		bio->bi_io_vec->bv_offset = 0;
 		bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
-		bio_set_op_attrs(bio, REQ_OP_WRITE_SAME, 0);
 
 		if (nr_sects > max_write_same_sectors) {
 			bio->bi_iter.bi_size = max_write_same_sectors << 9;
@@ -252,10 +247,8 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EOPNOTSUPP;
 
 	while (nr_sects) {
-		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio = blk_next_bio(bio, bdev, 0, REQ_OP_WRITE_ZEROES, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
 		if (flags & BLKDEV_ZERO_NOUNMAP)
 			bio->bi_opf |= REQ_NOUNMAP;
 
@@ -303,11 +296,9 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 		return -EPERM;
 
 	while (nr_sects != 0) {
-		bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
-				   gfp_mask);
+		bio = blk_next_bio(bio, bdev, __blkdev_sectors_to_bio_pages(nr_sects),
+				   REQ_OP_WRITE, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
-		bio_set_dev(bio, bdev);
-		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 
 		while (nr_sects != 0) {
 			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 774ecc598bee2..5ab755d792c81 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -215,9 +215,8 @@ static int blkdev_zone_reset_all_emulated(struct block_device *bdev,
 			continue;
 		}
 
-		bio = blk_next_bio(bio, 0, gfp_mask);
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_ZONE_RESET | REQ_SYNC;
+		bio = blk_next_bio(bio, bdev, 0, REQ_OP_ZONE_RESET | REQ_SYNC,
+				   gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
 
@@ -306,9 +305,7 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 	}
 
 	while (sector < end_sector) {
-		bio = blk_next_bio(bio, 0, gfp_mask);
-		bio_set_dev(bio, bdev);
-		bio->bi_opf = op | REQ_SYNC;
+		bio = blk_next_bio(bio, bdev, 0, op | REQ_SYNC, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
 
diff --git a/block/blk.h b/block/blk.h
index
[PATCH 13/19] block: move blk_next_bio to bio.c
Keep blk_next_bio next to the core bio infrastructure.

Signed-off-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
---
 block/bio.c     | 13 +++++++++++++
 block/blk-lib.c | 13 -------------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4312a8085396b..1536579ed490a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -344,6 +344,19 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
+struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
+{
+	struct bio *new = bio_alloc(gfp, nr_pages);
+
+	if (bio) {
+		bio_chain(bio, new);
+		submit_bio(bio);
+	}
+
+	return new;
+}
+EXPORT_SYMBOL_GPL(blk_next_bio);
+
 static void bio_alloc_rescue(struct work_struct *work)
 {
 	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9f09beadcbe30..9245b300ef73e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,19 +10,6 @@
 
 #include "blk.h"
 
-struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
-{
-	struct bio *new = bio_alloc(gfp, nr_pages);
-
-	if (bio) {
-		bio_chain(bio, new);
-		submit_bio(bio);
-	}
-
-	return new;
-}
-EXPORT_SYMBOL_GPL(blk_next_bio);
-
 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, int flags,
 		struct bio **biop)
--
2.30.2
[PATCH 12/19] xen-blkback: bio_alloc can't fail if it is allowed to sleep
Remove handling of NULL returns from sleeping bio_alloc calls given that those can't fail. Signed-off-by: Christoph Hellwig --- drivers/block/xen-blkback/blkback.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c index 14e452896d04c..6bb2ad7692065 100644 --- a/drivers/block/xen-blkback/blkback.c +++ b/drivers/block/xen-blkback/blkback.c @@ -1327,9 +1327,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring, seg[i].nsec << 9, seg[i].offset) == 0)) { bio = bio_alloc(GFP_KERNEL, bio_max_segs(nseg - i)); - if (unlikely(bio == NULL)) - goto fail_put_bio; - biolist[nbio++] = bio; bio_set_dev(bio, preq.bdev); bio->bi_private = pending_req; @@ -1346,9 +1343,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring, BUG_ON(operation_flags != REQ_PREFLUSH); bio = bio_alloc(GFP_KERNEL, 0); - if (unlikely(bio == NULL)) - goto fail_put_bio; - biolist[nbio++] = bio; bio_set_dev(bio, preq.bdev); bio->bi_private = pending_req; @@ -1381,14 +1375,6 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring, free_req(ring, pending_req); msleep(1); /* back off a bit */ return -EIO; - - fail_put_bio: - for (i = 0; i < nbio; i++) - bio_put(biolist[i]); - atomic_set(_req->pendcnt, 1); - __end_block_io_op(pending_req, BLK_STS_RESOURCE); - msleep(1); /* back off a bit */ - return -EIO; } -- 2.30.2
[PATCH 06/19] dm-crypt: remove clone_init
Just open code it next to the bio allocations, which saves a few lines of code, prepares for future changes and allows to remove the duplicate bi_opf assignment for the bio_clone_fast case in kcryptd_io_read. Signed-off-by: Christoph Hellwig --- drivers/md/dm-crypt.c | 21 - 1 file changed, 8 insertions(+), 13 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 20abe3486aba1..3c5ecd35d3483 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -234,7 +234,7 @@ static volatile unsigned long dm_crypt_pages_per_client; #define DM_CRYPT_MEMORY_PERCENT2 #define DM_CRYPT_MIN_PAGES_PER_CLIENT (BIO_MAX_VECS * 16) -static void clone_init(struct dm_crypt_io *, struct bio *); +static void crypt_endio(struct bio *clone); static void kcryptd_queue_crypt(struct dm_crypt_io *io); static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc, struct scatterlist *sg); @@ -1673,7 +1673,10 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size) mutex_lock(>bio_alloc_lock); clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, >bs); - clone_init(io, clone); + clone->bi_private = io; + clone->bi_end_io = crypt_endio; + bio_set_dev(clone, cc->dev->bdev); + clone->bi_opf = io->base_bio->bi_opf; remaining_size = size; @@ -1826,16 +1829,6 @@ static void crypt_endio(struct bio *clone) crypt_dec_pending(io); } -static void clone_init(struct dm_crypt_io *io, struct bio *clone) -{ - struct crypt_config *cc = io->cc; - - clone->bi_private = io; - clone->bi_end_io = crypt_endio; - bio_set_dev(clone, cc->dev->bdev); - clone->bi_opf = io->base_bio->bi_opf; -} - static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp) { struct crypt_config *cc = io->cc; @@ -1850,10 +1843,12 @@ static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp) clone = bio_clone_fast(io->base_bio, gfp, >bs); if (!clone) return 1; + clone->bi_private = io; + clone->bi_end_io = crypt_endio; + bio_set_dev(clone, cc->dev->bdev); crypt_inc_pending(io); - 
clone_init(io, clone); clone->bi_iter.bi_sector = cc->start + io->sector; if (dm_crypt_integrity_io_alloc(io, clone)) { -- 2.30.2
[PATCH 11/19] rnbd-srv: remove struct rnbd_dev_blk_io
Only the priv field of rnbd_dev_blk_io is used, so store the value of that in bio->bi_private directly and remove the entire bio_set overhead. Signed-off-by: Christoph Hellwig --- drivers/block/rnbd/rnbd-srv-dev.c | 4 +--- drivers/block/rnbd/rnbd-srv-dev.h | 13 ++--- drivers/block/rnbd/rnbd-srv.c | 27 --- drivers/block/rnbd/rnbd-srv.h | 1 - 4 files changed, 7 insertions(+), 38 deletions(-) diff --git a/drivers/block/rnbd/rnbd-srv-dev.c b/drivers/block/rnbd/rnbd-srv-dev.c index 98d3e591a0885..c5d0a03911659 100644 --- a/drivers/block/rnbd/rnbd-srv-dev.c +++ b/drivers/block/rnbd/rnbd-srv-dev.c @@ -12,8 +12,7 @@ #include "rnbd-srv-dev.h" #include "rnbd-log.h" -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags, - struct bio_set *bs) +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags) { struct rnbd_dev *dev; int ret; @@ -30,7 +29,6 @@ struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags, dev->blk_open_flags = flags; bdevname(dev->bdev, dev->name); - dev->ibd_bio_set = bs; return dev; diff --git a/drivers/block/rnbd/rnbd-srv-dev.h b/drivers/block/rnbd/rnbd-srv-dev.h index 1a14ece0be726..2c3df02b5e8ec 100644 --- a/drivers/block/rnbd/rnbd-srv-dev.h +++ b/drivers/block/rnbd/rnbd-srv-dev.h @@ -14,25 +14,16 @@ struct rnbd_dev { struct block_device *bdev; - struct bio_set *ibd_bio_set; fmode_t blk_open_flags; charname[BDEVNAME_SIZE]; }; -struct rnbd_dev_blk_io { - struct rnbd_dev *dev; - void *priv; - /* have to be last member for front_pad usage of bioset_init */ - struct bio bio; -}; - /** * rnbd_dev_open() - Open a device + * @path: path to open * @flags: open flags - * @bs:bio_set to use during block io, */ -struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags, - struct bio_set *bs); +struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags); /** * rnbd_dev_close() - Close a device diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c index 6d228af1dcc35..ff9b389976078 100644 --- 
a/drivers/block/rnbd/rnbd-srv.c +++ b/drivers/block/rnbd/rnbd-srv.c @@ -116,9 +116,7 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess) static void rnbd_dev_bi_end_io(struct bio *bio) { - struct rnbd_dev_blk_io *io = bio->bi_private; - - rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status)); + rnbd_endio(bio->bi_private, blk_status_to_errno(bio->bi_status)); bio_put(bio); } @@ -131,7 +129,6 @@ static int process_rdma(struct rnbd_srv_session *srv_sess, struct rnbd_srv_sess_dev *sess_dev; u32 dev_id; int err; - struct rnbd_dev_blk_io *io; struct bio *bio; short prio; @@ -152,7 +149,7 @@ static int process_rdma(struct rnbd_srv_session *srv_sess, priv->sess_dev = sess_dev; priv->id = id; - bio = bio_alloc_bioset(GFP_KERNEL, 1, sess_dev->rnbd_dev->ibd_bio_set); + bio = bio_alloc(GFP_KERNEL, 1); if (bio_add_page(bio, virt_to_page(data), datalen, offset_in_page(data)) != datalen) { rnbd_srv_err(sess_dev, "Failed to map data to bio\n"); @@ -160,12 +157,8 @@ static int process_rdma(struct rnbd_srv_session *srv_sess, goto bio_put; } - io = container_of(bio, struct rnbd_dev_blk_io, bio); - io->dev = sess_dev->rnbd_dev; - io->priv = priv; - bio->bi_end_io = rnbd_dev_bi_end_io; - bio->bi_private = io; + bio->bi_private = priv; bio->bi_opf = rnbd_to_bio_flags(le32_to_cpu(msg->rw)); bio->bi_iter.bi_sector = le64_to_cpu(msg->sector); bio->bi_iter.bi_size = le32_to_cpu(msg->bi_size); @@ -260,7 +253,6 @@ static void destroy_sess(struct rnbd_srv_session *srv_sess) out: xa_destroy(_sess->index_idr); - bioset_exit(_sess->sess_bio_set); pr_info("RTRS Session %s disconnected\n", srv_sess->sessname); @@ -289,16 +281,6 @@ static int create_sess(struct rtrs_srv_sess *rtrs) return -ENOMEM; srv_sess->queue_depth = rtrs_srv_get_queue_depth(rtrs); - err = bioset_init(_sess->sess_bio_set, srv_sess->queue_depth, - offsetof(struct rnbd_dev_blk_io, bio), - BIOSET_NEED_BVECS); - if (err) { - pr_err("Allocating srv_session for path %s failed\n", - pathname); - kfree(srv_sess); - 
return err; - } - xa_init_flags(_sess->index_idr, XA_FLAGS_ALLOC); INIT_LIST_HEAD(_sess->sess_dev_list); mutex_init(_sess->lock); @@ -747,8 +729,7 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess, goto
[PATCH 10/19] rnbd-srv: simplify bio mapping in process_rdma
The memory mapped in process_rdma is contiguous, so there is no need to loop over bio_add_page. Remove rnbd_bio_map_kern and just open code the bio allocation and mapping in the caller. Signed-off-by: Christoph Hellwig Reviewed-by: Jack Wang Tested-by: Jack Wang --- drivers/block/rnbd/rnbd-srv-dev.c | 57 --- drivers/block/rnbd/rnbd-srv-dev.h | 5 --- drivers/block/rnbd/rnbd-srv.c | 23 + 3 files changed, 16 insertions(+), 69 deletions(-) diff --git a/drivers/block/rnbd/rnbd-srv-dev.c b/drivers/block/rnbd/rnbd-srv-dev.c index b241a099aeae2..98d3e591a0885 100644 --- a/drivers/block/rnbd/rnbd-srv-dev.c +++ b/drivers/block/rnbd/rnbd-srv-dev.c @@ -44,60 +44,3 @@ void rnbd_dev_close(struct rnbd_dev *dev) blkdev_put(dev->bdev, dev->blk_open_flags); kfree(dev); } - -void rnbd_dev_bi_end_io(struct bio *bio) -{ - struct rnbd_dev_blk_io *io = bio->bi_private; - - rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status)); - bio_put(bio); -} - -/** - * rnbd_bio_map_kern - map kernel address into bio - * @data: pointer to buffer to map - * @bs: bio_set to use. - * @len: length in bytes - * @gfp_mask: allocation flags for bio allocation - * - * Map the kernel address into a bio suitable for io to a block - * device. Returns an error pointer in case of error. 
- */ -struct bio *rnbd_bio_map_kern(void *data, struct bio_set *bs, - unsigned int len, gfp_t gfp_mask) -{ - unsigned long kaddr = (unsigned long)data; - unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT; - unsigned long start = kaddr >> PAGE_SHIFT; - const int nr_pages = end - start; - int offset, i; - struct bio *bio; - - bio = bio_alloc_bioset(gfp_mask, nr_pages, bs); - if (!bio) - return ERR_PTR(-ENOMEM); - - offset = offset_in_page(kaddr); - for (i = 0; i < nr_pages; i++) { - unsigned int bytes = PAGE_SIZE - offset; - - if (len <= 0) - break; - - if (bytes > len) - bytes = len; - - if (bio_add_page(bio, virt_to_page(data), bytes, - offset) < bytes) { - /* we don't support partial mappings */ - bio_put(bio); - return ERR_PTR(-EINVAL); - } - - data += bytes; - len -= bytes; - offset = 0; - } - - return bio; -} diff --git a/drivers/block/rnbd/rnbd-srv-dev.h b/drivers/block/rnbd/rnbd-srv-dev.h index 0eb23850afb95..1a14ece0be726 100644 --- a/drivers/block/rnbd/rnbd-srv-dev.h +++ b/drivers/block/rnbd/rnbd-srv-dev.h @@ -41,11 +41,6 @@ void rnbd_dev_close(struct rnbd_dev *dev); void rnbd_endio(void *priv, int error); -void rnbd_dev_bi_end_io(struct bio *bio); - -struct bio *rnbd_bio_map_kern(void *data, struct bio_set *bs, - unsigned int len, gfp_t gfp_mask); - static inline int rnbd_dev_get_max_segs(const struct rnbd_dev *dev) { return queue_max_segments(bdev_get_queue(dev->bdev)); diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c index 1ee808fc600cf..6d228af1dcc35 100644 --- a/drivers/block/rnbd/rnbd-srv.c +++ b/drivers/block/rnbd/rnbd-srv.c @@ -114,6 +114,14 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess) return sess_dev; } +static void rnbd_dev_bi_end_io(struct bio *bio) +{ + struct rnbd_dev_blk_io *io = bio->bi_private; + + rnbd_endio(io->priv, blk_status_to_errno(bio->bi_status)); + bio_put(bio); +} + static int process_rdma(struct rnbd_srv_session *srv_sess, struct rtrs_srv_op *id, void *data, u32 
datalen, const void *usr, size_t usrlen) @@ -144,12 +152,12 @@ static int process_rdma(struct rnbd_srv_session *srv_sess, priv->sess_dev = sess_dev; priv->id = id; - /* Generate bio with pages pointing to the rdma buffer */ - bio = rnbd_bio_map_kern(data, sess_dev->rnbd_dev->ibd_bio_set, datalen, GFP_KERNEL); - if (IS_ERR(bio)) { - err = PTR_ERR(bio); - rnbd_srv_err(sess_dev, "Failed to generate bio, err: %d\n", err); - goto sess_dev_put; + bio = bio_alloc_bioset(GFP_KERNEL, 1, sess_dev->rnbd_dev->ibd_bio_set); + if (bio_add_page(bio, virt_to_page(data), datalen, + offset_in_page(data)) != datalen) { + rnbd_srv_err(sess_dev, "Failed to map data to bio\n"); + err = -EINVAL; + goto bio_put; } io = container_of(bio, struct rnbd_dev_blk_io, bio); @@ -170,7 +178,8 @@ static int process_rdma(struct rnbd_srv_session *srv_sess, return 0; -sess_dev_put: +bio_put: + bio_put(bio); rnbd_put_sess_dev(sess_dev); err:
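The removed rnbd_bio_map_kern sized its bio by counting how many pages a contiguous kernel buffer touches. That page-span arithmetic can be checked in isolation; the sketch below assumes a 4 KiB page for the constants and is illustrative only.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Number of pages the contiguous range [kaddr, kaddr + len) touches,
 * computed the same way the removed helper did. */
static unsigned long nr_pages_spanned(unsigned long kaddr, unsigned long len)
{
    unsigned long end   = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned long start = kaddr >> PAGE_SHIFT;

    return end - start;
}
```

The simplified process_rdma drops the loop because the RDMA buffer is contiguous, so a single bio_add_page covers it; the != datalen check catches the case where it would not.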
[PATCH 09/19] drbd: bio_alloc can't fail if it is allowed to sleep
Remove handling of NULL returns from sleeping bio_alloc calls given that those can't fail. Signed-off-by: Christoph Hellwig --- drivers/block/drbd/drbd_receiver.c | 22 -- 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index 6df2539e215ba..fb59b263deeef 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -1281,14 +1281,13 @@ static void submit_one_flush(struct drbd_device *device, struct issue_flush_cont { struct bio *bio = bio_alloc(GFP_NOIO, 0); struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO); - if (!bio || !octx) { - drbd_warn(device, "Could not allocate a bio, CANNOT ISSUE FLUSH\n"); + + if (!octx) { + drbd_warn(device, "Could not allocate a octx, CANNOT ISSUE FLUSH\n"); /* FIXME: what else can I do now? disconnecting or detaching * really does not help to improve the state of the world, either. */ - kfree(octx); - if (bio) - bio_put(bio); + bio_put(bio); ctx->error = -ENOMEM; put_ldev(device); @@ -1646,7 +1645,6 @@ int drbd_submit_peer_request(struct drbd_device *device, unsigned data_size = peer_req->i.size; unsigned n_bios = 0; unsigned nr_pages = (data_size + PAGE_SIZE -1) >> PAGE_SHIFT; - int err = -ENOMEM; /* TRIM/DISCARD: for now, always use the helper function * blkdev_issue_zeroout(..., discard=true). 
@@ -1688,10 +1686,6 @@ int drbd_submit_peer_request(struct drbd_device *device, */ next_bio: bio = bio_alloc(GFP_NOIO, nr_pages); - if (!bio) { - drbd_err(device, "submit_ee: Allocation of a bio failed (nr_pages=%u)\n", nr_pages); - goto fail; - } /* > peer_req->i.sector, unless this is the first bio */ bio->bi_iter.bi_sector = sector; bio_set_dev(bio, device->ldev->backing_bdev); @@ -1726,14 +1720,6 @@ int drbd_submit_peer_request(struct drbd_device *device, drbd_submit_bio_noacct(device, fault_type, bio); } while (bios); return 0; - -fail: - while (bios) { - bio = bios; - bios = bios->bi_next; - bio_put(bio); - } - return err; } static void drbd_remove_epoch_entry_interval(struct drbd_device *device, -- 2.30.2
[PATCH 08/19] dm-thin: use blkdev_issue_flush instead of open coding it
Use blkdev_issue_flush, which uses an on-stack bio instead of an opencoded version with a bio embedded into struct pool. Signed-off-by: Christoph Hellwig --- drivers/md/dm-thin.c | 11 +-- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index 76a9c2e9aeeea..411a3f56ed90c 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -282,8 +282,6 @@ struct pool { struct dm_bio_prison_cell **cell_sort_array; mempool_t mapping_pool; - - struct bio flush_bio; }; static void metadata_operation_failed(struct pool *pool, const char *op, int r); @@ -2906,7 +2904,6 @@ static void __pool_destroy(struct pool *pool) if (pool->next_mapping) mempool_free(pool->next_mapping, >mapping_pool); mempool_exit(>mapping_pool); - bio_uninit(>flush_bio); dm_deferred_set_destroy(pool->shared_read_ds); dm_deferred_set_destroy(pool->all_io_ds); kfree(pool); @@ -2987,7 +2984,6 @@ static struct pool *pool_create(struct mapped_device *pool_md, pool->low_water_triggered = false; pool->suspended = true; pool->out_of_data_space = false; - bio_init(>flush_bio, NULL, 0); pool->shared_read_ds = dm_deferred_set_create(); if (!pool->shared_read_ds) { @@ -3194,13 +3190,8 @@ static void metadata_low_callback(void *context) static int metadata_pre_commit_callback(void *context) { struct pool *pool = context; - struct bio *flush_bio = >flush_bio; - - bio_reset(flush_bio); - bio_set_dev(flush_bio, pool->data_dev); - flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - return submit_bio_wait(flush_bio); + return blkdev_issue_flush(pool->data_dev); } static sector_t get_dev_size(struct block_device *bdev) -- 2.30.2
[PATCH 07/19] dm-snap: use blkdev_issue_flush instead of open coding it
Use blkdev_issue_flush, which uses an on-stack bio instead of an opencoded version with a bio embedded into struct dm_snapshot. Signed-off-by: Christoph Hellwig --- drivers/md/dm-snap.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index dcf34c6b05ad3..0d336b5ec5714 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -141,11 +141,6 @@ struct dm_snapshot { * for them to be committed. */ struct bio_list bios_queued_during_merge; - - /* -* Flush data after merge. -*/ - struct bio flush_bio; }; /* @@ -1127,17 +1122,6 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s) static void error_bios(struct bio *bio); -static int flush_data(struct dm_snapshot *s) -{ - struct bio *flush_bio = >flush_bio; - - bio_reset(flush_bio); - bio_set_dev(flush_bio, s->origin->bdev); - flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - - return submit_bio_wait(flush_bio); -} - static void merge_callback(int read_err, unsigned long write_err, void *context) { struct dm_snapshot *s = context; @@ -1151,7 +1135,7 @@ static void merge_callback(int read_err, unsigned long write_err, void *context) goto shut; } - if (flush_data(s) < 0) { + if (blkdev_issue_flush(s->origin->bdev) < 0) { DMERR("Flush after merge failed: shutting down merge"); goto shut; } @@ -1340,7 +1324,6 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) s->first_merging_chunk = 0; s->num_merging_chunks = 0; bio_list_init(>bios_queued_during_merge); - bio_init(>flush_bio, NULL, 0); /* Allocate hash table for COW data */ if (init_hash_tables(s)) { @@ -1528,8 +1511,6 @@ static void snapshot_dtr(struct dm_target *ti) dm_exception_store_destroy(s->store); - bio_uninit(>flush_bio); - dm_put_device(ti, s->cow); dm_put_device(ti, s->origin); -- 2.30.2
[PATCH 05/19] dm: bio_alloc can't fail if it is allowed to sleep
Remove handling of NULL returns from sleeping bio_alloc calls given that those can't fail. Signed-off-by: Christoph Hellwig --- drivers/md/dm-crypt.c | 5 + drivers/md/dm-log-writes.c | 18 -- drivers/md/dm-thin.c | 25 + drivers/md/dm-zoned-metadata.c | 11 --- drivers/md/dm.c| 2 -- 5 files changed, 10 insertions(+), 51 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index d4ae31558826a..20abe3486aba1 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1673,9 +1673,6 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size) mutex_lock(>bio_alloc_lock); clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, >bs); - if (!clone) - goto out; - clone_init(io, clone); remaining_size = size; @@ -1702,7 +1699,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size) bio_put(clone); clone = NULL; } -out: + if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM)) mutex_unlock(>bio_alloc_lock); diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index 139b09b06eda9..25f5e8d2d417b 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -218,10 +218,6 @@ static int write_metadata(struct log_writes_c *lc, void *entry, size_t ret; bio = bio_alloc(GFP_KERNEL, 1); - if (!bio) { - DMERR("Couldn't alloc log bio"); - goto error; - } bio->bi_iter.bi_size = 0; bio->bi_iter.bi_sector = sector; bio_set_dev(bio, lc->logdev->bdev); @@ -276,11 +272,6 @@ static int write_inline_data(struct log_writes_c *lc, void *entry, atomic_inc(>io_blocks); bio = bio_alloc(GFP_KERNEL, bio_pages); - if (!bio) { - DMERR("Couldn't alloc inline data bio"); - goto error; - } - bio->bi_iter.bi_size = 0; bio->bi_iter.bi_sector = sector; bio_set_dev(bio, lc->logdev->bdev); @@ -322,7 +313,6 @@ static int write_inline_data(struct log_writes_c *lc, void *entry, error_bio: bio_free_pages(bio); bio_put(bio); -error: put_io_block(lc); return -1; } @@ -364,10 +354,6 @@ static int log_one_block(struct log_writes_c *lc, 
atomic_inc(>io_blocks); bio = bio_alloc(GFP_KERNEL, bio_max_segs(block->vec_cnt)); - if (!bio) { - DMERR("Couldn't alloc log bio"); - goto error; - } bio->bi_iter.bi_size = 0; bio->bi_iter.bi_sector = sector; bio_set_dev(bio, lc->logdev->bdev); @@ -387,10 +373,6 @@ static int log_one_block(struct log_writes_c *lc, submit_bio(bio); bio = bio_alloc(GFP_KERNEL, bio_max_segs(block->vec_cnt - i)); - if (!bio) { - DMERR("Couldn't alloc log bio"); - goto error; - } bio->bi_iter.bi_size = 0; bio->bi_iter.bi_sector = sector; bio_set_dev(bio, lc->logdev->bdev); diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index ec119d2422d5d..76a9c2e9aeeea 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -1180,24 +1180,17 @@ static void process_prepared_discard_passdown_pt1(struct dm_thin_new_mapping *m) } discard_parent = bio_alloc(GFP_NOIO, 1); - if (!discard_parent) { - DMWARN("%s: unable to allocate top level discard bio for passdown. Skipping passdown.", - dm_device_name(tc->pool->pool_md)); - queue_passdown_pt2(m); + discard_parent->bi_end_io = passdown_endio; + discard_parent->bi_private = m; - } else { - discard_parent->bi_end_io = passdown_endio; - discard_parent->bi_private = m; - - if (m->maybe_shared) - passdown_double_checking_shared_status(m, discard_parent); - else { - struct discard_op op; + if (m->maybe_shared) + passdown_double_checking_shared_status(m, discard_parent); + else { + struct discard_op op; - begin_discard(, tc, discard_parent); - r = issue_discard(, m->data_block, data_end); - end_discard(, r); - } + begin_discard(, tc, discard_parent); + r = issue_discard(, m->data_block, data_end); + end_discard(, r); } } diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c index ee4626d085574..5718b83cc7182
[PATCH 04/19] ntfs3: remove ntfs_alloc_bio
bio_alloc will never fail if it is allowed to sleep, so there is no need for this loop. Also remove the __GFP_HIGH specifier as it doesn't make sense here given that we'll always fall back to the mempool anyway. Signed-off-by: Christoph Hellwig --- fs/ntfs3/fsntfs.c | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/fs/ntfs3/fsntfs.c b/fs/ntfs3/fsntfs.c index 4de9acb169689..4a255e21ecf5f 100644 --- a/fs/ntfs3/fsntfs.c +++ b/fs/ntfs3/fsntfs.c @@ -1443,17 +1443,6 @@ int ntfs_write_bh(struct ntfs_sb_info *sbi, struct NTFS_RECORD_HEADER *rhdr, return err; } -static inline struct bio *ntfs_alloc_bio(u32 nr_vecs) -{ - struct bio *bio = bio_alloc(GFP_NOFS | __GFP_HIGH, nr_vecs); - - if (!bio && (current->flags & PF_MEMALLOC)) { - while (!bio && (nr_vecs /= 2)) - bio = bio_alloc(GFP_NOFS | __GFP_HIGH, nr_vecs); - } - return bio; -} - /* * ntfs_bio_pages - Read/write pages from/to disk. */ @@ -1496,11 +1485,7 @@ int ntfs_bio_pages(struct ntfs_sb_info *sbi, const struct runs_tree *run, lbo = ((u64)lcn << cluster_bits) + off; len = ((u64)clen << cluster_bits) - off; new_bio: - new = ntfs_alloc_bio(nr_pages - page_idx); - if (!new) { - err = -ENOMEM; - goto out; - } + new = bio_alloc(GFP_NOFS, nr_pages - page_idx); if (bio) { bio_chain(bio, new); submit_bio(bio); @@ -1599,11 +1584,7 @@ int ntfs_bio_fill_1(struct ntfs_sb_info *sbi, const struct runs_tree *run) lbo = (u64)lcn << cluster_bits; len = (u64)clen << cluster_bits; new_bio: - new = ntfs_alloc_bio(BIO_MAX_VECS); - if (!new) { - err = -ENOMEM; - break; - } + new = bio_alloc(GFP_NOFS, BIO_MAX_VECS); if (bio) { bio_chain(bio, new); submit_bio(bio); -- 2.30.2
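The deleted ntfs_alloc_bio loop was a classic shrink-and-retry allocation: on failure, halve the request until it succeeds or reaches zero. A userspace model of that pattern, with a stub allocator that succeeds only up to a threshold (names are illustrative):

```c
#include <assert.h>

/* Stub allocator: pretend allocations succeed only up to 'max_ok' vecs. */
static unsigned max_ok = 4;

static int try_alloc(unsigned nr_vecs)
{
    return nr_vecs <= max_ok;
}

/* The retry shape of the removed helper: halve the request on failure.
 * Returns the size that finally succeeded, or 0 on total failure. */
static unsigned alloc_halving(unsigned nr_vecs)
{
    if (try_alloc(nr_vecs))
        return nr_vecs;
    while (nr_vecs /= 2) {
        if (try_alloc(nr_vecs))
            return nr_vecs;
    }
    return 0;
}
```

Since a sleeping bio_alloc falls back to a mempool and cannot return NULL, the failure branch of this shape is unreachable — which is exactly why the patch can delete the helper outright.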
[PATCH 03/19] nfs/blocklayout: remove bl_alloc_init_bio
bio_alloc will never fail when it can sleep. Remove the now simple bl_alloc_init_bio helper and open code it in the only caller. Signed-off-by: Christoph Hellwig --- fs/nfs/blocklayout/blocklayout.c | 26 +- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c index fe860c5387476..38e063af7e98a 100644 --- a/fs/nfs/blocklayout/blocklayout.c +++ b/fs/nfs/blocklayout/blocklayout.c @@ -115,23 +115,6 @@ bl_submit_bio(struct bio *bio) return NULL; } -static struct bio *bl_alloc_init_bio(unsigned int npg, - struct block_device *bdev, sector_t disk_sector, - bio_end_io_t end_io, struct parallel_io *par) -{ - struct bio *bio; - - npg = bio_max_segs(npg); - bio = bio_alloc(GFP_NOIO, npg); - if (bio) { - bio->bi_iter.bi_sector = disk_sector; - bio_set_dev(bio, bdev); - bio->bi_end_io = end_io; - bio->bi_private = par; - } - return bio; -} - static bool offset_in_map(u64 offset, struct pnfs_block_dev_map *map) { return offset >= map->start && offset < map->start + map->len; @@ -171,10 +154,11 @@ do_add_page_to_bio(struct bio *bio, int npg, int rw, sector_t isect, retry: if (!bio) { - bio = bl_alloc_init_bio(npg, map->bdev, - disk_addr >> SECTOR_SHIFT, end_io, par); - if (!bio) - return ERR_PTR(-ENOMEM); + bio = bio_alloc(GFP_NOIO, bio_max_segs(npg)); + bio->bi_iter.bi_sector = disk_addr >> SECTOR_SHIFT; + bio_set_dev(bio, map->bdev); + bio->bi_end_io = end_io; + bio->bi_private = par; bio_set_op_attrs(bio, rw, 0); } if (bio_add_page(bio, page, *len, offset) < *len) { -- 2.30.2
[PATCH 02/19] nilfs2: remove nilfs_alloc_seg_bio
bio_alloc will never fail when it can sleep. Remove the now simple nilfs_alloc_seg_bio helper and open code it in the only caller. Signed-off-by: Christoph Hellwig --- fs/nilfs2/segbuf.c | 31 --- 1 file changed, 4 insertions(+), 27 deletions(-) diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index 43287b0d3e9b6..53b7c6d21cdd8 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -371,29 +371,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf, return err; } -/** - * nilfs_alloc_seg_bio - allocate a new bio for writing log - * @nilfs: nilfs object - * @start: start block number of the bio - * @nr_vecs: request size of page vector. - * - * Return Value: On success, pointer to the struct bio is returned. - * On error, NULL is returned. - */ -static struct bio *nilfs_alloc_seg_bio(struct the_nilfs *nilfs, sector_t start, - int nr_vecs) -{ - struct bio *bio; - - bio = bio_alloc(GFP_NOIO, nr_vecs); - if (likely(bio)) { - bio_set_dev(bio, nilfs->ns_bdev); - bio->bi_iter.bi_sector = - start << (nilfs->ns_blocksize_bits - 9); - } - return bio; -} - static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf, struct nilfs_write_info *wi) { @@ -414,10 +391,10 @@ static int nilfs_segbuf_submit_bh(struct nilfs_segment_buffer *segbuf, BUG_ON(wi->nr_vecs <= 0); repeat: if (!wi->bio) { - wi->bio = nilfs_alloc_seg_bio(wi->nilfs, wi->blocknr + wi->end, - wi->nr_vecs); - if (unlikely(!wi->bio)) - return -ENOMEM; + wi->bio = bio_alloc(GFP_NOIO, wi->nr_vecs); + bio_set_dev(wi->bio, wi->nilfs->ns_bdev); + wi->bio->bi_iter.bi_sector = (wi->blocknr + wi->end) << + (wi->nilfs->ns_blocksize_bits - 9); } len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh)); -- 2.30.2
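The sector computation that moves inline here — converting a filesystem block number to a 512-byte device sector — is a shift by (blocksize_bits - 9). A quick check of that arithmetic; the block sizes in the assertions are examples, not values taken from nilfs2 itself.

```c
#include <assert.h>
#include <stdint.h>

/* Filesystem block number -> 512-byte device sector, as computed in
 * nilfs_segbuf_submit_bh after the change. */
static uint64_t blk_to_sector(uint64_t blocknr, unsigned blocksize_bits)
{
    return blocknr << (blocksize_bits - 9);
}
```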
improve the bio allocation interface v2
Hi Jens, this series is posted early because it has wide-ranging changes and could use some early ACKs before -rc1. It changes the interface to the bio allocators to always pass a block_device and the operation, which is information needed for every bio submitted through bio_submit. This means the fields can be directly initialized in bio_init instead of first being zeroed and thus should help to micro-optimize even better than the __bio_set_dev that Pavel proposed while also cleaning up code. I have a follow on series to also deal with the bio cloning interfaces that need even more love, and additional cleanups for the callers which might be material for the next merge window. Changes since v1: - fix bio_add_page return value handling in rnbd-srv - fix bio clenup in rnbd-srv - fix a few commit message typos - fix a bisection hazard in rnbd - fix an initialization order issue in bio_alloc_bioset/bio_init Diffstat: block/bio.c | 73 block/blk-flush.c |4 - block/blk-lib.c | 32 ++- block/blk-zoned.c | 14 +- block/blk.h |2 block/bounce.c |6 -- block/fops.c| 35 +++-- drivers/block/drbd/drbd_actlog.c|5 -- drivers/block/drbd/drbd_bitmap.c|7 +-- drivers/block/drbd/drbd_receiver.c | 32 +++ drivers/block/floppy.c |4 - drivers/block/pktcdvd.c |8 --- drivers/block/rnbd/rnbd-srv-dev.c | 61 -- drivers/block/rnbd/rnbd-srv-dev.h | 18 drivers/block/rnbd/rnbd-srv.c | 45 -- drivers/block/rnbd/rnbd-srv.h |1 drivers/block/xen-blkback/blkback.c | 25 ++-- drivers/block/zram/zram_drv.c | 16 ++- drivers/md/bcache/io.c |3 - drivers/md/bcache/journal.c | 16 ++- drivers/md/bcache/movinggc.c|4 - drivers/md/bcache/request.c | 18 +++- drivers/md/bcache/super.c |8 +-- drivers/md/bcache/writeback.c |4 - drivers/md/dm-crypt.c | 27 - drivers/md/dm-io.c |5 -- drivers/md/dm-log-writes.c | 39 +++ drivers/md/dm-snap.c| 21 -- drivers/md/dm-thin.c| 41 +--- drivers/md/dm-writecache.c |7 +-- drivers/md/dm-zoned-metadata.c | 26 ++-- drivers/md/dm.c | 12 ++--- drivers/md/md-multipath.c |2 drivers/md/md.c | 24 
+-- drivers/md/raid1.c |8 +-- drivers/md/raid10.c | 14 ++ drivers/md/raid5-cache.c| 19 +++-- drivers/md/raid5-ppl.c | 13 ++ drivers/md/raid5.c | 12 ++--- drivers/nvdimm/nd_virtio.c |6 +- drivers/nvme/target/io-cmd-bdev.c | 18 +++- drivers/nvme/target/passthru.c |7 +-- drivers/nvme/target/zns.c | 14 +++--- drivers/scsi/ufs/ufshpb.c |4 - drivers/target/target_core_iblock.c | 11 + fs/btrfs/disk-io.c | 10 +--- fs/btrfs/extent_io.c|2 fs/buffer.c | 14 ++ fs/crypto/bio.c | 13 ++ fs/direct-io.c |5 -- fs/erofs/zdata.c|5 -- fs/ext4/page-io.c |3 - fs/ext4/readpage.c |8 +-- fs/f2fs/data.c |7 +-- fs/gfs2/lops.c |8 +-- fs/gfs2/meta_io.c |4 - fs/gfs2/ops_fstype.c|4 - fs/hfsplus/wrapper.c|4 - fs/iomap/buffered-io.c | 26 +--- fs/iomap/direct-io.c|8 --- fs/jfs/jfs_logmgr.c | 11 - fs/jfs/jfs_metapage.c |9 +--- fs/mpage.c | 34 ++-- fs/nfs/blocklayout/blocklayout.c| 26 +--- fs/nilfs2/segbuf.c | 31 +-- fs/ntfs3/fsntfs.c | 27 - fs/ocfs2/cluster/heartbeat.c|4 - fs/squashfs/block.c | 11 ++--- fs/xfs/xfs_bio_io.c | 14 ++ fs/xfs/xfs_buf.c|4 - fs/xfs/xfs_log.c| 14 +++--- fs/zonefs/super.c |9 +--- include/linux/bio.h | 30 ++ kernel/power/swap.c |5 -- mm/page_io.c| 10 +--- 75 files changed, 372 insertions(+), 759 deletions(-)
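The interface change the cover letter describes — passing the block_device and operation into the allocator so the fields are initialized once, instead of allocating a zeroed bio and patching it afterwards — can be sketched with stub types. This is a simplified model, not the real signatures, which live in include/linux/bio.h.

```c
#include <assert.h>
#include <stdlib.h>

struct bdev_stub { int id; };

struct bio_stub {
    struct bdev_stub *bi_bdev;
    unsigned int      bi_opf;
};

/* Old shape: allocate zeroed, then patch each field at every call site. */
static struct bio_stub *bio_alloc_old(void)
{
    return calloc(1, sizeof(struct bio_stub));
}

/* New shape: the allocator takes the device and operation up front and
 * initializes the fields every submission needs, in one place. */
static struct bio_stub *bio_alloc_new(struct bdev_stub *bdev, unsigned int opf)
{
    struct bio_stub *bio = malloc(sizeof(*bio));

    bio->bi_bdev = bdev;
    bio->bi_opf = opf;
    return bio;
}
```

Beyond shorter call sites, this is what enables the micro-optimization mentioned above: bio_init can store the values directly rather than zeroing fields that are immediately overwritten.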
[PATCH 01/19] fs: remove mpage_alloc
open code mpage_alloc in it's two callers and simplify the results because of the context: - __mpage_writepage always passes GFP_NOFS and can thus always sleep and will never get a NULL return from bio_alloc at all. - do_mpage_readpage can only get a non-sleeping context for readahead which never sets PF_MEMALLOC and thus doesn't need the retry loop either. Both cases will never have __GFP_HIGH set. Signed-off-by: Christoph Hellwig --- fs/mpage.c | 35 ++- 1 file changed, 6 insertions(+), 29 deletions(-) diff --git a/fs/mpage.c b/fs/mpage.c index 87f5cfef6caa7..06e95d777e940 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -66,29 +66,6 @@ static struct bio *mpage_bio_submit(int op, int op_flags, struct bio *bio) return NULL; } -static struct bio * -mpage_alloc(struct block_device *bdev, - sector_t first_sector, int nr_vecs, - gfp_t gfp_flags) -{ - struct bio *bio; - - /* Restrict the given (page cache) mask for slab allocations */ - gfp_flags &= GFP_KERNEL; - bio = bio_alloc(gfp_flags, nr_vecs); - - if (bio == NULL && (current->flags & PF_MEMALLOC)) { - while (!bio && (nr_vecs /= 2)) - bio = bio_alloc(gfp_flags, nr_vecs); - } - - if (bio) { - bio_set_dev(bio, bdev); - bio->bi_iter.bi_sector = first_sector; - } - return bio; -} - /* * support function for mpage_readahead. The fs supplied get_block might * return an up to date buffer. 
This is used to map that buffer into @@ -296,10 +273,11 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args) page)) goto out; } - args->bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9), - bio_max_segs(args->nr_pages), gfp); + args->bio = bio_alloc(gfp, bio_max_segs(args->nr_pages)); if (args->bio == NULL) goto confused; + bio_set_dev(args->bio, bdev); + args->bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9); } length = first_hole << blkbits; @@ -608,10 +586,9 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, page, wbc)) goto out; } - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9), - BIO_MAX_VECS, GFP_NOFS|__GFP_HIGH); - if (bio == NULL) - goto confused; + bio = bio_alloc(GFP_NOFS, BIO_MAX_VECS); + bio_set_dev(bio, bdev); + bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9); wbc_init_bio(wbc, bio); bio->bi_write_hint = inode->i_write_hint; -- 2.30.2
Re: possible kernel/libxl race with xl network-attach
On Fri, Jan 21, 2022 at 03:05:07PM +, James Dingwall wrote: > On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote: > > On Fri, Jan 21, 2022 at 01:34:54PM +, James Dingwall wrote: > > > On 2022-01-13 16:11, Roger Pau Monné wrote: > > > > On Thu, Jan 13, 2022 at 11:19:46AM +, James Dingwall wrote: > > > > > > > > > > I have been trying to debug a problem where a vif with the backend > > > > > in a > > > > > driver domain is added to dom0. Intermittently the hotplug script is > > > > > not invoked by libxl (running as xl devd) in the driver domain. By > > > > > enabling some debug for the driver domain kernel and libxl I have > > > > > these > > > > > messages: > > > > > > > > > > driver domain kernel (Ubuntu 5.4.0-92-generic): > > > > > > > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: > > > > > Successfully created xenvif > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: > > > > > /local/domain/0/device/vif/0 -> Initialising > > > > > [Thu Jan 13 01:39:31 2022] [26] 470: > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: > > > > > /local/domain/0/device/vif/0 -> Connected > > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready > > > > > [Thu Jan 13 01:39:31 2022] [26] 470: > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected > > > > > > > > > > xl devd (Xen 4.14.3): > > > > > > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 > > > > > wpath=/local/domain/2/backend token=3/0: event > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: > > > > > nested ao, parent 0x5633ac567f90 > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 > 
> > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:1055:devstate_callback: backend > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting > > > > > state 4 > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 > > > > > wpath=/local/domain/2/backend token=3/0: event > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: > > > > > nested ao, parent 0x5633ac567f90 > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event > > > > > epath=/local/domain/2/backend/vif/0/0/state > > > > > 2022-01-13 01:39:31 UTC libxl: debug: > > > > > libxl_event.c:1055:devstate_callback: backend > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting > > > > > state 4 > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend > > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to > > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state) > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch > > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state > > > > > token=2/1: deregister slotnum=2 > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:1039:devstate_callback: backend > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 timed out > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > w=0x5633ac569180: deregister unregistered > > > > > 
2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_device.c:1092:device_backend_callback: calling > > > > > device_backend_cleanup > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > w=0x5633ac569180: deregister unregistered > > > > > 2022-01-13 01:39:51 UTC libxl: error: > > > > > libxl_device.c:1105:device_backend_callback: unable to add device > > > > > with path /local/domain/2/backend/vif/0/0 > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch > > > > > w=0x5633ac569280: deregister unregistered > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_device.c:1470:device_complete: device > > > > > /local/domain/2/backend/vif/0/0 add failed > > > > > 2022-01-13 01:39:51 UTC libxl: debug: > > > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy > > > > > > > > > > the xenstore content for the backend: > > > > > >
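The race in the log above is easier to follow with the XenbusState numbering in view: `xl devd` keeps waiting for the backend to reach state 2 (InitWait), but the driver-domain kernel has already driven it to state 4 (Connected), so an exact-match wait can never fire and times out. A small sketch of the two predicates; the state values follow Xen's public xenbus.h, while at_or_past() is only a hypothetical illustration of a more tolerant check, not libxl's actual code:

```c
#include <stdbool.h>

/* XenbusState values as defined in xen/include/public/io/xenbus.h. */
enum xenbus_state {
	XenbusStateUnknown      = 0,
	XenbusStateInitialising = 1,
	XenbusStateInitWait     = 2, /* what libxl is waiting for */
	XenbusStateInitialised  = 3,
	XenbusStateConnected    = 4, /* what the fast backend already reached */
	XenbusStateClosing      = 5,
	XenbusStateClosed       = 6,
};

/* Exact-match wait, as the devstate_callback log lines show: never
 * satisfied once the backend has raced past the wanted state. */
static bool exact_match(enum xenbus_state seen, enum xenbus_state wanted)
{
	return seen == wanted;
}

/* Hypothetical tolerant predicate: treat "already at or past the wanted
 * state (but not tearing down)" as success. */
static bool at_or_past(enum xenbus_state seen, enum xenbus_state wanted)
{
	return seen >= wanted && seen < XenbusStateClosing;
}
```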
Re: Debian Bug#1004269: Linker segfault while building src:xen
On 23.01.2022 22:52, Hans van Kranenburg wrote: > (To both the Debian bug # and xen-devel list, reply-all is fine) > Hi Xen people, > > I just filed a bug at Debian on the binutils package, because since the > latest binutils package update (Debian 2.37.50.20220106-2), Xen (both > 4.14 and 4.16) fails to build with a segfault at the following point: > > x86_64-linux-gnu-ld -mi386pep --subsystem=10 > --image-base=0x82d04000 --stack=0,0 --heap=0,0 > --section-alignment=0x20 --file-alignment=0x20 > --major-image-version=4 --minor-image-version=16 --major-os-version=2 > --minor-os-version=0 --major-subsystem-version=2 > --minor-subsystem-version=0 --no-insert-timestamp --build-id=sha1 -T > efi.lds -N prelink.o > /builds/xen-team/debian-xen/debian/output/source_dir/xen/common/symbols-dummy.o > > -b pe-x86-64 efi/buildid.o -o > /builds/xen-team/debian-xen/debian/output/source_dir/xen/.xen.efi.0x82d04000.0 > > && : > Segmentation fault (core dumped) > > Full message and links to build logs etc are in the initial bug message, > to be seen at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004269 > > We (Debian Xen Team) are awaiting response, but I thought to also let > you know already. > > * Does the above error 'ring a bell'? > * Can you maybe also reproduce this in a development environment with > very latest binutils? I've tried with a 1.5 weeks old snapshot I had lying about; no "success". But your and my builds are surely quite different anyway. > * Maybe someone has a useful comment for the Debian binutils maintainer > about what's happening in this step of the build? It's xen.efi that's being linked; not sure what else to say. > * Any suggestions about what we can do to help figure this out? I'm pretty certain this needs debugging from the binutils side, so I guess you will want to report it there (even more so with 2.38 around the corner). I guess it's actually debugging ld or bisecting which provide the best chance of figuring out what's going on. 
Jan > * We'll try to help debug, but will surely appreciate upstream help if > things get too technical. It's simply the case that I did not have to > look into a very similar issue before, so it's new. > > Thanks! > Hans >
[PATCH v2 4/4] x86/APIC: make connections between seemingly arbitrary numbers
Making adjustments to arbitrarily chosen values shouldn't require
auditing the code for possible derived numbers - such a change should
be doable in a single place, having an effect on all code depending on
that choice.

For one make the TDCR write actually use APIC_DIVISOR. With the
necessary mask constant introduced, also use that in vLAPIC code. While
introducing the constant, drop APIC_TDR_DIV_TMBASE: The bit has been
undefined in halfway recent SDM and PM versions.

And then introduce a constant tying together the scale used when
converting nanoseconds to bus clocks.

No functional change intended.

Signed-off-by: Jan Beulich
---
I thought we have a generic "glue" macro, but I couldn't find one.
Hence I'm (ab)using _AC().
---
v2: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1078,8 +1078,8 @@ static void __setup_APIC_LVTT(unsigned i
     lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
     apic_write(APIC_LVTT, lvtt_value);
 
-    tmp_value = apic_read(APIC_TDCR);
-    apic_write(APIC_TDCR, tmp_value | APIC_TDR_DIV_1);
+    tmp_value = apic_read(APIC_TDCR) & ~APIC_TDR_DIV_MASK;
+    apic_write(APIC_TDCR, tmp_value | _AC(APIC_TDR_DIV_, APIC_DIVISOR));
 
     apic_write(APIC_TMICT, clocks / APIC_DIVISOR);
 }
@@ -1196,6 +1196,8 @@ static void __init check_deadline_errata
  * APIC irq that way.
  */
 
+#define BUS_SCALE_SHIFT 18
+
 static void __init calibrate_APIC_clock(void)
 {
     unsigned long bus_freq; /* KAF: pointer-size avoids compile warns. */
@@ -1249,8 +1251,8 @@ static void __init calibrate_APIC_clock(
     /* set up multipliers for accurate timer code */
     bus_cycle = 1000000000000UL / bus_freq; /* in pico seconds */
     bus_cycle += (1000000000000UL % bus_freq) * 2 > bus_freq;
-    bus_scale = (1000*262144)/bus_cycle;
-    bus_scale += ((1000 * 262144) % bus_cycle) * 2 > bus_cycle;
+    bus_scale = (1000 << BUS_SCALE_SHIFT) / bus_cycle;
+    bus_scale += ((1000 << BUS_SCALE_SHIFT) % bus_cycle) * 2 > bus_cycle;
 
     apic_printk(APIC_VERBOSE, ". bus_scale = %#x\n", bus_scale);
 
     /* reset APIC to zero timeout value */
@@ -1337,7 +1339,8 @@ int reprogram_timer(s_time_t timeout)
     }
 
     if ( timeout && ((expire = timeout - NOW()) > 0) )
-        apic_tmict = min_t(u64, (bus_scale * expire) >> 18, UINT_MAX);
+        apic_tmict = min_t(uint64_t, (bus_scale * expire) >> BUS_SCALE_SHIFT,
+                           UINT32_MAX);
 
     apic_write(APIC_TMICT, (unsigned long)apic_tmict);
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -580,7 +580,7 @@ static uint32_t vlapic_get_tmcct(const s
 static void vlapic_set_tdcr(struct vlapic *vlapic, unsigned int val)
 {
     /* Only bits 0, 1 and 3 are settable; others are MBZ. */
-    val &= 0xb;
+    val &= APIC_TDR_DIV_MASK;
     vlapic_set_reg(vlapic, APIC_TDCR, val);
 
     /* Update the demangled hw.timer_divisor. */
@@ -887,7 +887,7 @@ void vlapic_reg_write(struct vcpu *v, un
     {
         uint32_t current_divisor = vlapic->hw.timer_divisor;
 
-        vlapic_set_tdcr(vlapic, val & 0xb);
+        vlapic_set_tdcr(vlapic, val);
 
         vlapic_update_timer(vlapic, vlapic_get_reg(vlapic, APIC_LVTT), false,
                             current_divisor);
@@ -1019,7 +1019,7 @@ int guest_wrmsr_x2apic(struct vcpu *v, u
         break;
 
     case APIC_TDCR:
-        if ( msr_content & ~APIC_TDR_DIV_1 )
+        if ( msr_content & ~APIC_TDR_DIV_MASK )
            return X86EMUL_EXCEPTION;
         break;
--- a/xen/arch/x86/include/asm/apicdef.h
+++ b/xen/arch/x86/include/asm/apicdef.h
@@ -106,7 +106,7 @@
 #define	APIC_TMICT	0x380
 #define	APIC_TMCCT	0x390
 #define	APIC_TDCR	0x3E0
-#define	APIC_TDR_DIV_TMBASE	(1<<2)
+#define	APIC_TDR_DIV_MASK	0xB
 #define	APIC_TDR_DIV_1	0xB
 #define	APIC_TDR_DIV_2	0x0
 #define	APIC_TDR_DIV_4	0x1
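The `_AC(APIC_TDR_DIV_, APIC_DIVISOR)` trick in this patch relies on two-level token pasting, so the suffix macro expands before the paste. A standalone sketch with a stand-in GLUE() macro (the APIC_TDR_DIV_* values are taken from the hunk above; GLUE itself is illustrative, not Xen's _AC definition verbatim):

```c
/* Two-level paste: the outer macro lets its arguments expand first,
 * the inner one glues the expanded tokens together. */
#define GLUE_(x, y) x ## y
#define GLUE(x, y)  GLUE_(x, y)   /* stand-in for the _AC() (ab)use */

#define APIC_DIVISOR   1
#define APIC_TDR_DIV_1 0xB
#define APIC_TDR_DIV_2 0x0

/* GLUE(APIC_TDR_DIV_, APIC_DIVISOR) -> GLUE_(APIC_TDR_DIV_, 1)
 * -> APIC_TDR_DIV_1 -> 0xB */
static int tdcr_value(void)
{
    return GLUE(APIC_TDR_DIV_, APIC_DIVISOR);
}
```

Changing only APIC_DIVISOR now changes the pasted divider constant everywhere, which is exactly the "single place" property the commit message asks for.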
[PATCH v2 3/4] x86/APIC: skip unnecessary parts of __setup_APIC_LVTT()
In TDT mode there's no point writing TDCR or TMICT, while outside of
that mode there's no need for the MFENCE. No change intended to overall
functioning.

Signed-off-by: Jan Beulich
---
v2: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1059,24 +1059,25 @@ static void __setup_APIC_LVTT(unsigned i
 {
     unsigned int lvtt_value, tmp_value;
 
-    /* NB. Xen uses local APIC timer in one-shot mode. */
-    lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
-
     if ( tdt_enabled )
     {
-        lvtt_value &= (~APIC_TIMER_MODE_MASK);
-        lvtt_value |= APIC_TIMER_MODE_TSC_DEADLINE;
+        lvtt_value = APIC_TIMER_MODE_TSC_DEADLINE | LOCAL_TIMER_VECTOR;
+        apic_write(APIC_LVTT, lvtt_value);
+
+        /*
+         * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+         * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+         * According to Intel, MFENCE can do the serialization here.
+         */
+        asm volatile( "mfence" : : : "memory" );
+
+        return;
     }
 
+    /* NB. Xen uses local APIC timer in one-shot mode. */
+    lvtt_value = /*APIC_TIMER_MODE_PERIODIC |*/ LOCAL_TIMER_VECTOR;
     apic_write(APIC_LVTT, lvtt_value);
 
-    /*
-     * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
-     * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
-     * According to Intel, MFENCE can do the serialization here.
-     */
-    asm volatile( "mfence" : : : "memory" );
-
     tmp_value = apic_read(APIC_TDCR);
     apic_write(APIC_TDCR, tmp_value | APIC_TDR_DIV_1);
[PATCH v2 2/4] x86/APIC: calibrate against platform timer when possible
Use the original calibration against PIT only when the platform timer
is PIT. This implicitly excludes the "xen_guest" case from using the
PIT logic (init_pit() fails there, and as of 5e73b2594c54 ["x86/time:
minor adjustments to init_pit()"] using_pit also isn't being set too
early anymore), so the respective hack there can be dropped at the same
time. This also reduces calibration time from 100ms to 50ms, albeit
this step is being skipped as of 0731a56c7c72 ("x86/APIC: no need for
timer calibration when using TDT") anyway.

While re-indenting the PIT logic in calibrate_APIC_clock(), besides
adjusting style also switch around the 2nd TSC/TMCCT read pair, to
match the order of the 1st one, yielding more consistent deltas.

Signed-off-by: Jan Beulich
---
Open-coding apic_read() in read_tmcct() isn't overly nice, but I wanted
to avoid x2apic_enabled being evaluated twice in close succession. (The
barrier is there just in case only anyway: While this RDMSR isn't
serializing, I'm unaware of any statement whether it can also be
executed speculatively, like RDTSC can.) An option might be to move the
function to apic.c such that it would also be used by
calibrate_APIC_clock().

Unlike the CPU frequencies enumerated in CPUID leaf 0x16 (which aren't
precise), using CPUID[0x15].ECX - if populated - may be an option to
skip calibration altogether. Iirc the value there is precise, but using
the systems I have easy access to I cannot verify this: In the sample
of three I have, none have ECX populated.

I wonder whether the secondary CPU freq measurement (used for display
purposes only) wouldn't better be dropped at this occasion.
---
v2: New.
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1182,20 +1182,6 @@ static void __init check_deadline_errata
                "please update microcode to version %#x (or later)\n", rev);
 }
 
-static void __init wait_tick_pvh(void)
-{
-    u64 lapse_ns = 1000000000ULL / HZ;
-    s_time_t start, curr_time;
-
-    start = NOW();
-
-    /* Won't wrap around */
-    do {
-        cpu_relax();
-        curr_time = NOW();
-    } while ( curr_time - start < lapse_ns );
-}
-
 /*
  * In this function we calibrate APIC bus clocks to the external
  * timer. Unfortunately we cannot use jiffies and the timer irq
@@ -1211,9 +1197,6 @@ static void __init wait_tick_pvh(void)
 
 static void __init calibrate_APIC_clock(void)
 {
-    unsigned long long t1, t2;
-    unsigned long tt1, tt2;
-    unsigned int i;
     unsigned long bus_freq; /* KAF: pointer-size avoids compile warns. */
     unsigned int bus_cycle; /* length of one bus cycle in pico-seconds */
 #define LOOPS_FRAC 10U /* measure for one tenth of a second */
@@ -1226,39 +1209,38 @@ static void __init calibrate_APIC_clock(
      */
     __setup_APIC_LVTT(0xffffffff);
 
-    if ( !xen_guest )
+    bus_freq = calibrate_apic_timer();
+    if ( !bus_freq )
+    {
+        unsigned int i, tt1, tt2;
+        unsigned long t1, t2;
+
+        ASSERT(!xen_guest);
+
         /*
-         * The timer chip counts down to zero. Let's wait
-         * for a wraparound to start exact measurement:
-         * (the current tick might have been already half done)
+         * The timer chip counts down to zero. Let's wait for a wraparound to
+         * start exact measurement (the current tick might have been already
+         * half done):
          */
         wait_8254_wraparound();
-    else
-        wait_tick_pvh();
 
-    /*
-     * We wrapped around just now. Let's start:
-     */
-    t1 = rdtsc_ordered();
-    tt1 = apic_read(APIC_TMCCT);
+        /* We wrapped around just now. Let's start: */
+        t1 = rdtsc_ordered();
+        tt1 = apic_read(APIC_TMCCT);
 
-    /*
-     * Let's wait HZ / LOOPS_FRAC ticks:
-     */
-    for (i = 0; i < HZ / LOOPS_FRAC; i++)
-        if ( !xen_guest )
+        /* Let's wait HZ / LOOPS_FRAC ticks: */
+        for ( i = 0; i < HZ / LOOPS_FRAC; ++i )
             wait_8254_wraparound();
-        else
-            wait_tick_pvh();
 
-    tt2 = apic_read(APIC_TMCCT);
-    t2 = rdtsc_ordered();
+        t2 = rdtsc_ordered();
+        tt2 = apic_read(APIC_TMCCT);
 
-    bus_freq = (tt1 - tt2) * APIC_DIVISOR * LOOPS_FRAC;
+        bus_freq = (tt1 - tt2) * APIC_DIVISOR * LOOPS_FRAC;
 
-    apic_printk(APIC_VERBOSE, ". CPU clock speed is %lu.%04lu MHz.\n",
-                ((unsigned long)(t2 - t1) * LOOPS_FRAC) / 100,
-                (((unsigned long)(t2 - t1) * LOOPS_FRAC) / 100) % 1);
+        apic_printk(APIC_VERBOSE, ". CPU clock speed is %lu.%04lu MHz.\n",
+                    ((t2 - t1) * LOOPS_FRAC) / 100,
+                    (((t2 - t1) * LOOPS_FRAC) / 100) % 1);
+    }
 
     apic_printk(APIC_VERBOSE, ". host bus clock speed is %ld.%04ld MHz.\n",
                 bus_freq / 100, (bus_freq / 100) % 1);
--- a/xen/arch/x86/include/asm/apic.h
+++ b/xen/arch/x86/include/asm/apic.h
@@ -192,6 +192,8 @@ extern void record_boot_APIC_mode(void);
 extern enum apic_mode
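The measurement at the heart of this patch is just a down-counter delta scaled back up to a full second: the APIC timer counts down, so the ticks consumed over the HZ/LOOPS_FRAC window are tt1 - tt2, and multiplying by LOOPS_FRAC converts "ticks per tenth of a second" into Hz. A sketch of that arithmetic (APIC_DIVISOR = 1 and LOOPS_FRAC = 10 as in this series; the sample counter values in the usage are made up):

```c
#include <stdint.h>

#define APIC_DIVISOR 1
#define LOOPS_FRAC   10U /* measure for one tenth of a second */

/* tt1/tt2 are TMCCT reads taken HZ / LOOPS_FRAC ticks apart; the
 * counter counts *down*, hence tt1 - tt2 is the consumed tick count. */
static uint64_t bus_freq_hz(uint32_t tt1, uint32_t tt2)
{
    return (uint64_t)(tt1 - tt2) * APIC_DIVISOR * LOOPS_FRAC;
}
```

For example, a counter dropping by 10,000,000 over one tenth of a second corresponds to a 100 MHz bus clock.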
[PATCH v2 1/4] x86/time: further improve TSC / CPU freq calibration accuracy
Calibration logic assumes that the platform timer (HPET or ACPI PM
timer) and the TSC are read at about the same time. This assumption may
not hold when a long latency event (e.g. SMI or NMI) occurs between the
two reads. Reduce the risk of reading uncorrelated values by doing at
least four pairs of reads, using the tuple where the delta between the
enclosing TSC reads was smallest. From the fourth iteration onwards bail
if the new TSC delta isn't better (smaller) than the best earlier one.

Signed-off-by: Jan Beulich
---
When running virtualized, scheduling in the host would also constitute
long latency events. I wonder whether, to compensate for that, we'd want
more than 3 "base" iterations, as I would expect scheduling events to
occur more frequently than e.g. SMI (and with a higher probability of
multiple ones occurring in close succession).
---
v2: Use helper functions to fold duplicate code.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -287,9 +287,47 @@ static char *freq_string(u64 freq)
     return s;
 }
 
-static uint64_t adjust_elapsed(uint64_t elapsed, uint32_t actual,
-                               uint32_t target)
+static uint32_t __init read_pt_and_tsc(uint64_t *tsc,
+                                       const struct platform_timesource *pts)
 {
+    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
+    uint32_t best = best;
+    unsigned int i;
+
+    for ( i = 0; ; ++i )
+    {
+        uint32_t pt = pts->read_counter();
+        uint64_t tsc_cur = rdtsc_ordered();
+        uint64_t tsc_delta = tsc_cur - tsc_prev;
+
+        if ( tsc_delta < tsc_min )
+        {
+            tsc_min = tsc_delta;
+            *tsc = tsc_cur;
+            best = pt;
+        }
+        else if ( i > 2 )
+            break;
+
+        tsc_prev = tsc_cur;
+    }
+
+    return best;
+}
+
+static uint64_t __init calibrate_tsc(const struct platform_timesource *pts)
+{
+    uint64_t start, end, elapsed;
+    uint32_t count = read_pt_and_tsc(&start, pts);
+    uint32_t target = CALIBRATE_VALUE(pts->frequency), actual;
+    uint32_t mask = (uint32_t)~0 >> (32 - pts->counter_bits);
+
+    while ( ((pts->read_counter() - count) & mask) < target )
+        continue;
+
+    actual = read_pt_and_tsc(&end, pts) - count;
+    elapsed = end - start;
+
     if ( likely(actual > target) )
     {
         /*
@@ -395,8 +433,7 @@ static u64 read_hpet_count(void)
 
 static int64_t __init init_hpet(struct platform_timesource *pts)
 {
-    uint64_t hpet_rate, start;
-    uint32_t count, target, elapsed;
+    uint64_t hpet_rate;
 
     /*
      * Allow HPET to be setup, but report a frequency of 0 so it's not selected
     * as a timer source. This is required so it can be used in legacy
@@ -467,13 +504,7 @@ static int64_t __init init_hpet(struct p
 
     pts->frequency = hpet_rate;
 
-    count = hpet_read32(HPET_COUNTER);
-    start = rdtsc_ordered();
-    target = CALIBRATE_VALUE(hpet_rate);
-    while ( (elapsed = hpet_read32(HPET_COUNTER) - count) < target )
-        continue;
-
-    return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+    return calibrate_tsc(pts);
 }
 
 static void resume_hpet(struct platform_timesource *pts)
@@ -508,22 +539,12 @@ static u64 read_pmtimer_count(void)
 
 static s64 __init init_pmtimer(struct platform_timesource *pts)
 {
-    uint64_t start;
-    uint32_t count, target, mask, elapsed;
-
     if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
         return 0;
 
     pts->counter_bits = pmtmr_width;
-    mask = 0xffffffff >> (32 - pmtmr_width);
-
-    count = inl(pmtmr_ioport);
-    start = rdtsc_ordered();
-    target = CALIBRATE_VALUE(ACPI_PM_FREQUENCY);
-    while ( (elapsed = (inl(pmtmr_ioport) - count) & mask) < target )
-        continue;
 
-    return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+    return calibrate_tsc(pts);
}
 
 static struct platform_timesource __initdata plt_pmtimer =
[PATCH v2 0/4] x86: further improve timer freq calibration accuracy
... plus some tidying (or so I hope). Only the 1st patch was submitted
so far (i.e. as v1), all others are new.

1: time: further improve TSC / CPU freq calibration accuracy
2: APIC: calibrate against platform timer when possible
3: APIC: skip unnecessary parts of __setup_APIC_LVTT()
4: APIC: make connections between seemingly arbitrary numbers

Jan