RE: [PATCH 1/1 v9] use crc32 instead of md5 for hibernation e820 integrity check
> From: Chris von Recklinghausen
> Sent: Friday, April 16, 2021 6:17 AM
> ...
> Hibernation fails on a system in fips mode because md5 is used for the e820
> integrity check and is not available. Use crc32 instead.
>
> The check is intended to detect whether the E820 memory map provided
> by the firmware after cold boot unexpectedly differs from the one that
> was in use when the hibernation image was created. In this case, the
> hibernation image cannot be restored, as it may cover memory regions
> that are no longer available to the OS.
>
> A non-cryptographic checksum such as CRC-32 is sufficient to detect such
> inadvertent deviations.
>
> Fixes: 62a03defeabd ("PM / hibernate: Verify the consistent of e820 memory
> map by md5 digest")
>
> Signed-off-by: Chris von Recklinghausen
> ---

Tested-by: Dexuan Cui
Reviewed-by: Dexuan Cui

Thanks Chris and all for the patch!
[PATCH v8 net-next 1/2] hv_netvsc: Make netvsc/VF binding check both MAC and serial number
Currently the netvsc/VF binding logic only checks the PCI serial number.

The upcoming Microsoft Azure Network Adapter (MANA) supports multiple
net_device interfaces (each such interface is called a "vPort", and has
its unique MAC address) which are backed by the same VF PCI device, so
the binding logic should check both the MAC address and the PCI serial
number.

The change should not break any other existing VF drivers, because the
Hyper-V NIC SR-IOV implementation requires the netvsc network interface
and the VF network interface have the same MAC address.

Co-developed-by: Haiyang Zhang
Signed-off-by: Haiyang Zhang
Co-developed-by: Shachar Raindel
Signed-off-by: Shachar Raindel
Signed-off-by: Dexuan Cui
---
 drivers/net/hyperv/netvsc_drv.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 7349a70af083..f682a5572d84 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2297,6 +2297,7 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 {
 	struct device *parent = vf_netdev->dev.parent;
 	struct net_device_context *ndev_ctx;
+	struct net_device *ndev;
 	struct pci_dev *pdev;
 	u32 serial;
 
@@ -2319,8 +2320,17 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 		if (!ndev_ctx->vf_alloc)
 			continue;
 
-		if (ndev_ctx->vf_serial == serial)
-			return hv_get_drvdata(ndev_ctx->device_ctx);
+		if (ndev_ctx->vf_serial != serial)
+			continue;
+
+		ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+		if (ndev->addr_len != vf_netdev->addr_len ||
+		    memcmp(ndev->perm_addr, vf_netdev->perm_addr,
+			   ndev->addr_len) != 0)
+			continue;
+
+		return ndev;
+
 	}
 
 	netdev_notice(vf_netdev,
-- 
2.25.1
[PATCH v8 net-next 0/2] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
The patchset adds the VF driver for Microsoft Azure Network Adapter
(MANA), and also changes the hv_netvsc driver's netvsc/VF binding logic
to check both the MAC address and the serial number (this is required
by the MANA VF driver).

v7 contains both the netvsc change and the VF driver. This version (v8)
posts them in 2 separate patches, as suggested by Stephen Hemminger.

Please refer to "[PATCH v8 net-next 2/2]" for the history of v1~v7.

Thanks,
Dexuan

Dexuan Cui (2):
  hv_netvsc: Make netvsc/VF binding check both MAC and serial number
  net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

 MAINTAINERS                                   |    4 +-
 drivers/net/ethernet/Kconfig                  |    1 +
 drivers/net/ethernet/Makefile                 |    1 +
 drivers/net/ethernet/microsoft/Kconfig        |   29 +
 drivers/net/ethernet/microsoft/Makefile       |    5 +
 drivers/net/ethernet/microsoft/mana/Makefile  |    6 +
 drivers/net/ethernet/microsoft/mana/gdma.h    |  673 ++
 .../net/ethernet/microsoft/mana/gdma_main.c   | 1415 ++++++++
 .../net/ethernet/microsoft/mana/hw_channel.c  |  843 ++++
 .../net/ethernet/microsoft/mana/hw_channel.h  |  190 ++
 drivers/net/ethernet/microsoft/mana/mana.h    |  533 +
 drivers/net/ethernet/microsoft/mana/mana_en.c | 1895 +
 .../ethernet/microsoft/mana/mana_ethtool.c    |  250 +++
 .../net/ethernet/microsoft/mana/shm_channel.c |  291 +++
 .../net/ethernet/microsoft/mana/shm_channel.h |   21 +
 drivers/net/hyperv/netvsc_drv.c               |   14 +-
 16 files changed, 6168 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/microsoft/Kconfig
 create mode 100644 drivers/net/ethernet/microsoft/Makefile
 create mode 100644 drivers/net/ethernet/microsoft/mana/Makefile
 create mode 100644 drivers/net/ethernet/microsoft/mana/gdma.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/gdma_main.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana_en.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana_ethtool.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.h
-- 
2.25.1
RE: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Stephen Hemminger
> Sent: Friday, April 16, 2021 11:09 AM
> ...
> On Fri, 16 Apr 2021 17:58:45 +0000 Dexuan Cui wrote:
> > > > This probably should be a separate patch.
> > > > I think it is trying to address the case of VF discovery in
> > > > Hyper-V/Azure where the reported VF from Hypervisor is bogus or
> > > > confused.
> > >
> > > This is for the Multi vPorts feature of MANA driver, which allows
> > > one VF to create multiple vPorts (NICs). They have the same PCI
> > > device and same VF serial number, but different MACs.
> > >
> > > So we put the change in one patch to avoid distro vendors missing
> > > this change when backporting the MANA driver.
> > >
> > > Thanks,
> > > - Haiyang
> >
> > The netvsc change should come together in the same patch with this VF
> > driver, otherwise the multi-vPorts functionality doesn't work properly.
> >
> > The netvsc change should not break any other existing VF drivers,
> > because the Hyper-V NIC SR-IOV implementation requires that the NetVSC
> > network interface and the VF network interface have the same MAC
> > address, otherwise things won't work.
> >
> > Thanks,
> > Dexuan
>
> Distro vendors should be able to handle a patch series.
> Don't see why this could not be two patch series.

Ok. Will split this into 2 patches (the first one is the netvsc change,
and the second is the Linux VF driver) and post v8 shortly.
RE: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Haiyang Zhang
> Sent: Friday, April 16, 2021 10:11 AM
> > From: Stephen Hemminger
> > ...
> > > @@ -2319,8 +2320,17 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
> > >  		if (!ndev_ctx->vf_alloc)
> > >  			continue;
> > >
> > > -		if (ndev_ctx->vf_serial == serial)
> > > -			return hv_get_drvdata(ndev_ctx->device_ctx);
> > > +		if (ndev_ctx->vf_serial != serial)
> > > +			continue;
> > > +
> > > +		ndev = hv_get_drvdata(ndev_ctx->device_ctx);
> > > +		if (ndev->addr_len != vf_netdev->addr_len ||
> > > +		    memcmp(ndev->perm_addr, vf_netdev->perm_addr,
> > > +			   ndev->addr_len) != 0)
> > > +			continue;
> > > +
> > > +		return ndev;
> > > 	}
> > >
> > > 	netdev_notice(vf_netdev,
> >
> > This probably should be a separate patch.
> > I think it is trying to address the case of VF discovery in
> > Hyper-V/Azure where the reported VF from Hypervisor is bogus or
> > confused.
>
> This is for the Multi vPorts feature of MANA driver, which allows one
> VF to create multiple vPorts (NICs). They have the same PCI device and
> same VF serial number, but different MACs.
>
> So we put the change in one patch to avoid distro vendors missing this
> change when backporting the MANA driver.
>
> Thanks,
> - Haiyang

The netvsc change should come together in the same patch with this VF
driver, otherwise the multi-vPorts functionality doesn't work properly.

The netvsc change should not break any other existing VF drivers,
because the Hyper-V NIC SR-IOV implementation requires that the NetVSC
network interface and the VF network interface have the same MAC
address, otherwise things won't work.

Thanks,
Dexuan
RE: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Stephen Hemminger
> Sent: Thursday, April 15, 2021 2:15 PM
> ...
> > +	netif_carrier_off(ndev);
> > +
> > +	get_random_bytes(apc->hashkey, MANA_HASH_KEY_SIZE);
>
> Current practice for network drivers is to use netdev_rss_key_fill() for this.

Will change to netdev_rss_key_fill(). Thanks!
RE: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Jakub Kicinski
> Sent: Thursday, April 15, 2021 10:44 AM
> ...
> On Wed, 14 Apr 2021 22:45:19 -0700 Dexuan Cui wrote:
> > +	buf = dma_alloc_coherent(gmi->dev, length, &dma_handle,
> > +				 GFP_KERNEL | __GFP_ZERO);
>
> No need for GFP_ZERO, dma_alloc_coherent() zeroes the memory these days.

Yes, indeed. Will remove __GFP_ZERO.

> > +static int mana_gd_register_irq(struct gdma_queue *queue,
> > +				const struct gdma_queue_spec *spec)
> ...
> > +	struct gdma_irq_context *gic;
> > +
> > +	struct gdma_context *gc;
>
> Why the empty line?

No good reason. Will remove this line. I'll check the whole patch for
similar issues.

> > +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> > +	if (!queue)
> > +		return -ENOMEM;
> > +
> > +	gmi = &queue->mem_info;
> > +	err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
> > +	if (err)
> > +		return err;
>
> Leaks the memory from 'queue'?

Sorry. This should be a bug I introduced when moving around some code.

> Same code in mana_gd_create_mana_eq(), ...wq_cq(), etc.

Will fix all of them, and check the code for similar issues.

> > +int mana_do_attach(struct net_device *ndev, enum mana_attach_caller caller)
> > +{
> > +	struct mana_port_context *apc = netdev_priv(ndev);
> > +	struct gdma_dev *gd = apc->ac->gdma_dev;
> > +	u32 max_txq, max_rxq, max_queues;
> > +	int port_idx = apc->port_idx;
> > +	u32 num_indirect_entries;
> > +	int err;
> > +
> > +	if (caller == MANA_OPEN)
> > +		goto start_open;
> > +
> > +	err = mana_init_port_context(apc);
> > +	if (err)
> > +		return err;
> > +
> > +	err = mana_query_vport_cfg(apc, port_idx, &max_txq, &max_rxq,
> > +				   &num_indirect_entries);
> > +	if (err) {
> > +		netdev_err(ndev, "Failed to query info for vPort 0\n");
> > +		goto reset_apc;
> > +	}
> > +
> > +	max_queues = min_t(u32, max_txq, max_rxq);
> > +	if (apc->max_queues > max_queues)
> > +		apc->max_queues = max_queues;
> > +
> > +	if (apc->num_queues > apc->max_queues)
> > +		apc->num_queues = apc->max_queues;
> > +
> > +	memcpy(ndev->dev_addr, apc->mac_addr, ETH_ALEN);
> > +
> > +	if (caller == MANA_PROBE)
> > +		return 0;
> > +
> > +start_open:
>
> Why keep this as a single function, there is no overlap between what's
> done for OPEN and PROBE, it seems.
>
> Similarly detach should probably be split into clearly distinct parts.

Will improve the code. Thanks for the suggestion!

Thanks,
Dexuan
RE: [PATCH v5 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Jakub Kicinski
> Sent: Tuesday, April 13, 2021 12:03 PM
>
> On Mon, 12 Apr 2021 19:35:09 -0700 Dexuan Cui wrote:
> > +	apc->port_st_save = apc->port_is_up;
> > +	apc->port_is_up = false;
> > +	apc->start_remove = true;
> > +
> > +	/* Ensure port state updated before txq state */
> > +	smp_wmb();
> > +
> > +	netif_tx_disable(ndev);
>
> In your napi poll method there is no barrier between port_is_up check
> and netif_tx_queue_stopped().

Thanks for spotting this! We'll make the below change:

--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -938,16 +938,19 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	avail_space = mana_gd_wq_avail_space(gdma_wq);
 
 	/* Ensure tail updated before checking q stop */
 	smp_mb();
 
 	net_txq = txq->net_txq;
 	txq_stopped = netif_tx_queue_stopped(net_txq);
 
+	/* Ensure checking txq_stopped before apc->port_is_up. */
+	smp_rmb();
+
 	if (txq_stopped && apc->port_is_up && avail_space >= MAX_TX_WQE_SIZE) {
 		netif_tx_wake_queue(net_txq);
 		apc->eth_stats.wake_queue++;
 	}
 
 	if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
 		WARN_ON_ONCE(1);
 }

> > +	netif_carrier_off(ndev);
> > +
> > +	/* No packet can be transmitted now since apc->port_is_up is false.
> > +	 * There is still a tiny chance that mana_poll_tx_cq() can re-enable
> > +	 * a txq because it may not timely see apc->port_is_up being cleared
> > +	 * to false, but it doesn't matter since mana_start_xmit() drops any
> > +	 * new packets due to apc->port_is_up being false.
> > +	 *
> > +	 * Drain all the in-flight TX packets
> > +	 */
> > +	for (i = 0; i < apc->num_queues; i++) {
> > +		txq = &apc->tx_qp[i].txq;
> > +
> > +		while (atomic_read(&txq->pending_sends) > 0)
> > +			usleep_range(1000, 2000);
> > +	}

> > +	/* All cleanup actions should stay after rtnl_lock(), otherwise
> > +	 * other functions may access partially cleaned up data.
> > +	 */
> > +	rtnl_lock();
> > +
> > +	mana_detach(ndev);
> > +
> > +	unregister_netdevice(ndev);
> > +
> > +	rtnl_unlock();
>
> I find the resource management somewhat strange. Why is mana_attach()
> and mana_detach() called at probe/remove time, and not when the
> interface is brought up? Presumably when the user ifdowns the interface
> there is no point holding the resources? Your open/close methods are
> rather empty.

Thanks for the suggestion! Will move the functions to open/close().

> > +	if ((eq_addr & PAGE_MASK) != eq_addr)
> > +		return -EINVAL;
> > +
> > +	if ((cq_addr & PAGE_MASK) != cq_addr)
> > +		return -EINVAL;
> > +
> > +	if ((rq_addr & PAGE_MASK) != rq_addr)
> > +		return -EINVAL;
> > +
> > +	if ((sq_addr & PAGE_MASK) != sq_addr)
> > +		return -EINVAL;

Will change to PAGE_ALIGNED().

Thanks,
Dexuan
RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Jakub Kicinski
> Sent: Monday, April 12, 2021 11:21 AM
> ...
> On Sun, 11 Apr 2021 19:34:55 -0700 Dexuan Cui wrote:
> > +	for (i = 0; i < ANA_INDIRECT_TABLE_SIZE; i++)
> > +		apc->indir_table[i] = i % apc->num_queues;
>
> ethtool_rxfh_indir_default()

Will use ethtool_rxfh_indir_default().

> > +	err = mana_cfg_vport_steering(apc, rx, true, update_hash, update_tab);
> > +	return err;
>
> 	return mana_...
>
> please fix everywhere.

Will fix this one, and will review if there is any similar issue.

> > +	netif_set_real_num_tx_queues(ndev, apc->num_queues);
> > +
> > +	err = mana_add_rx_queues(apc, ndev);
> > +	if (err)
> > +		goto destroy_vport;
> > +
> > +	apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
> > +
> > +	netif_set_real_num_rx_queues(ndev, apc->num_queues);
>
> netif_set_real_num_.. can fail.

Will fix the error handling.

> > +	rtnl_lock();
> > +
> > +	netdev_lockdep_set_classes(ndev);
> > +
> > +	ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
> > +	ndev->hw_features |= NETIF_F_RXCSUM;
> > +	ndev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
> > +	ndev->hw_features |= NETIF_F_RXHASH;
> > +	ndev->features = ndev->hw_features;
> > +	ndev->vlan_features = 0;
> > +
> > +	err = register_netdevice(ndev);
> > +	if (err) {
> > +		netdev_err(ndev, "Unable to register netdev.\n");
> > +		goto destroy_vport;
> > +	}
> > +
> > +	rtnl_unlock();
> > +
> > +	return 0;
> > +
> > +destroy_vport:
> > +	rtnl_unlock();
>
> Why do you take rtnl_lock() explicitly around this code?

It looks like there is no good reason, and I guess we just copied the
code from netvsc_probe(), where the RTNL lock is indeed explicitly
needed. Will change to directly use register_netdev(), which takes and
releases the RTNL lock automatically.

> > +static int mana_set_channels(struct net_device *ndev,
> > +			     struct ethtool_channels *channels)
> > +{
> > +	struct ana_port_context *apc = netdev_priv(ndev);
> > +	unsigned int new_count;
> > +	unsigned int old_count;
> > +	int err, err2;
> > +
> > +	new_count = channels->combined_count;
> > +	old_count = apc->num_queues;
> > +
> > +	if (new_count < 1 || new_count > apc->max_queues ||
> > +	    channels->rx_count || channels->tx_count || channels->other_count)
>
> All these checks should be done by the core already.
>
> > +		return -EINVAL;
> > +
> > +	if (new_count == old_count)
> > +		return 0;
>
> And so is this one.

Will change the code to avoid unnecessary checking.

Thanks,
Dexuan
RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Haiyang Zhang
> Sent: Monday, April 12, 2021 7:40 AM
> > From: Andrew Lunn
> > Sent: Monday, April 12, 2021 8:32 AM
> > ...
> > > +	/* At most num_online_cpus() + 1 interrupts are used. */
> > > +	msix_index = queue->eq.msix_index;
> > > +	if (WARN_ON(msix_index > num_online_cpus()))
> > > +		return;
> >
> > Do you handle hot{un}plug of CPUs?
>
> We don't have hot{un}plug of CPU feature yet.

Hyper-V doesn't support vCPU hot plug/unplug for VMs running on it, but
potentially the VM is able to offline and online its virtual CPUs,
though this is not a typical usage at all in a production system: to
offline a vCPU, the VM also needs to unbind all the para-virtualized
devices' vmbus channels from that vCPU, which is practically undoable
in a VM that has fewer than about 32 vCPUs, because typically every
vCPU is bound to one of the vmbus channels of the NetVSC device(s) and
StorVSC device(s), and can't (easily) be unbound.

So I think the driver does need to handle CPU online/offlining
properly, but I think we can do that in some future patch, because IMO
that would need more careful consideration. IMO here the WARN_ON() is
good as it at least lets us know something unexpected (if any) happens.

> > > +static void mana_hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
> > > +					struct gdma_event *event)
> > > +{
> > > +	struct hw_channel_context *hwc = ctx;
> > > +	struct gdma_dev *gd = hwc->gdma_dev;
> > > +	union hwc_init_type_data type_data;
> > > +	union hwc_init_eq_id_db eq_db;
> > > +	u32 type, val;
> > > +
> > > +	switch (event->type) {
> > > +	case GDMA_EQE_HWC_INIT_EQ_ID_DB:
> > > +		eq_db.as_uint32 = event->details[0];
> > > +		hwc->cq->gdma_eq->id = eq_db.eq_id;
> > > +		gd->doorbell = eq_db.doorbell;
> > > +		break;
> > > +
> > > +	case GDMA_EQE_HWC_INIT_DATA:
> > > +
> > > +		type_data.as_uint32 = event->details[0];
> > > +
> > > +	case GDMA_EQE_HWC_INIT_DONE:
> > > +		complete(&hwc->hwc_init_eqe_comp);
> > > +		break;
> >
> > ...
> >
> > > +	default:
> > > +		WARN_ON(1);
> > > +		break;
> > > +	}
> >
> > Are these events from the firmware? If you have newer firmware with an
> > older driver, are you going to spam the kernel log with WARN_ON dumps?
>
> For protocol version mismatch, the host and guest will either negotiate
> the highest common version, or fail to probe. So this kind of warnings
> are not expected.

I agree, but I think we'd better remove the WARN_ON(1), which was
mainly for debugging purposes, and was added just in case.

Thanks,
Dexuan
RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Leon Romanovsky
> Sent: Monday, April 12, 2021 12:46 AM
> To: Dexuan Cui
> ...
> > +#define ANA_MAJOR_VERSION	0
> > +#define ANA_MINOR_VERSION	1
> > +#define ANA_MICRO_VERSION	1
>
> Please don't introduce driver versions.

This is not the usual "driver version", though it's called "drv
version" :-) As you can see, the driver does not use the macro
MODULE_VERSION().

Here the "drv version" actually means the version of the VF-to-PF
protocol, with which the Azure Network Adapter ethernet NIC driver
(i.e. the VF driver) talks to the PF driver. The protocol version
determines the formats of the messages that are sent from the VF driver
to the PF driver, e.g. query the MAC address, create Send/Receive
queues, configure RSS, etc. Currently the protocol version is 0.1.1.

You may ask why it's called "drv version" rather than "protocol
version" -- it's because the PF driver calls it that way, so I think
here the VF driver may as well use the same name.

BTW, the "drv ver" info is passed to the PF driver in the below
function:

static int mana_query_client_cfg(struct ana_context *ac, u32 drv_major_ver,
				 u32 drv_minor_ver, u32 drv_micro_ver,
				 u16 *max_num_vports)
{
	struct gdma_context *gc = ac->gdma_dev->gdma_context;
	struct ana_query_client_cfg_resp resp = {};
	struct ana_query_client_cfg_req req = {};
	struct device *dev = gc->dev;
	int err = 0;

	mana_gd_init_req_hdr(&req.hdr, ANA_QUERY_CLIENT_CONFIG,
			     sizeof(req), sizeof(resp));

	req.drv_major_ver = drv_major_ver;
	req.drv_minor_ver = drv_minor_ver;
	req.drv_micro_ver = drv_micro_ver;

	err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));

Thanks,
Dexuan
RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Andrew Lunn
> Sent: Thursday, April 8, 2021 5:30 PM
> To: Stephen Hemminger
> ...
> > Linux kernel doesn't do namespaces in the code, so every new driver
> > needs to worry about global symbols clashing.
>
> This driver is called mana, yet the code uses ana. It would be good to
> resolve this inconsistency as well. Ideally, you want to prefix
> everything with ana_ or mana_, depending on what you choose, so we
> have a clean namespace.
>
>	Andrew

Thanks for the suggestion! Let me think about this and work out a
solution.
RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: David Miller
> Sent: Thursday, April 8, 2021 5:41 PM
> ...
> > In the driver code, all the structs/unions marked by __packed are used
> > to talk with the hardware, so I think __packed is necessary here?
>
> It actually isn't in many cases; check with and without the __packed
> directive and see if anything changes.

Will do.

> > Do you think if it's better if we remove all the __packed, and add
> > static_assert(sizeof(struct XXX) == YYY) instead? e.g.
> >
> > @@ -105,7 +105,8 @@ struct gdma_msg_hdr {
> >  	u16 msg_version;
> >  	u16 hwc_msg_id;
> >  	u32 msg_size;
> > -} __packed;
> > +};
> > +static_assert(sizeof(struct gdma_msg_hdr) == 16);
>
> This won't make sure the structure member offsets are what you expect.
>
> I think you'll have to go through the structures one-by-one by hand to
> figure out which ones really require the __packed attribute and which
> do not.

Got it. Let me see if I can remove all the __packed.
RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: David Miller
> Sent: Thursday, April 8, 2021 4:46 PM
> ...
> > +struct gdma_msg_hdr {
> > +	u32 hdr_type;
> > +	u32 msg_type;
> > +	u16 msg_version;
> > +	u16 hwc_msg_id;
> > +	u32 msg_size;
> > +} __packed;
> > +
> > +struct gdma_dev_id {
> > +	union {
> > +		struct {
> > +			u16 type;
> > +			u16 instance;
> > +		};
> > +
> > +		u32 as_uint32;
> > +	};
> > +} __packed;
>
> Please don't use __packed unless absolutely necessary. It generates
> suboptimal code (byte at a time accesses etc.) and for many of these
> you don't even need it.

In the driver code, all the structs/unions marked by __packed are used
to talk with the hardware, so I think __packed is necessary here?

Do you think if it's better if we remove all the __packed, and add
static_assert(sizeof(struct XXX) == YYY) instead? e.g.

@@ -105,7 +105,8 @@ struct gdma_msg_hdr {
 	u16 msg_version;
 	u16 hwc_msg_id;
 	u32 msg_size;
-} __packed;
+};
+static_assert(sizeof(struct gdma_msg_hdr) == 16);

 struct gdma_dev_id {
 	union {

Now I'm trying to figure out how other NIC drivers define structs/unions.

Thanks,
Dexuan
RE: [PATCH v2 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Stephen Hemminger
> Sent: Thursday, April 8, 2021 9:52 AM

Thanks all for your input! We'll make the below changes as suggested:
  Microsoft Azure Network Device ==> Microsoft Network Devices
  Drop the default m
  validated ==> supported

We'll also fix some warnings reported by "kernel test robot".

Will post v3 later today.

Thanks,
Dexuan

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 12ef6b581566..e1ac0a5d808d 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -3,26 +3,25 @@
 #
 
 config NET_VENDOR_MICROSOFT
-	bool "Microsoft Azure Network Device"
+	bool "Microsoft Network Devices"
 	default y
 	help
 	  If you have a network (Ethernet) device belonging to this class,
 	  say Y.
 
 	  Note that the answer to this question doesn't directly affect the
 	  kernel: saying N will just cause the configurator to skip the
-	  question about Microsoft Azure network device. If you say Y, you
-	  will be asked for your specific device in the following question.
+	  question about Microsoft network devices. If you say Y, you will be
+	  asked for your specific device in the following question.
 
 if NET_VENDOR_MICROSOFT
 
 config MICROSOFT_MANA
 	tristate "Microsoft Azure Network Adapter (MANA) support"
-	default m
 	depends on PCI_MSI && X86_64
 	select PCI_HYPERV
 	help
 	  This driver supports Microsoft Azure Network Adapter (MANA).
-	  So far, the driver is only validated on X86_64.
+	  So far, the driver is only supported on X86_64.
 
 	  To compile this driver as a module, choose M here.
 	  The module will be called mana.
RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Leon Romanovsky
> Sent: Wednesday, April 7, 2021 5:45 AM
> > >
> > > BTW, you don't need to write { 0 }, the {} is enough.
> >
> > Thanks for the suggestion! I'll use {0} in v2.
>
> You missed the point: "{ 0 }" should change to "{}", without the 0.

Got it. Will make the suggested change.

FWIW, {0} and { 0 } are still widely used, but it looks like {} is
indeed preferred:

$ grep "= {};" drivers/net/ -nr | wc -l
829
$ grep "= {0};" drivers/net/ -nr | wc -l
708
$ grep "= {};" kernel/ -nr | wc -l
29
$ grep "= {0};" kernel/ -nr | wc -l
4
RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Leon Romanovsky
> Sent: Wednesday, April 7, 2021 1:10 AM
>
> <...>
>
> > +int gdma_verify_vf_version(struct pci_dev *pdev)
> > +{
> > +	struct gdma_context *gc = pci_get_drvdata(pdev);
> > +	struct gdma_verify_ver_req req = { 0 };
> > +	struct gdma_verify_ver_resp resp = { 0 };
> > +	int err;
> > +
> > +	gdma_init_req_hdr(&req.hdr, GDMA_VERIFY_VF_DRIVER_VERSION,
> > +			  sizeof(req), sizeof(resp));
> > +
> > +	req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> > +	req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> > +
> > +	err = gdma_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > +	if (err || resp.hdr.status) {
> > +		pr_err("VfVerifyVersionOutput: %d, status=0x%x\n", err,
> > +		       resp.hdr.status);
> > +		return -EPROTO;
> > +	}
> > +
> > +	return 0;
> > +}
>
> <...>
>
> > +	err = gdma_verify_vf_version(pdev);
> > +	if (err)
> > +		goto remove_irq;
>
> Will this VF driver be used in the guest VM? What will prevent users
> from changing it? I think that such a version negotiation scheme is
> not allowed.

Yes, the VF driver is expected to run in a Linux VM that runs on Azure.

Currently gdma_verify_vf_version() just tells the PF driver that the VF
driver is only able to support GDMA_PROTOCOL_V1, and wants to use
GDMA_PROTOCOL_V1's message formats to talk to the PF driver later.

enum {
	GDMA_PROTOCOL_UNDEFINED	= 0,
	GDMA_PROTOCOL_V1	= 1,
	GDMA_PROTOCOL_FIRST	= GDMA_PROTOCOL_V1,
	GDMA_PROTOCOL_LAST	= GDMA_PROTOCOL_V1,
	GDMA_PROTOCOL_VALUE_MAX
};

The PF driver is supposed to always support GDMA_PROTOCOL_V1, so I
expect here gdma_verify_vf_version() should succeed. If a user changes
the Linux VF driver and tries to use a protocol version not supported
by the PF driver, then gdma_verify_vf_version() will fail; later, if
the VF driver tries to talk to the PF driver using an unsupported
message format, the PF driver will return a failure.

Thanks,
-- Dexuan
RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Leon Romanovsky
> Sent: Wednesday, April 7, 2021 1:15 AM
> ...
> > int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
> > {
> > 	struct gdma_generate_test_event_req req = { 0 };
> > 	struct gdma_general_resp resp = { 0 };
>
> BTW, you don't need to write { 0 }, the {} is enough.

Thanks for the suggestion! I'll use {0} in v2.

BTW, looks like both are widely used in the kernel. Maybe someone
should update scripts/checkpatch.pl to add a warning against { 0 } in
new code, if {0} is preferred. :-)

dexuan@localhost:~/linux$ grep "= { 0 };" kernel/ -nr | wc -l
4
dexuan@localhost:~/linux$ grep "= {0};" kernel/ -nr | wc -l
4
dexuan@localhost:~/linux$ grep "= { 0 };" Documentation/ -nr
Documentation/networking/page_pool.rst:117:struct page_pool_params pp_params = { 0 };
dexuan@localhost:~/linux$ grep "= { 0 };" drivers/ -nr | wc -l
1085
dexuan@localhost:~/linux$ grep "= {0};" drivers/ -nr | wc -l
1321
dexuan@localhost:~/linux$ grep "= { 0 };" drivers/net/ -nr | wc -l
408
dexuan@localhost:~/linux$ grep "= {0};" drivers/net/ -nr | wc -l
708
RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: kernel test robot
> Sent: Tuesday, April 6, 2021 6:31 PM
> ...
> Hi Dexuan,
> I love your patch! Perhaps something to improve:
>
> All warnings (new ones prefixed by >>):
>
>    drivers/pci/controller/pci-hyperv.c: In function 'hv_irq_unmask':
>    drivers/pci/controller/pci-hyperv.c:1220:2: error: implicit declaration
>    of function 'hv_set_msi_entry_from_desc'
>    [-Werror=implicit-function-declaration]
>     1220 |  hv_set_msi_entry_from_desc(&params->int_entry.msi_entry, msi_desc);

This build error looks strange, because the patch doesn't even touch
the drivers/pci/controller/pci-hyperv.c driver.

I'll try to repro the build failure, and the other 2 failures reported
in 2 separate emails, and figure out what's happening.

PS, I tested building the patch in a fresh Ubuntu 20.04 VM and it was
successful.

Thanks,
-- Dexuan
RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
> From: Andrew Lunn
> Sent: Tuesday, April 6, 2021 6:08 PM
> To: Dexuan Cui
>
> > +static int gdma_query_max_resources(struct pci_dev *pdev)
> > +{
> > +	struct gdma_context *gc = pci_get_drvdata(pdev);
> > +	struct gdma_general_req req = { 0 };
> > +	struct gdma_query_max_resources_resp resp = { 0 };
> > +	int err;
>
> Network drivers need to use reverse christmas tree. I spotted a number
> of functions getting this wrong. Please review the whole driver.

Hi Andrew, thanks for the quick comments!

I think in general the patch follows the reverse christmas tree style.
For the function you pointed out, I didn't follow the reverse christmas
tree style because I wanted to keep the variable definitions more
natural, i.e. first define 'req' and then 'resp'. I can swap the 2
lines here, i.e. define 'resp' first, but this looks a little strange
to me, because in some other functions the 'req' should be defined
first, e.g.

int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
{
	struct gdma_generate_test_event_req req = { 0 };
	struct gdma_general_resp resp = { 0 };

And some variable definitions can not follow the reverse christmas tree
style due to dependency, e.g.

static void hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
				   struct gdma_event *event)
{
	struct hw_channel_context *hwc = ctx;
	struct gdma_dev *gd = hwc->gdma_dev;
	struct gdma_context *gc = gdma_dev_to_context(gd);

I failed to find the reverse christmas tree rule in the Documentation/
folder. I knew the rule and I thought it was documented there, but it
looks like it's not. So my understanding is that in general we should
follow the style, but there can be exceptions due to dependencies or
logical grouping.

> > +	gdma_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
> > +			  sizeof(req), sizeof(resp));
> > +
> > +	err = gdma_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > +	if (err || resp.hdr.status) {
> > +		pr_err("%s, line %d: err=%d, err=0x%x\n", __func__, __LINE__,
> > +		       err, resp.hdr.status);
>
> I expect checkpatch complained about this. Don't use pr_err(); use
> dev_err(pdev->dev, ...) or netdev_err(ndev, ...). You should always
> have access to dev or ndev, so please change all pr_ calls.

I did run scripts/checkpatch.pl and it reported no error/warning. :-)
But I think you're correct. I'll change the pr_* calls to dev_* or
netdev_*.

> > +static unsigned int num_queues = ANA_DEFAULT_NUM_QUEUE;
> > +module_param(num_queues, uint, 0444);
>
> No module parameters please.
>
>	Andrew

This parameter was mostly for debugging purposes. I can remove it.

BTW, I do remember some maintainers ask people to not add module
parameters unless that's absolutely necessary, but I don't remember if
the rule applies to only the net subsystem or to the whole drivers/
folder. It looks like the rule was also not documented in the
Documentation/ folder? Please let me know if such a rule is explicitly
defined somewhere.

Thanks,
-- Dexuan
RE: [PATCH v2 1/1] use crc32 instead of md5 for hibernation e820 integrity check
> From: Chris von Recklinghausen > Sent: Thursday, April 1, 2021 9:42 AM > To: a...@kernel.org; s...@redhat.com; raf...@kernel.org; Dexuan Cui > ; linux...@vger.kernel.org; > linux-cry...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: [PATCH v2 1/1] use crc32 instead of md5 for hibernation e820 > integrity > check > > Suspend fails on a system in fips mode because md5 is used for the e820 > integrity check and is not available. Use crc32 instead. > > Fixes: 62a03defeabd ("PM / hibernate: Verify the consistent of e820 memory > map >by md5 digest") > Signed-off-by: Chris von Recklinghausen > --- > arch/x86/power/hibernate.c | 35 +++ > 1 file changed, 19 insertions(+), 16 deletions(-) Thanks Chris and all! This patch looks good to me. Tested-by: Dexuan Cui Reviewed-by: Dexuan Cui
RE: v5.12.0-rc5: the kernel panics if FIPS mode is on
> From: Eric Biggers > Sent: Monday, March 29, 2021 6:26 PM > ... > It looks like your userspace is using tcrypt.ko to request that the kernel > test > "ofb(aes)", but your kernel doesn't have CONFIG_CRYPTO_OFB enabled so the > test fails as expected. Hi Eric, Thanks for the explanation! Yes, that's it! Sorry for the false alarm! Actually the kernel is faultless here. > Are you sure that anything changed on the kernel side > besides the kconfig you are using? It looks like this was always the behavior > when tcrypt.ko is used to test a non-existing algorithm. After I rebuilt the kernel with the 3 options: CONFIG_CRYPTO_OFB=y CONFIG_CRYPTO_DEV_PADLOCK_AES=y CONFIG_CRYPTO_ANSI_CPRNG=y and generated the .hmac file: sha512hmac /boot/vmlinuz-5.12.0-rc5+ > /boot/.vmlinuz-5.12.0-rc5+.hmac now the kernel boots up successfully with fips=1. :-) > Is your userspace code intentionally trying to test "ofb(aes)", or is it > accidental? > > - Eric I'm not sure. This is a CentOS 8.3 VM, and I use the default configuration. I have been trying to build & run a v5.12.0-rc5+ kernel with fips=1, and now this is working for me, thanks to your explanation. Thanks again! Thanks, -- Dexuan
Fix hibernation in FIPS mode?
Hi, MD5 was marked as not FIPS-compliant in 2009: a3bef3a31a19 ("crypto: testmgr - Skip algs not flagged fips_allowed in fips mode") a1915d51e8e7 ("crypto: testmgr - Mark algs allowed in fips mode") But hibernation_e820_save() is still using MD5, and fails in FIPS mode due to the 2018 patch: 749fa17093ff ("PM / hibernate: Check the success of generating md5 digest before hibernation") As a result, hibernation doesn't work when FIPS is on. Do you think hibernation_e820_save() should be changed to use a FIPS-compliant algorithm like SHA-1? PS, currently it looks like FIPS mode is broken in the mainline: https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg49414.html Thanks, -- Dexuan
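The key point of the thread — the e820 check only needs to detect inadvertent firmware memory-map changes, not resist an attacker, so a non-cryptographic checksum is enough — can be illustrated with a small sketch. This is illustrative Python, not the kernel code; the region tuples and helper names are invented for the example:

```python
import hashlib
import zlib

def e820_digest_md5(regions):
    # Hypothetical stand-in for the original check: MD5 over the raw
    # (start, size, type) fields of each e820 region. In FIPS mode the
    # md5 transform is unavailable, so this path fails in the kernel.
    h = hashlib.md5()
    for start, size, kind in regions:
        h.update(start.to_bytes(8, "little"))
        h.update(size.to_bytes(8, "little"))
        h.update(kind.to_bytes(4, "little"))
    return h.digest()

def e820_digest_crc32(regions):
    # Same idea with CRC-32: detects an accidentally changed map just as
    # well, and needs no cryptographic algorithm at all.
    crc = 0
    for start, size, kind in regions:
        crc = zlib.crc32(start.to_bytes(8, "little"), crc)
        crc = zlib.crc32(size.to_bytes(8, "little"), crc)
        crc = zlib.crc32(kind.to_bytes(4, "little"), crc)
    return crc

# A map recorded at hibernation time vs. one seen after cold boot,
# where the firmware shrank the second region.
boot_map = [(0x0, 0x9fc00, 1), (0x100000, 0x7ff00000, 1)]
resume_map = [(0x0, 0x9fc00, 1), (0x100000, 0x7fe00000, 1)]

print(e820_digest_crc32(boot_map) != e820_digest_crc32(resume_map))  # True
```

Either digest flags the mismatch and makes the kernel refuse to restore the image; only the CRC-32 variant keeps working when FIPS mode disables MD5.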
v5.12.0-rc5: the kernel panics if FIPS mode is on
Hi all, The v5.12.0-rc5 kernel (1e43c377a79f) panics with fips=1. Please refer to the below panic call-trace. The kernel config file and the full kernel messages are also attached. Is this a known issue? Thanks, -- Dexuan Starting dracut pre-udev hook... [7.260424] alg: self-tests for sha512-generic (sha512) passed [7.265917] alg: self-tests for sha384-generic (sha384) passed [7.272426] alg: self-tests for sha512-ssse3 (sha512) passed [7.276500] alg: self-tests for sha384-ssse3 (sha384) passed [7.281722] alg: self-tests for sha512-avx (sha512) passed [7.286579] alg: self-tests for sha384-avx (sha384) passed [7.291631] alg: self-tests for sha512-avx2 (sha512) passed [7.296950] alg: self-tests for sha384-avx2 (sha384) passed [7.321040] alg: self-tests for sha3-224-generic (sha3-224) passed [7.330291] alg: self-tests for sha3-256-generic (sha3-256) passed [7.335918] alg: self-tests for sha3-384-generic (sha3-384) passed [7.341508] alg: self-tests for sha3-512-generic (sha3-512) passed [7.381918] alg: self-tests for crc32c-intel (crc32c) passed [7.396694] alg: self-tests for crct10dif-pclmul (crct10dif) passed [7.453515] alg: self-tests for ghash-clmulni (ghash) passed [7.469558] alg: self-tests for des3_ede-asm (des3_ede) passed [7.475355] alg: self-tests for ecb-des3_ede-asm (ecb(des3_ede)) passed [7.481361] alg: self-tests for cbc-des3_ede-asm (cbc(des3_ede)) passed [7.488656] alg: self-tests for des3_ede-generic (des3_ede) passed [7.304930] dracut-pre-udev[502]: modprobe: ERROR: could not insert 'padlock_aes': No such device [7.579580] alg: No test for fips(ansi_cprng) (fips_ansi_cprng) [7.606547] alg: self-tests for sha1 (sha1) passed [7.610624] alg: self-tests for ecb(des3_ede) (ecb(des3_ede)) passed [7.615746] alg: self-tests for cbc(des3_ede) (cbc(des3_ede)) passed [7.638067] alg: self-tests for ctr(des3_ede-asm) (ctr(des3_ede)) passed [7.644781] alg: self-tests for ctr(des3_ede) (ctr(des3_ede)) passed [7.653810] alg: self-tests for sha256 (sha256) passed 
[7.658945] alg: self-tests for ecb(aes) (ecb(aes)) passed [7.663493] alg: self-tests for cbc(aes) (cbc(aes)) passed [7.668421] alg: self-tests for xts(aes) (xts(aes)) passed [7.672389] alg: self-tests for ctr(aes) (ctr(aes)) passed [7.692973] alg: self-tests for rfc3686(ctr-aes-aesni) (rfc3686(ctr(aes))) passed [7.699446] alg: self-tests for rfc3686(ctr(aes)) (rfc3686(ctr(aes))) passed [7.730149] alg: skcipher: failed to allocate transform for ofb(aes): -2 [7.735959] Kernel panic - not syncing: alg: self-tests for ofb(aes) (ofb(aes)) failed in fips mode! [7.736952] CPU: 13 PID: 560 Comm: modprobe Tainted: GW 5.12.0-rc5+ #3 [7.736952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [7.736952] Call Trace: [7.736952] dump_stack+0x64/0x7c [7.736952] panic+0xfb/0x2d7 [7.736952] alg_test+0x42d/0x460 [7.736952] ? __kernfs_new_node+0x175/0x1d0 [7.736952] do_test+0x3248/0x57ea [tcrypt] [7.736952] do_test+0x1f2c/0x57ea [tcrypt] [7.736952] ? 0xc031d000 [7.736952] tcrypt_mod_init+0x55/0x1000 [tcrypt] [7.736952] ? 0xc031d000 [7.736952] do_one_initcall+0x44/0x1d0 [7.736952] ? __cond_resched+0x15/0x30 [7.736952] ? kmem_cache_alloc_trace+0x3d/0x410 [7.736952] do_init_module+0x5a/0x230 [7.736952] load_module+0x1a5b/0x1bc0 [7.736952] ? 
__do_sys_finit_module+0xad/0x110 [7.736952] __do_sys_finit_module+0xad/0x110 [7.736952] do_syscall_64+0x33/0x40 [7.736952] entry_SYSCALL_64_after_hwframe+0x44/0xae [7.736952] RIP: 0033:0x7ff2e760978d [7.736952] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48 [7.736952] RSP: 002b:7ffd80204308 EFLAGS: 0246 ORIG_RAX: 0139 [7.736952] RAX: ffda RBX: 563dcfe8f030 RCX: 7ff2e760978d [7.736952] RDX: RSI: 563dcf41d7b6 RDI: 0003 [7.736952] RBP: 563dcf41d7b6 R08: R09: [7.736952] R10: 0003 R11: 0246 R12: [7.736952] R13: 563dcfe934c0 R14: 0004 R15: [7.736952] Kernel Offset: 0x1080 from 0x8100 (relocation range: 0x8000-0xbfff) [7.736952] ---[ end Kernel panic - not syncing: alg: self-tests for ofb(aes) (ofb(aes)) failed in fips mode! ]--- v5.12-rc5.kernel.config.txt.tar.gz [0.00] Linux version 5.12.0-rc5+ (root@decui-co83-fips) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #3 SMP Mon Mar 29 21:26:58 UTC 2021
RE: [PATCH net-next] hv_netvsc: Add a comment clarifying batching logic
> From: LKML haiyangz On Behalf Of Haiyang Zhang > Sent: Friday, March 12, 2021 3:45 PM > To: linux-hyp...@vger.kernel.org; net...@vger.kernel.org > Cc: Haiyang Zhang ; KY Srinivasan > ; Stephen Hemminger ; > o...@aepfle.de; vkuznets ; da...@davemloft.net; > linux-kernel@vger.kernel.org; Shachar Raindel > Subject: [PATCH net-next] hv_netvsc: Add a comment clarifying batching logic > > From: Shachar Raindel > > The batching logic in netvsc_send is non-trivial, due to > a combination of the Linux API and the underlying hypervisor > interface. Add a comment explaining why the code is written this > way. > > Signed-off-by: Shachar Raindel > Signed-off-by: Haiyang Zhang > --- Reviewed-by: Dexuan Cui
RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?
> From: James Morse > Sent: Tuesday, February 9, 2021 10:15 AM > To: Dexuan Cui > Cc: Rafael J. Wysocki ; linux-a...@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley > ; Leandro Pereira > Subject: Re: How can a userspace program tell if the system supports the ACPI > S4 state (Suspend-to-Disk)? > > Hi Dexuan, > > On 05/02/2021 19:36, Dexuan Cui wrote: > >> From: Rafael J. Wysocki > >> Sent: Friday, February 5, 2021 5:06 AM > >> To: Dexuan Cui > >> Cc: linux-a...@vger.kernel.org; linux-kernel@vger.kernel.org; > >> linux-hyp...@vger.kernel.org; Michael Kelley > >> Subject: Re: How can a userspace program tell if the system supports the > ACPI > >> S4 state (Suspend-to-Disk)? > >> > >> On Sat, Dec 12, 2020 at 2:22 AM Dexuan Cui wrote: > >>> > >>> Hi all, > >>> It looks like Linux can hibernate even if the system does not support the > ACPI > >>> S4 state, as long as the system can shut down, so "cat /sys/power/state" > >>> always contains "disk", unless we specify the kernel parameter > "nohibernate" > >>> or we use LOCKDOWN_HIBERNATION. > > >>> In some scenarios IMO it can still be useful if the userspace is able to > >>> detect if the ACPI S4 state is supported or not, e.g. when a Linux guest > >>> runs on Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator > >>> of the proper support of the tool stack on the host, i.e. the guest is > >>> discouraged from trying hibernation if the state is not supported. > > What goes wrong? This sounds like a funny way of signalling hypervisor policy. Hi James, Sorry if I have caused any confusion. My above description only applies to x86 Linux VM on Hyper-V. For ARM64 Hyper-V, I suspect the mainline Linux kernel still can't work in a Linux VM running on ARM64 Hyper-V, so AFAIK the VM hibernation hasn't been tested: it may work or may not work. This is in our To-Do list. 
Linux VM on Hyper-V needs to know if hibernation is supported/enabled for the VM, mainly for 2 reasons: 1. In the VM, the hv_balloon driver should disable the balloon up/down operations if hibernation is enabled for the VM, otherwise bad things can happen, e.g. the hv_balloon driver allocates some pages and gives the pages to the hypervisor; now if the VM is allowed to hibernate, later when the VM resumes back, the VM loses the pages forever, because Hyper-V doesn't save the info of the pages that were from the VM, so Hyper-V thinks no pages need to be returned to the VM. 2. If hibernation is enabled for a VM, the VM has a Linux agent program that automatically creates and sets up the swap file for hibernation. If hibernation is not enabled for the VM, the agent should not try to create and set up the swap file for hibernation. > >>> I know we can check the S4 state by 'dmesg': > >>> > >>> # dmesg |grep ACPI: | grep support > >>> [3.034134] ACPI: (supports S0 S4 S5) > >>> > >>> But this method is unreliable because the kernel msg buffer can be filled > >>> and overwritten. Is there any better method? If not, do you think if the > >>> below patch is appropriate? Thanks! > >> > >> Sorry for the delay. > >> > >> If ACPI S4 is supported, /sys/power/disk will list "platform" as one > >> of the options (and it will be the default one then). Otherwise, > >> "platform" is not present in /sys/power/disk, because ACPI is the only > >> user of hibernation_ops. > > > This works on x86. Thanks a lot! > > > > BTW, does this also work on ARM64? > > Not today. The S4/S5 stuff is part of 'ACPI_SYSTEM_POWER_STATES_SUPPORT', > which arm64 > doesn't enable as it has a firmware mechanism that covers this on both DT and > ACPI > systems. That code is what calls hibernation_set_ops() to enable ACPI's > platform mode. Thanks for the explanation! > Regardless: hibernate works fine. What does your hypervisor do that causes > problems? 
> (I think all we expect from firmware is it doesn't randomise the placement of > the ACPI > tables as they aren't necessarily part of the hibernate image) > > Thanks, > James I have explained the problems above for Linux VM on ARM64 Hyper-V. I suppose you mean hibernation works fine for ARM64 bare metal. For Linux VM on ARM64 Hyper-V, I suspect some Hyper-V specific states may have to be saved and restored for hibernation. This is in our To-Do list. Thanks, Dexuan
RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?
> From: Rafael J. Wysocki > Sent: Friday, February 5, 2021 5:06 AM > To: Dexuan Cui > Cc: linux-a...@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-hyp...@vger.kernel.org; Michael Kelley > Subject: Re: How can a userspace program tell if the system supports the ACPI > S4 state (Suspend-to-Disk)? > > On Sat, Dec 12, 2020 at 2:22 AM Dexuan Cui wrote: > > > > Hi all, > > It looks like Linux can hibernate even if the system does not support the > > ACPI > > S4 state, as long as the system can shut down, so "cat /sys/power/state" > > always contains "disk", unless we specify the kernel parameter "nohibernate" > > or we use LOCKDOWN_HIBERNATION. > > > > In some scenarios IMO it can still be useful if the userspace is able to > > detect > > if the ACPI S4 state is supported or not, e.g. when a Linux guest runs on > > Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the > > proper > > support of the tool stack on the host, i.e. the guest is discouraged from > > trying hibernation if the state is not supported. > > > > I know we can check the S4 state by 'dmesg': > > > > # dmesg |grep ACPI: | grep support > > [3.034134] ACPI: (supports S0 S4 S5) > > > > But this method is unreliable because the kernel msg buffer can be filled > > and overwritten. Is there any better method? If not, do you think if the > > below patch is appropriate? Thanks! > > Sorry for the delay. > > If ACPI S4 is supported, /sys/power/disk will list "platform" as one > of the options (and it will be the default one then). Otherwise, > "platform" is not present in /sys/power/disk, because ACPI is the only > user of hibernation_ops. > > HTH This works on x86. Thanks a lot! BTW, does this also work on ARM64? Thanks, -- Dexuan
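Rafael's suggestion translates into a simple userspace check. The following is a hedged Python sketch (the helper names are invented; /sys/power/disk is the real attribute, and the currently selected mode is shown there in square brackets):

```python
def acpi_s4_supported(disk_contents):
    # Parse the contents of /sys/power/disk. Per the reply above,
    # "platform" is listed only when ACPI S4 is supported; it appears
    # as "[platform]" when it is also the selected hibernation mode.
    modes = disk_contents.split()
    return "platform" in modes or "[platform]" in modes

def read_acpi_s4_supported(path="/sys/power/disk"):
    # Convenience wrapper for real systems; returns False if the file
    # is absent (e.g. kernel built without hibernation support).
    try:
        with open(path) as f:
            return acpi_s4_supported(f.read())
    except OSError:
        return False
```

As noted later in the thread, this works on x86 but not on ARM64, where the platform hibernation ops are not wired up through ACPI.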
RE: kdump always hangs in rcu_barrier() -> wait_for_completion()
> From: Paul E. McKenney > Sent: Thursday, November 26, 2020 3:55 PM > To: Dexuan Cui > Cc: boqun.f...@gmail.com; Ingo Molnar ; > r...@vger.kernel.org; vkuznets ; Michael Kelley > ; linux-kernel@vger.kernel.org > Subject: Re: kdump always hangs in rcu_barrier() -> wait_for_completion() > > On Thu, Nov 26, 2020 at 10:59:19PM +, Dexuan Cui wrote: > > > From: Paul E. McKenney > > > Sent: Thursday, November 26, 2020 1:42 PM > > > > > > > > Another possibility is that rcu_state.gp_kthread is non-NULL, but that > > > > > something else is preventing RCU grace periods from completing, but in > > > > > > > > It looks like somehow the scheduling is not working here: in > > > > rcu_barrier() > > > > , if I replace the wait_for_completion() with > > > > wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), > the > > > > issue persists. > > > > > > Have you tried using sysreq-t to see what the various tasks are doing? > > > > Will try it. > > > > BTW, this is a "Generation 2" VM on Hyper-V, meaning sysrq only starts to > > work after the Hyper-V para-virtualized keyboard driver loads... So, at this > > early point, sysrq is not working. :-( I'll have to hack the code and use a > > virtual NMI interrupt to force the sysrq handler to be called. > > Whatever works! > > > > Having interrupts disabled on all CPUs would have the effect of disabling > > > the RCU CPU stall warnings. > > > Thanx, Paul > > > > I'm sure the interrupts are not disabled. Here the VM only has 1 virtual > > CPU, > > and when the hang issue happens the virtual serial console is still > > responding > > when I press Enter (it prints a new line) or Ctrl+C (it prints ^C). > > > > Here the VM does not use the "legacy timers" (PIT, Local APIC timer, etc.) > > at > all. > > Instead, the VM uses the Hyper-V para-virtualized timers. It looks the > Hyper-V > > timer never fires in the kdump kernel when the hang issue happens. I'm > > looking into this... 
I suspect this hang issue may only be specific to > > Hyper-V. > > Fair enough, given that timers not working can also suppress RCU CPU > stall warnings. ;-) > > Thanx, Paul FYI: the issue has been fixed by this fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff7b5e6ee63c5d20406a131b260c619cdd24fd1 Thanks, -- Dexuan
[PATCH] x86/hyperv: Initialize clockevents after LAPIC is initialized
With commit 4df4cb9e99f8, the Hyper-V direct-mode STIMER is actually initialized before LAPIC is initialized: see

  apic_intr_mode_init()
    x86_platform.apic_post_init()
      hyperv_init()
        hv_stimer_alloc()
    apic_bsp_setup()
      setup_local_APIC()

setup_local_APIC() temporarily disables LAPIC, initializes it and re-enables it. The direct-mode STIMER depends on LAPIC, and when it's registered, it can be programmed immediately and the timer can fire very soon:

  hv_stimer_init
    clockevents_config_and_register
      clockevents_register_device
        tick_check_new_device
          tick_setup_device
            tick_setup_periodic(), tick_setup_oneshot()
              clockevents_program_event

When the timer fires in the hypervisor, if the LAPIC is in the disabled state, new versions of Hyper-V ignore the event and don't inject the timer interrupt into the VM, and hence the VM hangs when it boots. Note: when the VM starts/reboots, the LAPIC is pre-enabled by the firmware, so the window of LAPIC being temporarily disabled is pretty small, and the issue can only happen once out of 100~200 reboots for a 40-vCPU VM on one dev host, and on another host the issue doesn't reproduce after 2000 reboots. The issue is more noticeable for kdump/kexec, because the LAPIC is disabled by the first kernel, and stays disabled until the kdump/kexec kernel enables it. This is especially an issue for a Generation-2 VM (for which Hyper-V doesn't emulate the PIT timer) when CONFIG_HZ=1000 (rather than CONFIG_HZ=250) is used. Fix the issue by moving hv_stimer_alloc() to a later place where the LAPIC timer is initialized. 
Fixes: 4df4cb9e99f8 ("x86/hyperv: Initialize clockevents earlier in CPU onlining") Signed-off-by: Dexuan Cui --- arch/x86/hyperv/hv_init.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 4638a52d8eae..6375967a8244 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -315,6 +315,25 @@ static struct syscore_ops hv_syscore_ops = { .resume = hv_resume, }; +static void (* __initdata old_setup_percpu_clockev)(void); + +static void __init hv_stimer_setup_percpu_clockev(void) +{ + /* +* Ignore any errors in setting up stimer clockevents +* as we can run with the LAPIC timer as a fallback. +*/ + (void)hv_stimer_alloc(); + + /* +* Still register the LAPIC timer, because the direct-mode STIMER is +* not supported by old versions of Hyper-V. This also allows users +* to switch to LAPIC timer via /sys, if they want to. +*/ + if (old_setup_percpu_clockev) + old_setup_percpu_clockev(); +} + /* * This function is to be invoked early in the boot sequence after the * hypervisor has been detected. @@ -393,10 +412,14 @@ void __init hyperv_init(void) wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64); /* -* Ignore any errors in setting up stimer clockevents -* as we can run with the LAPIC timer as a fallback. +* hyperv_init() is called before LAPIC is initialized: see +* apic_intr_mode_init() -> x86_platform.apic_post_init() and +* apic_bsp_setup() -> setup_local_APIC(). The direct-mode STIMER +* depends on LAPIC, so hv_stimer_alloc() should be called from +* x86_init.timers.setup_percpu_clockev. */ - (void)hv_stimer_alloc(); + old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev; + x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev; hv_apic_init(); -- 2.19.1
RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
> From: Andy Shevchenko > Sent: Saturday, January 9, 2021 12:52 AM >> >> Hi Rafael, Len, and all, >> Can you please take a look at the v2 patch? >> >> The Linux mainline has been broken for several weeks when it >> runs as a guest on Hyper-V, so we'd like this to be fixed ASAP, >> as more people are being affected > > I would like to see a warning printed when the dupped > string violates the spec. Hi Andy, Do you want a simple strlen() check like the below, or a full check of the AAA#### or NNNN#### format? Can we have the v2 (https://lkml.org/lkml/2021/1/8/53) merged first, and then we can add another patch for the format checking? I'm trying to do one thing in one patch so the patch is small enough for easy reviewing. diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index cb229e24c563..e6a5d997241c 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -97,7 +97,7 @@ void acpi_scan_table_handler(u32 event, void *table, void *context); extern struct list_head acpi_bus_id_list; struct acpi_device_bus_id { - char bus_id[15]; + const char *bus_id; unsigned int instance_no; struct list_head node; }; diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index a1b226eb2ce2..3b9902e5d965 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -486,6 +486,7 @@ static void acpi_device_del(struct acpi_device *device) acpi_device_bus_id->instance_no--; else { list_del(&acpi_device_bus_id->node); + kfree_const(acpi_device_bus_id->bus_id); kfree(acpi_device_bus_id); } break; @@ -674,7 +675,23 @@ int acpi_device_add(struct acpi_device *device, } if (!found) { acpi_device_bus_id = new_bus_id; - strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device)); + acpi_device_bus_id->bus_id = + kstrdup_const(acpi_device_hid(device), GFP_KERNEL); + if (!acpi_device_bus_id->bus_id) { + pr_err(PREFIX "Memory allocation error for bus id\n"); + result = -ENOMEM; + goto err_free_new_bus_id; + } + + /* +* ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID): if the +* ID is 
a string, it must be of the form AAA#### or NNNN####, +* i.e. 7 chars or 8 chars. +*/ + if (strlen(acpi_device_bus_id->bus_id) > 8) + pr_warn(PREFIX "too long HID name: %s\n", + acpi_device_bus_id->bus_id); + acpi_device_bus_id->instance_no = 0; list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list); } @@ -709,6 +726,10 @@ int acpi_device_add(struct acpi_device *device, if (device->parent) list_del(&device->node); list_del(&device->wakeup_list); + + err_free_new_bus_id: + if (!found) + kfree(new_bus_id); mutex_unlock(&acpi_device_lock); err_detach:
RE: [PATCH v2] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
> From: Dexuan Cui > Sent: Thursday, January 7, 2021 11:24 PM > ... > Linux VM on Hyper-V crashes with the latest mainline: > ... > > Changes in v2: > strlcpy -> kstrdup_const. Thanks Rafael J. Wysocki! > Change commit log accordingly. Hi Rafael, Len, and all, Can you please take a look at the v2 patch? The Linux mainline has been broken for several weeks when it runs as a guest on Hyper-V, so we'd like this to be fixed ASAP, as more people are being affected, e.g. https://bugzilla.kernel.org/show_bug.cgi?id=210449 Thanks, -- Dexuan
[PATCH v2] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
Linux VM on Hyper-V crashes with the latest mainline: [4.069624] detected buffer overflow in strcpy [4.077733] kernel BUG at lib/string.c:1149! .. [4.085819] RIP: 0010:fortify_panic+0xf/0x11 ... [4.085819] Call Trace: [4.085819] acpi_device_add.cold.15+0xf2/0xfb [4.085819] acpi_add_single_object+0x2a6/0x690 [4.085819] acpi_bus_check_add+0xc6/0x280 [4.085819] acpi_ns_walk_namespace+0xda/0x1aa [4.085819] acpi_walk_namespace+0x9a/0xc2 [4.085819] acpi_bus_scan+0x78/0x90 [4.085819] acpi_scan_init+0xfa/0x248 [4.085819] acpi_init+0x2c1/0x321 [4.085819] do_one_initcall+0x44/0x1d0 [4.085819] kernel_init_freeable+0x1ab/0x1f4 This is because of the recent buffer overflow detection in the commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified string functions") Here acpi_device_bus_id->bus_id can only hold 14 characters, while the acpi_device_hid(device) returns a 22-char string "HYPER_V_GEN_COUNTER_V1". Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a string, it must be of the form AAA#### or NNNN####, i.e. 7 chars or 8 chars. The field bus_id in struct acpi_device_bus_id was originally defined as char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI devices") Fix the issue by changing the field bus_id to const char *, and use kstrdup_const() to initialize it. Signed-off-by: Dexuan Cui --- Changes in v2: strlcpy -> kstrdup_const. Thanks Rafael J. Wysocki! Change commit log accordingly. 
drivers/acpi/internal.h | 2 +- drivers/acpi/scan.c | 14 +- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index cb229e24c563..e6a5d997241c 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -97,7 +97,7 @@ void acpi_scan_table_handler(u32 event, void *table, void *context); extern struct list_head acpi_bus_id_list; struct acpi_device_bus_id { - char bus_id[15]; + const char *bus_id; unsigned int instance_no; struct list_head node; }; diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index a1b226eb2ce2..8d49d192d1c1 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -486,6 +486,7 @@ static void acpi_device_del(struct acpi_device *device) acpi_device_bus_id->instance_no--; else { list_del(&acpi_device_bus_id->node); + kfree_const(acpi_device_bus_id->bus_id); kfree(acpi_device_bus_id); } break; @@ -674,7 +675,14 @@ int acpi_device_add(struct acpi_device *device, } if (!found) { acpi_device_bus_id = new_bus_id; - strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device)); + acpi_device_bus_id->bus_id = + kstrdup_const(acpi_device_hid(device), GFP_KERNEL); + if (!acpi_device_bus_id->bus_id) { + pr_err(PREFIX "Memory allocation error for bus id\n"); + result = -ENOMEM; + goto err_free_new_bus_id; + } + acpi_device_bus_id->instance_no = 0; list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list); } @@ -709,6 +717,10 @@ int acpi_device_add(struct acpi_device *device, if (device->parent) list_del(&device->node); list_del(&device->wakeup_list); + + err_free_new_bus_id: + if (!found) + kfree(new_bus_id); mutex_unlock(&acpi_device_lock); err_detach: -- 2.19.1
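For reference, the _HID string shapes cited in the commit message can be expressed as a small validator. This is an illustrative Python sketch of the ACPI 6.2 Section 6.1.5 rule, not kernel code; the function name is invented:

```python
import string

def valid_acpi_hid(hid):
    # ACPI 6.2, 6.1.5: a string _HID is either "AAA####" (3 uppercase
    # letters + 4 hex digits) or "NNNN####" (4 uppercase letters/digits
    # + 4 hex digits). Anything else, such as a 22-char ID, violates
    # the spec.
    hexdigits = set("0123456789ABCDEF")
    if len(hid) == 7:
        vendor, dev = hid[:3], hid[3:]
        vendor_ok = all(c in string.ascii_uppercase for c in vendor)
    elif len(hid) == 8:
        vendor, dev = hid[:4], hid[4:]
        vendor_ok = all(c in string.ascii_uppercase + string.digits
                        for c in vendor)
    else:
        return False
    return vendor_ok and all(c in hexdigits for c in dev)

# The ID that triggered the crash clearly violates the spec:
print(valid_acpi_hid("HYPER_V_GEN_COUNTER_V1"))  # False
print(valid_acpi_hid("PNP0A03"))                 # True
```

This is why the kernel-side fix avoids a fixed-size buffer entirely: spec-violating firmware exists in the field, so the string length cannot be trusted.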
RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
> From: Rafael J. Wysocki > Sent: Thursday, January 7, 2021 5:48 AM > > > > Linux VM on Hyper-V crashes with the latest mainline: > > > > ... > The root cause is a VM issue AFAICS, though. Yes. > It doesn't look like the right fix to me, though. > > The problem appears to be that the string coming from _HID is too long > (which is a spec violation). Yes. We have requested Hyper-V team to fix the buggy BIOS/firmware, but we have to cope with the existing buggy Hyper-V hosts, at least before the Hyper-V fix is deployed to the hosts, and some old versions of the hosts may not get updated. :-( > The patch truncates it to match the > length of the target buffer, but that is not particularly useful. > > It would be better to use something like kstrdup_const() to initialize > acpi_device_bus_id->bus_id IMV. Makes sense. I'll submit v2 shortly. Thanks, -- Dexuan
[PATCH v3] Drivers: hv: vmbus: Add /sys/bus/vmbus/hibernation
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support hibernation for the VM (this happens on old Hyper-V hosts like Windows Server 2016, or new Hyper-V hosts if the admin or user doesn't declare the hibernation intent for the VM), the VM is discouraged from trying hibernation (because the host doesn't guarantee that the VM's virtual hardware configuration will remain exactly the same across hibernation), i.e. the VM should not try to set up the swap partition/file for hibernation, etc. x86 Hyper-V uses the presence of the virtual ACPI S4 state as the indication of the host toolstack support for a VM. Currently there is no easy and reliable way for the userspace to detect the presence of the state (see https://lkml.org/lkml/2020/12/11/1097). Add /sys/bus/vmbus/hibernation for this purpose. Signed-off-by: Dexuan Cui --- This v3 is similar to v1, and the changes are: Updated the documentation changes. Updated the commit log. /sys/bus/vmbus/supported_features -> /sys/bus/vmbus/hibernation The patch is targeted at the Hyper-V tree's hyperv-next branch. Documentation/ABI/stable/sysfs-bus-vmbus | 7 +++ drivers/hv/vmbus_drv.c | 18 ++ 2 files changed, 25 insertions(+) diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus index c27b7b89477c..42599d9fa161 100644 --- a/Documentation/ABI/stable/sysfs-bus-vmbus +++ b/Documentation/ABI/stable/sysfs-bus-vmbus @@ -1,3 +1,10 @@ +What: /sys/bus/vmbus/hibernation +Date: Jan 2021 +KernelVersion: 5.12 +Contact: Dexuan Cui +Description: Whether the host supports hibernation for the VM. +Users: Daemon that sets up swap partition/file for hibernation. 
+ What: /sys/bus/vmbus/devices//id Date: Jul 2009 KernelVersion: 2.6.31 diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index d491fdcee61f..4c544473b1d9 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -678,6 +678,23 @@ static const struct attribute_group vmbus_dev_group = { }; __ATTRIBUTE_GROUPS(vmbus_dev); +/* Set up the attribute for /sys/bus/vmbus/hibernation */ +static ssize_t hibernation_show(struct bus_type *bus, char *buf) +{ + return sprintf(buf, "%d\n", !!hv_is_hibernation_supported()); +} + +static BUS_ATTR_RO(hibernation); + +static struct attribute *vmbus_bus_attrs[] = { + &bus_attr_hibernation.attr, + NULL, +}; +static const struct attribute_group vmbus_bus_group = { + .attrs = vmbus_bus_attrs, +}; +__ATTRIBUTE_GROUPS(vmbus_bus); + /* * vmbus_uevent - add uevent for our device * @@ -1024,6 +1041,7 @@ static struct bus_type hv_bus = { .uevent = vmbus_uevent, .dev_groups = vmbus_dev_groups, .drv_groups = vmbus_drv_groups, + .bus_groups = vmbus_bus_groups, .pm = &vmbus_pm, }; -- 2.19.1
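On the consumer side, the daemon mentioned in the Users: field only has to read a single value. The following is a hedged Python sketch (helper names invented; the attribute path is the one added by the patch):

```python
def vm_hibernation_enabled(attr_contents):
    # The attribute holds a single value, "0" or "1", following the
    # sysfs one-value-per-file convention.
    return attr_contents.strip() == "1"

def read_vm_hibernation(path="/sys/bus/vmbus/hibernation"):
    # Returns False when the attribute is absent, e.g. on a pre-5.12
    # kernel or when not running as a Hyper-V guest.
    try:
        with open(path) as f:
            return vm_hibernation_enabled(f.read())
    except OSError:
        return False
```

A setup agent could then skip creating the swap partition/file whenever this returns False.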
RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features
> From: Michael Kelley > Sent: Wednesday, January 6, 2021 9:38 AM > From: Dexuan Cui > Sent: Tuesday, December 22, 2020 4:12 PM > > > > When a Linux VM runs on Hyper-V, if the host toolstack doesn't support > > hibernation for the VM (this happens on old Hyper-V hosts like Windows > > Server 2016, or new Hyper-V hosts if the admin or user doesn't declare > > the hibernation intent for the VM), the VM is discouraged from trying > > hibernation (because the host doesn't guarantee that the VM's virtual > > hardware configuration will remain exactly the same across hibernation), > > i.e. the VM should not try to set up the swap partition/file for > > hibernation, etc. > > > > x86 Hyper-V uses the presence of the virtual ACPI S4 state as the > > indication of the host toolstack support for a VM. Currently there is > > no easy and reliable way for the userspace to detect the presence of > > the state (see ...). Add > > /sys/bus/vmbus/supported_features for this purpose. > > I'm OK with surfacing the hibernation capability via an entry in > /sys/bus/vmbus. Correct me if I'm wrong, but I think the concept > being surfaced is not "ACPI S4 state" precisely, but slightly more > generally whether hibernation is supported for the VM. While > those two concepts may be 1:1 for the moment, there might be > future configurations where "hibernation is supported" depends > on other factors as well. For x86, I believe the virtual ACPI S4 state exists only when the admin/user declares the intent of "enable hibernation for the VM" via some PowerShell/WMI command. On Azure, if a VM size is not suitable for hibernation (e.g. an existing VM has an ephemeral local disk), the toolstack on the host should not enable the ACPI S4 state for the VM. That's why we implemented hv_is_hibernation_supported() for x86 by checking the ACPI S4 state, and we have used the function hv_is_hibernation_supported() in hv_utils and hv_balloon for quite a while. 
For ARM, IIRC there is no concept of ACPI S4 state, so currently hv_is_hibernation_supported() is actually not implemented. Not sure why hv_utils and hv_balloon can build successfully... :-) Probably Boqun can help to take a look. > > The guidance for things in /sys is that they generally should > be single valued (see Documentation/filesystems/sysfs.rst). So my > recommendation is to create a "hibernation" entry that has a value > of 0 or 1. > > Michael Got it. Then let's use /sys/bus/vmbus/hibernation. Will post v3. Thanks, -- Dexuan
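Michael's suggestion is a single-valued /sys/bus/vmbus/hibernation entry containing 0 or 1. As a hedged sketch of how the swap-setup daemon mentioned in the thread might consume such a file: the path and helper names below are assumptions based on this discussion, not an existing interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper for a hibernation-setup daemon: parse the contents
 * of a single-valued sysfs file such as /sys/bus/vmbus/hibernation,
 * which per the discussion above would contain "0\n" or "1\n".
 */
static int parse_hibernation_flag(const char *contents)
{
	/* Accept "1" optionally followed by a newline; anything else is "no". */
	return contents[0] == '1' &&
	       (contents[1] == '\0' || contents[1] == '\n');
}

static int vm_supports_hibernation(const char *sysfs_path)
{
	char buf[8] = "";
	FILE *f = fopen(sysfs_path, "r");

	if (!f)
		return 0;	/* file absent: kernel too old, or not on Hyper-V */
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_hibernation_flag(buf);
}
```

Treating a missing file as "not supported" keeps the daemon safe on kernels that predate the new sysfs entry.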
RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features
> From: Wei Liu > Sent: Wednesday, January 6, 2021 8:24 AM > To: Dexuan Cui > > > > That said, I don't know if there is any hard rule in regard of > > the timing here. If there is, then v5.12 is OK to me. :-) > > > > By the time you posted this (Dec 22), 5.11 was already more or less > "frozen". Linus wanted -next patches to be merged before Christmas. Got it. Thanks for the explanation! > The way I see it this is a new sysfs interface so I think this is > something new, which is for 5.12. Ok. > Do you think this should be considered a bug fix? > > Wei. Then let's not consider it for v5.11. For now I think the userspace tool/daemon can check 'dmesg' for the existence of ACPI S4 state as a workaround. This is not ideal, but it should work reasonably well, assuming the tool/daemon runs early enough, so the kernel msg buffer is not yet filled up and overwritten. :-) Thanks, -- Dexuan
[PATCH v2] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support hibernation for the VM (this happens on old Hyper-V hosts like Windows Server 2016, or new Hyper-V hosts if the admin or user doesn't declare the hibernation intent for the VM), the VM is discouraged from trying hibernation (because the host doesn't guarantee that the VM's virtual hardware configuration will remain exactly the same across hibernation), i.e. the VM should not try to set up the swap partition/file for hibernation, etc. x86 Hyper-V uses the presence of the virtual ACPI S4 state as the indication of the host toolstack support for a VM. Currently there is no easy and reliable way for the userspace to detect the presence of the state (see https://lkml.org/lkml/2020/12/11/1097). Add /sys/bus/vmbus/supported_features for this purpose. Signed-off-by: Dexuan Cui --- Change in v2: Added a "Values:" section Updated "Date:" Documentation/ABI/stable/sysfs-bus-vmbus | 9 + drivers/hv/vmbus_drv.c | 20 2 files changed, 29 insertions(+) diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus index c27b7b89477c..c8d56389b7be 100644 --- a/Documentation/ABI/stable/sysfs-bus-vmbus +++ b/Documentation/ABI/stable/sysfs-bus-vmbus @@ -1,3 +1,12 @@ +What: /sys/bus/vmbus/supported_features +Date: Jan 2021 +KernelVersion: 5.11 +Contact: Dexuan Cui +Description: Features specific to VMs running on Hyper-V +Values: A list of strings. + hibernation: the host toolstack supports hibernation for the VM. 
+Users: Daemon that sets up swap partition/file for hibernation + What: /sys/bus/vmbus/devices//id Date: Jul 2009 KernelVersion: 2.6.31 diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index d491fdcee61f..958487a40a18 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -678,6 +678,25 @@ static const struct attribute_group vmbus_dev_group = { }; __ATTRIBUTE_GROUPS(vmbus_dev); +/* Set up bus attribute(s) for /sys/bus/vmbus/supported_features */ +static ssize_t supported_features_show(struct bus_type *bus, char *buf) +{ + bool hb = hv_is_hibernation_supported(); + + return sprintf(buf, "%s\n", hb ? "hibernation" : ""); +} + +static BUS_ATTR_RO(supported_features); + +static struct attribute *vmbus_bus_attrs[] = { + &bus_attr_supported_features.attr, + NULL, +}; +static const struct attribute_group vmbus_bus_group = { + .attrs = vmbus_bus_attrs, +}; +__ATTRIBUTE_GROUPS(vmbus_bus); + /* * vmbus_uevent - add uevent for our device * @@ -1024,6 +1043,7 @@ static struct bus_type hv_bus = { .uevent = vmbus_uevent, .dev_groups = vmbus_dev_groups, .drv_groups = vmbus_drv_groups, + .bus_groups = vmbus_bus_groups, .pm = &vmbus_pm, }; -- 2.19.1
RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features
> From: Wei Liu > Sent: Tuesday, January 5, 2021 4:58 AM > > ... > > Documentation/ABI/stable/sysfs-bus-vmbus | 7 +++ > > drivers/hv/vmbus_drv.c | 20 > > > 2 files changed, 27 insertions(+) > > > > diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus > b/Documentation/ABI/stable/sysfs-bus-vmbus > > index c27b7b89477c..3ba765ae6695 100644 > > --- a/Documentation/ABI/stable/sysfs-bus-vmbus > > +++ b/Documentation/ABI/stable/sysfs-bus-vmbus > > @@ -1,3 +1,10 @@ > > +What: /sys/bus/vmbus/supported_features > > +Date: Dec 2020 > > +KernelVersion: 5.11 > > Too late for 5.11 now. The patch was posted 2 weeks ago, before 5.11-rc1. :-) If possible, we still want this patch to be in 5.11, because: 1. The patch is small and IMO pretty safe. 2. The patch adds new code and doesn't really change any old code. 3. The patch adds a new sys file, which is needed by some user space tool to auto-setup the configuration for hibernation. We'd like to integrate the patch into the Linux distros ASAP and as we know some distros don't accept a patch if it's not in the mainline. So, if the patch itself looks good, IMO it would be better to be in v5.11 rather than in v5.12, which will need extra time of 2~3 months. That said, I don't know if there is any hard rule in regard to the timing here. If there is, then v5.12 is OK to me. :-) > Given this is a list of strings, do you want to enumerate them in a > Values section or the Description section? > > Wei. Currently there is only one possible string "hibernation". Not sure if we would need to add more strings in future, but yes it's a good idea to list the string in a "Values" section. Will post v2 shortly. Thanks, -- Dexuan
RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
> From: Michael Kelley > Sent: Tuesday, December 22, 2020 5:56 AM > From: Dexuan Cui > Sent: Thursday, December 17, 2020 > 8:08 PM > > > > Linux VM on Hyper-V crashes with the latest mainline: > > ... > > --- a/drivers/acpi/scan.c > > +++ b/drivers/acpi/scan.c > > @@ -674,7 +674,8 @@ int acpi_device_add(struct acpi_device *device, > > } > > if (!found) { > > acpi_device_bus_id = new_bus_id; > > - strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device)); > > + strlcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device), > > + sizeof(acpi_device_bus_id->bus_id)); > > acpi_device_bus_id->instance_no = 0; > > list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list); > > } > > Reviewed-by: Michael Kelley Hi, ACPI maintainers, Would you please take a look at the small fix? Currently the mainline Linux kernel, running in a VM on Hyper-V, has been broken for almost 3 weeks, i.e. the VM always panics when it boots. The patch has already had Michael's Reviewed-by. BTW, the patch should have a stable tag: Cc: Or, do you want the patch to go through the Hyper-V tree? https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes The small patch is unlikely to cause a merge conflict, and it only affects Linux VMs on Hyper-V so far. Thanks, -- Dexuan
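The fix above replaces strcpy() with strlcpy() because acpi_device_bus_id->bus_id is a fixed-size array and an over-long HID from the firmware overflowed it. Since glibc does not ship strlcpy(), here is a minimal userspace re-implementation (the kernel's semantics: always NUL-terminate, return the full source length) to illustrate why the bounded copy is safe; buffer sizes below are illustrative, not the actual bus_id size.

```c
#include <stddef.h>
#include <string.h>

/*
 * Minimal userspace equivalent of the kernel's strlcpy(): copy at most
 * size-1 bytes, always NUL-terminate, and return strlen(src) so the
 * caller can detect truncation.  With strcpy(), a source longer than
 * the destination writes past the end of the buffer -- the bug fixed
 * in acpi_device_add() above.
 */
static size_t my_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size) {
		size_t n = (len >= size) ? size - 1 : len;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;	/* length the caller *tried* to copy */
}
```

A caller can compare the return value against the buffer size to notice that truncation occurred, which strcpy() gives no way to do.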
RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?
> From: Pavel Machek > Sent: Monday, December 21, 2020 11:08 AM > > On Sat 2020-12-12 01:20:30, Dexuan Cui wrote: > > Hi all, > > It looks like Linux can hibernate even if the system does not support the > > ACPI > > S4 state, as long as the system can shut down, so "cat /sys/power/state" > > always contains "disk", unless we specify the kernel parameter "nohibernate" > > or we use LOCKDOWN_HIBERNATION. > > > > In some scenarios IMO it can still be useful if the userspace is able to > > detect > > if the ACPI S4 state is supported or not, e.g. when a Linux guest runs on > > Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the > > proper > > support of the tool stack on the host, i.e. the guest is discouraged from > > trying hibernation if the state is not supported. > > Umm. Does not sound like exactly strong reason to me. Hi Pavel, Thanks for the reply. I realized that it may be better for me to add the code to Hyper-V specific driver: https://lkml.org/lkml/2020/12/22/719 Let's see how it goes that way. :-) Thanks, Dexuan
RE: [PATCH -tip V2 00/10] workqueue: break affinity initiatively
> From: Dexuan Cui > Sent: Wednesday, December 23, 2020 12:27 PM > ... > The warning only repros if there are more than 1 node, and it only prints once > for the first vCPU of the second node (i.e. node #1). A correction: if I configure the 32 vCPUs evenly into 4 nodes, I get the warning once for node #1~#3, respectively. Thanks, -- Dexuan --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2376,9 +2376,14 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, * For kernel threads that do indeed end up on online && * !active we want to ensure they are strict per-CPU threads. */ - WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) && + WARN(cpumask_intersects(new_mask, cpu_online_mask) && !cpumask_intersects(new_mask, cpu_active_mask) && - p->nr_cpus_allowed != 1); + p->nr_cpus_allowed != 1, "%*pbl, %*pbl, %*pbl, %d\n", + cpumask_pr_args(new_mask), + cpumask_pr_args(cpu_online_mask), + cpumask_pr_args(cpu_active_mask), + p->nr_cpus_allowed + ); } [1.791611] smp: Bringing up secondary CPUs ... 
[1.795225] x86: Booting SMP configuration: [1.798964] node #0, CPUs:#1 #2 #3 #4 #5 #6 #7 [1.807068] node #1, CPUs:#8 [1.094226] smpboot: CPU 8 Converting physical 0 to logical die 1 [1.895211] [ cut here ] [1.899058] 8-15, 0-8, 0-7, 8 [1.899058] WARNING: CPU: 8 PID: 50 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0 [1.899058] CPU: 8 PID: 50 Comm: cpuhp/8 Not tainted 5.10.0+ #4 [1.899058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0 [1.899058] Call Trace: [1.899058] worker_attach_to_pool+0x53/0xd0 [1.899058] create_worker+0xf9/0x190 [1.899058] alloc_unbound_pwq+0x3a5/0x3b0 [1.899058] wq_update_unbound_numa+0x112/0x1c0 [1.899058] workqueue_online_cpu+0x1d0/0x220 [1.899058] cpuhp_invoke_callback+0x82/0x4a0 [1.899058] cpuhp_thread_fun+0xb8/0x120 [1.899058] smpboot_thread_fn+0x198/0x230 [1.899058] kthread+0x13d/0x160 [1.899058] ret_from_fork+0x22/0x30 [1.903058] #9 #10 #11 #12 #13 #14 #15 [1.907092] node #2, CPUs: #16 [1.094226] smpboot: CPU 16 Converting physical 0 to logical die 2 [1.995205] [ cut here ] [1.999058] 16-23, 0-16, 0-15, 8 [1.999058] WARNING: CPU: 16 PID: 91 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0 [1.999058] CPU: 16 PID: 91 Comm: cpuhp/16 Tainted: GW 5.10.0+ #4 [1.999058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0 [1.999058] Call Trace: [1.999058] worker_attach_to_pool+0x53/0xd0 [1.999058] create_worker+0xf9/0x190 [1.999058] alloc_unbound_pwq+0x3a5/0x3b0 [1.999058] wq_update_unbound_numa+0x112/0x1c0 [1.999058] workqueue_online_cpu+0x1d0/0x220 [1.999058] cpuhp_invoke_callback+0x82/0x4a0 [1.999058] cpuhp_thread_fun+0xb8/0x120 [1.999058] smpboot_thread_fn+0x198/0x230 [1.999058] kthread+0x13d/0x160 [1.999058] ret_from_fork+0x22/0x30 [2.003058] #17 #18 #19 #20 #21 #22 #23 [2.007092] node #3, CPUs: #24 [1.094226] smpboot: CPU 24 Converting physical 0 to logical die 3 [2.095220] [ cut here ] [2.099058] 24-31, 0-24, 0-23, 8 [2.099058] WARNING: CPU: 24 PID: 132 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0 
[2.099058] CPU: 24 PID: 132 Comm: cpuhp/24 Tainted: GW 5.10.0+ #4 [2.099058] Call Trace: [2.099058] worker_attach_to_pool+0x53/0xd0 [2.099058] create_worker+0xf9/0x190 [2.099058] alloc_unbound_pwq+0x3a5/0x3b0 [2.099058] wq_update_unbound_numa+0x112/0x1c0 [2.099058] workqueue_online_cpu+0x1d0/0x220 [2.099058] cpuhp_invoke_callback+0x82/0x4a0 [2.099058] cpuhp_thread_fun+0xb8/0x120 [2.099058] smpboot_thread_fn+0x198/0x230 [2.099058] kthread+0x13d/0x160 [2.099058] ret_from_fork+0x22/0x30 [2.103058] #25 #26 #27 #28 #29 #30 #31 [2.108091] smp: Brought up 4 nodes, 32 CPUs [2.115065] smpboot: Max logical packages: 4 [2.119067] smpboot: Total of 32 processors activated (146992.31 BogoMIPS)
RE: [PATCH -tip V2 00/10] workqueue: break affinity initiatively
> From: Lai Jiangshan > Sent: Wednesday, December 23, 2020 7:02 AM > > > > Hi, > > I tested this patchset on today's tip.git's master branch > > (981316394e35 ("Merge branch 'locking/urgent'")). > > > > Every time the kernel boots with 32 CPUs (I'm running the Linux VM on > > Hyper-V), I get the below warning. > > (BTW, with 8 or 16 CPUs, I don't see the warning). > > By printing the cpumasks with "%*pbl", I know the warning happens > > because: > > new_mask = 16-31 > > cpu_online_mask= 0-16 > > cpu_active_mask= 0-15 > > p->nr_cpus_allowed=16 > > > > Hello, Dexuan > > Could you omit patch4 of the patchset and test it again, please? > ("workqueue: don't set the worker's cpumask when kthread_bind_mask()") > > kthread_bind_mask() set the worker task to the pool's cpumask without > any check. And set_cpus_allowed_ptr() finds that the task's cpumask > is unchanged (already set by kthread_bind_mask()) and skips all the checks. > > And I found that numa=fake=2U seems broken on cpumask_of_node() in my > box. > > Thanks, > Lai Looks like your analysis is correct: the warning can't repro if I configure all the 32 vCPUs into 1 virtual NUMA node (and I don't see the message "smpboot: CPU 16 Converting physical 0 to logical die 1"): [1.495440] smp: Bringing up secondary CPUs ... [1.499207] x86: Booting SMP configuration: [1.503038] node #0, CPUs:#1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 [1.531930] smp: Brought up 1 node, 32 CPUs [1.538779] smpboot: Max logical packages: 1 [1.539041] smpboot: Total of 32 processors activated (146859.90 BogoMIPS) The warning only repros if there are more than 1 node, and it only prints once for the first vCPU of the second node (i.e. node #1). With more than 1 node, if I don't use patch4, the warning does not repro. Thanks, -- Dexuan
[PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support hibernation for the VM (this happens on old Hyper-V hosts like Windows Server 2016, or new Hyper-V hosts if the admin or user doesn't declare the hibernation intent for the VM), the VM is discouraged from trying hibernation (because the host doesn't guarantee that the VM's virtual hardware configuration will remain exactly the same across hibernation), i.e. the VM should not try to set up the swap partition/file for hibernation, etc. x86 Hyper-V uses the presence of the virtual ACPI S4 state as the indication of the host toolstack support for a VM. Currently there is no easy and reliable way for the userspace to detect the presence of the state (see https://lkml.org/lkml/2020/12/11/1097). Add /sys/bus/vmbus/supported_features for this purpose. Signed-off-by: Dexuan Cui --- Documentation/ABI/stable/sysfs-bus-vmbus | 7 +++ drivers/hv/vmbus_drv.c | 20 2 files changed, 27 insertions(+) diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus index c27b7b89477c..3ba765ae6695 100644 --- a/Documentation/ABI/stable/sysfs-bus-vmbus +++ b/Documentation/ABI/stable/sysfs-bus-vmbus @@ -1,3 +1,10 @@ +What: /sys/bus/vmbus/supported_features +Date: Dec 2020 +KernelVersion: 5.11 +Contact: Dexuan Cui +Description: Features specific to VMs running on Hyper-V +Users: Daemon that sets up swap partition/file for hibernation + What: /sys/bus/vmbus/devices//id Date: Jul 2009 KernelVersion: 2.6.31 diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index d491fdcee61f..958487a40a18 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -678,6 +678,25 @@ static const struct attribute_group vmbus_dev_group = { }; __ATTRIBUTE_GROUPS(vmbus_dev); +/* Set up bus attribute(s) for /sys/bus/vmbus/supported_features */ +static ssize_t supported_features_show(struct bus_type *bus, char *buf) +{ + bool hb = hv_is_hibernation_supported(); + + return sprintf(buf, "%s\n", hb 
? "hibernation" : ""); +} + +static BUS_ATTR_RO(supported_features); + +static struct attribute *vmbus_bus_attrs[] = { + &bus_attr_supported_features.attr, + NULL, +}; +static const struct attribute_group vmbus_bus_group = { + .attrs = vmbus_bus_attrs, +}; +__ATTRIBUTE_GROUPS(vmbus_bus); + /* * vmbus_uevent - add uevent for our device * @@ -1024,6 +1043,7 @@ static struct bus_type hv_bus = { .uevent = vmbus_uevent, .dev_groups = vmbus_dev_groups, .drv_groups = vmbus_drv_groups, + .bus_groups = vmbus_bus_groups, .pm = &vmbus_pm, }; -- 2.19.1
RE: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
> From: Valentin Schneider > Sent: Tuesday, December 22, 2020 5:40 AM > To: Dexuan Cui > Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com; > vincent.guit...@linaro.org; dietmar.eggem...@arm.com; > rost...@goodmis.org; bseg...@google.com; mgor...@suse.de; > bris...@redhat.com; x...@kernel.org; linux...@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley > > Subject: Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel > BUG at kernel/sched/core.c:7596! > > > Hi, > > On 22/12/20 09:13, Dexuan Cui wrote: > > Hi, > > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) > on Hyper-V. > > When I test hibernation, the VM can easily hit the below BUG_ON during the > resume > > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has > 40 vCPUs. > > > > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may > be broken? > > > > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and > rq->nr_pinned==0: > > > > 7587 int sched_cpu_dying(unsigned int cpu) > > 7588 { > > 7589 struct rq *rq = cpu_rq(cpu); > > 7590 struct rq_flags rf; > > 7591 > > 7592 /* Handle pending wakeups and then migrate everything off > */ > > 7593 sched_tick_stop(cpu); > > 7594 > > 7595 rq_lock_irqsave(rq, &rf); > > 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)); > > 7597 rq_unlock_irqrestore(rq, &rf); > > 7598 > > 7599 calc_load_migrate(rq); > > 7600 update_max_interval(); > > 7601 nohz_balance_exit_idle(rq); > > 7602 hrtick_clear(rq); > > 7603 return 0; > > 7604 } > > > > The last commit that touches the BUG_ON line is the commit > > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug > cooperative") > > but the commit looks good to me. > > > > Any idea? > > > > I'd wager this extra task is a kworker; could you give this series a try? > > > https://lore.kernel.org/lkml/20201218170919.2950-1-jiangshan...@gmail.com/ Thanks, Valentin! 
It looks like the patchset can fix the BUG_ON, though I see a warning, which I reported here: https://lkml.org/lkml/2020/12/22/648 Thanks, -- Dexuan
v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
Hi, I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on Hyper-V. When I test hibernation, the VM can easily hit the below BUG_ON during the resume procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 vCPUs. I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be broken? In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0: 7587 int sched_cpu_dying(unsigned int cpu) 7588 { 7589 struct rq *rq = cpu_rq(cpu); 7590 struct rq_flags rf; 7591 7592 /* Handle pending wakeups and then migrate everything off */ 7593 sched_tick_stop(cpu); 7594 7595 rq_lock_irqsave(rq, &rf); 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)); 7597 rq_unlock_irqrestore(rq, &rf); 7598 7599 calc_load_migrate(rq); 7600 update_max_interval(); 7601 nohz_balance_exit_idle(rq); 7602 hrtick_clear(rq); 7603 return 0; 7604 } The last commit that touches the BUG_ON line is the commit 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative") but the commit looks good to me. Any idea? Thanks, -- Dexuan [5.383378] PM: Image signature found, resuming [5.388027] PM: hibernation: resume from hibernation [5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done. [5.397058] OOM killer disabled. [5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [5.413013] PM: hibernation: Marking nosave pages: [mem 0x-0x0fff] [5.419331] PM: hibernation: Marking nosave pages: [mem 0x0009f000-0x000f] [5.425502] PM: hibernation: Marking nosave pages: [mem 0xf7ff-0x] [5.431706] PM: hibernation: Basic memory bitmaps created [5.465205] PM: Using 3 thread(s) for decompression [5.469590] PM: Loading and decompressing image data (505553 pages)... 
[5.492790] PM: Image loading progress: 0% [6.532641] PM: Image loading progress: 10% [6.899683] PM: Image loading progress: 20% [7.251672] PM: Image loading progress: 30% [7.598808] PM: Image loading progress: 40% [8.056153] PM: Image loading progress: 50% [8.534077] PM: Image loading progress: 60% [9.029886] PM: Image loading progress: 70% [9.542015] PM: Image loading progress: 80% [ 10.025326] PM: Image loading progress: 90% [ 10.525804] PM: Image loading progress: 100% [ 10.530241] PM: Image loading done [ 10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 MB/s) [ 10.540827] PM: Image successfully loaded [ 10.599243] serial 00:04: disabled [ 10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 should not be sleeping [ 10.646542] Disabling non-boot CPUs ... [ 10.694380] smpboot: CPU 1 is now offline [ 10.729044] smpboot: CPU 2 is now offline [ 10.756819] smpboot: CPU 3 is now offline [ 10.776784] smpboot: CPU 4 is now offline [ 10.800866] smpboot: CPU 5 is now offline [ 10.828923] smpboot: CPU 6 is now offline [ 10.849013] smpboot: CPU 7 is now offline [ 10.876722] smpboot: CPU 8 is now offline [ 10.909426] smpboot: CPU 9 is now offline [ 10.929360] smpboot: CPU 10 is now offline [ 10.957059] smpboot: CPU 11 is now offline [ 10.976542] smpboot: CPU 12 is now offline [ 11.008470] smpboot: CPU 13 is now offline [ 11.036356] smpboot: CPU 14 is now offline [ 11.072396] smpboot: CPU 15 is now offline [ 11.100229] smpboot: CPU 16 is now offline [ 11.128638] smpboot: CPU 17 is now offline [ 11.148479] smpboot: CPU 18 is now offline [ 11.173537] smpboot: CPU 19 is now offline [ 11.205680] smpboot: CPU 20 is now offline [ 11.231799] smpboot: CPU 21 is now offline [ 11.259562] smpboot: CPU 22 is now offline [ 11.288672] smpboot: CPU 23 is now offline [ 11.319691] smpboot: CPU 24 is now offline [ 11.355523] smpboot: CPU 25 is now offline [ 11.399469] smpboot: CPU 26 is now offline [ 11.435438] smpboot: CPU 27 is now offline [ 
11.471402] smpboot: CPU 28 is now offline [ 11.515488] smpboot: CPU 29 is now offline [ 11.551355] smpboot: CPU 30 is now offline [ 11.591326] smpboot: CPU 31 is now offline [ 11.624004] smpboot: CPU 32 is now offline [ 11.659326] smpboot: CPU 33 is now offline [ 11.687478] smpboot: CPU 34 is now offline [ 11.719243] smpboot: CPU 35 is now offline [ 11.747252] smpboot: CPU 36 is now offline [ 11.771224] smpboot: CPU 37 is now offline [ 11.795089] [ cut here ] [ 11.798052] kernel BUG at kernel/sched/core.c:7596! [ 11.798052] invalid opcode: [#1] PREEMPT SMP PTI [ 11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: GE 5.10.0+ #6 [ 11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [ 11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0 [ 11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0 [ 11.
[PATCH v2] x86/hyperv: Fix kexec panic/hang issues
Currently the kexec kernel can panic or hang due to 2 causes: 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the old VP Assist Pages when the kexec kernel runs. The same issue is fixed for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for hibernation"). Now fix it for kexec. 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so between hv_kexec_handler() and native_machine_shutdown(), the other CPUs can still try to access the hypercall page and cause panic. The workaround "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move hyperv_cleanup() to a better place. Signed-off-by: Dexuan Cui --- Changes in v2: Improved the commit log as Michael Kelley suggested. No change to v1 otherwise. arch/x86/hyperv/hv_init.c | 4 arch/x86/include/asm/mshyperv.h | 2 ++ arch/x86/kernel/cpu/mshyperv.c | 18 ++ drivers/hv/vmbus_drv.c | 2 -- 4 files changed, 24 insertions(+), 2 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27..4638a52d8eae 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,8 @@ #include #include +int hyperv_init_cpuhp; + void *hv_hypercall_pg; EXPORT_SYMBOL_GPL(hv_hypercall_pg); @@ -401,6 +404,7 @@ void __init hyperv_init(void) register_syscore_ops(&hv_syscore_ops); + hyperv_init_cpuhp = cpuhp; return; remove_cpuhp_state: diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index ffc289992d1b..30f76b966857 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -74,6 +74,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {} #if IS_ENABLED(CONFIG_HYPERV) +extern int hyperv_init_cpuhp; + extern void *hv_hypercall_pg; extern void __percpu **hyperv_pcpu_input_arg; diff --git 
a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f..43b54bef5448 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -135,14 +135,32 @@ static void hv_machine_shutdown(void) { if (kexec_in_progress && hv_kexec_handler) hv_kexec_handler(); + + /* +* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor +* corrupts the old VP Assist Pages and can crash the kexec kernel. +*/ + if (kexec_in_progress && hyperv_init_cpuhp > 0) + cpuhp_remove_state(hyperv_init_cpuhp); + + /* The function calls stop_other_cpus(). */ native_machine_shutdown(); + + /* Disable the hypercall page when there is only 1 active CPU. */ + if (kexec_in_progress) + hyperv_cleanup(); } static void hv_machine_crash_shutdown(struct pt_regs *regs) { if (hv_crash_handler) hv_crash_handler(regs); + + /* The function calls crash_smp_send_stop(). */ native_machine_crash_shutdown(regs); + + /* Disable the hypercall page when there is only 1 active CPU. */ + hyperv_cleanup(); } #endif /* CONFIG_KEXEC_CORE */ #endif /* CONFIG_HYPERV */ diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 502f8cd95f6d..d491fdcee61f 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -2550,7 +2550,6 @@ static void hv_kexec_handler(void) /* Make sure conn_state is set as hv_synic_cleanup checks for it */ mb(); cpuhp_remove_state(hyperv_cpuhp_online); - hyperv_cleanup(); }; static void hv_crash_handler(struct pt_regs *regs) @@ -2566,7 +2565,6 @@ static void hv_crash_handler(struct pt_regs *regs) cpu = smp_processor_id(); hv_stimer_cleanup(cpu); hv_synic_disable_regs(cpu); - hyperv_cleanup(); }; static int hv_synic_suspend(void) -- 2.19.1
RE: [PATCH] x86/hyperv: Fix kexec panic/hang issues
> From: Michael Kelley > Sent: Monday, December 21, 2020 7:36 PM > ... > Since we don't *need* to do this, I think it might be less risky to just leave > the code "as is". But I'm OK either way. Ok, then I'll leave it as is in v2. Thanks, -- Dexuan
RE: [PATCH] x86/hyperv: Fix kexec panic/hang issues
> From: Michael Kelley > Sent: Monday, December 21, 2020 3:33 PM > From: Dexuan Cui > Sent: Sunday, December 20, 2020 2:30 PM > > > > Currently the kexec kernel can panic or hang due to 2 causes: > > > > 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the > > VP Assist Pages when the kexec kernel runs. We ever fixed the same issue > > Spurious word "ever". And avoid first person "we". Perhaps: > > The same issue is fixed for hibernation in commit . . Now fix it for > kexec. Thanks! Will use this in v2. > > for hibernation in > > commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for > hibernation") > > and now let's fix it for kexec. > > Is the VP Assist Page getting cleared anywhere on the panic path? We can It's not. > only do it for the CPU that runs panic(), but I don't think it is getting > cleared > even for that CPU. It is cleared only in hv_cpu_die(), and that's not > called on > the panic path. IMO kdump is different from the non-kdump kexec in that the kdump kernel runs without depending on the memory used by the first kernel, so it looks unnecessary to clear the first kernel's VP Assist Page (and the hypercall page). According to my test, the second kernel can re-enable the VP Assist Page and the hypercall page using different GPAs, without disabling the old pages first. 
Of course, in the future Hyper-V may require the guest to disable the pages first before trying to re-enable them, so I agree we'd better clear the pages in the first kernel like this: diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 4638a52d8eae..8022f51c9c05 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -202,7 +202,7 @@ void clear_hv_tscchange_cb(void) } EXPORT_SYMBOL_GPL(clear_hv_tscchange_cb); -static int hv_cpu_die(unsigned int cpu) +int hv_cpu_die(unsigned int cpu) { struct hv_reenlightenment_control re_ctrl; unsigned int new_cpu; diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index 30f76b966857..d090e781d216 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -76,6 +76,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {} #if IS_ENABLED(CONFIG_HYPERV) extern int hyperv_init_cpuhp; +int hv_cpu_die(unsigned int cpu); + extern void *hv_hypercall_pg; extern void __percpu **hyperv_pcpu_input_arg; diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index 43b54bef5448..e54f8262bfe0 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -156,6 +156,9 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs) if (hv_crash_handler) hv_crash_handler(regs); + /* Only call this on the faulting CPU. */ + hv_cpu_die(raw_smp_processor_id()); + /* The function calls crash_smp_send_stop(). */ native_machine_crash_shutdown(regs); > > 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs > > are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so > > between hv_kexec_handler() and native_machine_shutdown(), the other > CPUs > > can still try to access the hypercall page and cause panic. The workaround > > "hv_hypercall_pg = NULL;" in hyperv_cleanup() can't work reliably. > > I would note that the comment in hv_suspend() is also incorrect on this > point. 
Setting hv_hypercall_pg to NULL does not cause subsequent > hypercalls to fail safely. The fast hypercalls don't test for it, and even > if they > did test like hv_do_hypercall(), the test just creates a race condition. The comment in hv_suspend() should be correct because hv_suspend() is only called during hibernation from the syscore_ops path where only one CPU is active, e.g. for the suspend operation, it's called from:

state_store
  hibernate
    hibernation_snapshot
      create_image
        suspend_disable_secondary_cpus
          syscore_suspend
            hv_suspend

It's similar for the resume operation:

resume_store
  software_resume
    load_image_and_restore
      hibernation_restore
        resume_target_kernel
          hibernate_resume_nonboot_cpu_disable
            syscore_suspend
              hv_suspend

> > static void hv_machine_crash_shutdown(struct pt_regs *regs) > > { > > if (hv_crash_handler) > > hv_crash_handler(regs); > > + > > + /* The function calls crash_smp_send_stop(). */ > > Actually, crash_smp_send_stop() or smp_send_stop() has already been > called earlier by panic(), This is true only when the Hyper-V host supports the feature HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. On an old Hyper-V hos
[PATCH] x86/hyperv: Fix kexec panic/hang issues
Currently the kexec kernel can panic or hang due to 2 causes:

1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the VP Assist Pages when the kexec kernel runs. We already fixed the same issue for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for hibernation") and now let's fix it for kexec.

2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so between hv_kexec_handler() and native_machine_shutdown(), the other CPUs can still try to access the hypercall page and cause panic. The workaround "hv_hypercall_pg = NULL;" in hyperv_cleanup() can't work reliably. Move hyperv_cleanup() to the right place.

Signed-off-by: Dexuan Cui
---
 arch/x86/hyperv/hv_init.c       |  4 ++++
 arch/x86/include/asm/mshyperv.h |  2 ++
 arch/x86/kernel/cpu/mshyperv.c  | 18 ++++++++++++++++++
 drivers/hv/vmbus_drv.c          |  2 --
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..4638a52d8eae 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -26,6 +27,8 @@
 #include
 #include

+int hyperv_init_cpuhp;
+
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);

@@ -401,6 +404,7 @@ void __init hyperv_init(void)

 	register_syscore_ops(&hv_syscore_ops);

+	hyperv_init_cpuhp = cpuhp;
 	return;

 remove_cpuhp_state:

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..30f76b966857 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -74,6 +74,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 #if IS_ENABLED(CONFIG_HYPERV)

+extern int hyperv_init_cpuhp;
+
 extern void *hv_hypercall_pg;
 extern void __percpu **hyperv_pcpu_input_arg;

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f628e3dc150f..43b54bef5448 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -135,14 +135,32 @@ static void hv_machine_shutdown(void)
 {
 	if (kexec_in_progress && hv_kexec_handler)
 		hv_kexec_handler();
+
+	/*
+	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
+	 * corrupts the old VP Assist Pages and can crash the kexec kernel.
+	 */
+	if (kexec_in_progress && hyperv_init_cpuhp > 0)
+		cpuhp_remove_state(hyperv_init_cpuhp);
+
+	/* The function calls stop_other_cpus(). */
 	native_machine_shutdown();
+
+	/* Disable the hypercall page when there is only 1 active CPU. */
+	if (kexec_in_progress)
+		hyperv_cleanup();
 }

 static void hv_machine_crash_shutdown(struct pt_regs *regs)
 {
 	if (hv_crash_handler)
 		hv_crash_handler(regs);
+
+	/* The function calls crash_smp_send_stop(). */
 	native_machine_crash_shutdown(regs);
+
+	/* Disable the hypercall page when there is only 1 active CPU. */
+	hyperv_cleanup();
 }
 #endif /* CONFIG_KEXEC_CORE */
 #endif /* CONFIG_HYPERV */

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 502f8cd95f6d..d491fdcee61f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2550,7 +2550,6 @@ static void hv_kexec_handler(void)
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
 	mb();
 	cpuhp_remove_state(hyperv_cpuhp_online);
-	hyperv_cleanup();
 };

 static void hv_crash_handler(struct pt_regs *regs)
@@ -2566,7 +2565,6 @@ static void hv_crash_handler(struct pt_regs *regs)
 	cpu = smp_processor_id();
 	hv_stimer_cleanup(cpu);
 	hv_synic_disable_regs(cpu);
-	hyperv_cleanup();
 };

 static int hv_synic_suspend(void)
--
2.19.1
RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
> From: Dexuan Cui
> Sent: Thursday, December 17, 2020 8:08 PM
>
> Linux VM on Hyper-V crashes with the latest mainline:
> ...
> This is because of the recent buffer overflow detection in the
> commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified
> string functions")
>
> Here acpi_device_bus_id->bus_id can only hold 14 characters, while
> the acpi_device_hid(device) returns a 22-char string
> "HYPER_V_GEN_COUNTER_V1".
>
> Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a
> string, it must be of the form AAA#### or NNNN####, i.e. 7 chars or 8
> chars.
>
> The field bus_id in struct acpi_device_bus_id was originally defined as
> char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the
> commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI
> devices")
>
> It looks like so far an ID string of >=15 chars is only seen in the guest
> BIOS/firmware by Hyper-V, and AFAIK the ID string
> "HYPER_V_GEN_COUNTER_V1"
> is never used by Linux VM on Hyper-V, so let's just truncate the string to
> fix the panic.
>
> Signed-off-by: Dexuan Cui

IMO this patch should also go to the stable trees, so please add Cc:

Thanks,
-- Dexuan
[PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow
Linux VM on Hyper-V crashes with the latest mainline:

[4.069624] detected buffer overflow in strcpy
[4.077733] kernel BUG at lib/string.c:1149!
..
[4.085819] RIP: 0010:fortify_panic+0xf/0x11
...
[4.085819] Call Trace:
[4.085819]  acpi_device_add.cold.15+0xf2/0xfb
[4.085819]  acpi_add_single_object+0x2a6/0x690
[4.085819]  acpi_bus_check_add+0xc6/0x280
[4.085819]  acpi_ns_walk_namespace+0xda/0x1aa
[4.085819]  acpi_walk_namespace+0x9a/0xc2
[4.085819]  acpi_bus_scan+0x78/0x90
[4.085819]  acpi_scan_init+0xfa/0x248
[4.085819]  acpi_init+0x2c1/0x321
[4.085819]  do_one_initcall+0x44/0x1d0
[4.085819]  kernel_init_freeable+0x1ab/0x1f4

This is because of the recent buffer overflow detection in the commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified string functions")

Here acpi_device_bus_id->bus_id can only hold 14 characters, while the acpi_device_hid(device) returns a 22-char string "HYPER_V_GEN_COUNTER_V1".

Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a string, it must be of the form AAA#### or NNNN####, i.e. 7 chars or 8 chars.

The field bus_id in struct acpi_device_bus_id was originally defined as char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI devices")

It looks like so far an ID string of >=15 chars is only seen in the guest BIOS/firmware by Hyper-V, and AFAIK the ID string "HYPER_V_GEN_COUNTER_V1" is never used by Linux VM on Hyper-V, so let's just truncate the string to fix the panic.
Signed-off-by: Dexuan Cui
---
 drivers/acpi/scan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index a1b226eb2ce2..b801442b6b1b 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -674,7 +674,8 @@ int acpi_device_add(struct acpi_device *device,
 	}
 	if (!found) {
 		acpi_device_bus_id = new_bus_id;
-		strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device));
+		strlcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device),
+			sizeof(acpi_device_bus_id->bus_id));
 		acpi_device_bus_id->instance_no = 0;
 		list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list);
 	}
--
2.19.1
RE: [PATCH 2/3] scsi: storvsc: Resolve data race in storvsc_probe()
> From: Andrea Parri (Microsoft) > Sent: Thursday, December 17, 2020 12:33 PM > > vmscsi_size_delta can be written concurrently by multiple instances of > storvsc_probe(), corresponding to multiple synthetic IDE/SCSI devices; > cf. storvsc_drv's probe_type == PROBE_PREFER_ASYNCHRONOUS. Change > the > global variable vmscsi_size_delta to per-synthetic-IDE/SCSI-device. > > Suggested-by: Dexuan Cui > Signed-off-by: Andrea Parri (Microsoft) > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: linux-s...@vger.kernel.org Reviewed-by: Dexuan Cui
RE: [PATCH 3/3] scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()
> From: Andrea Parri (Microsoft) > Sent: Thursday, December 17, 2020 12:33 PM > > Check that the packet is of the expected size at least, don't copy data > past the packet. > > Reported-by: Saruhan Karademir > Signed-off-by: Andrea Parri (Microsoft) > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: linux-s...@vger.kernel.org Reviewed-by: Dexuan Cui
RE: [PATCH 1/3] scsi: storvsc: Fix max_outstanding_req_per_channel for Win8 and newer
> From: Andrea Parri (Microsoft) > Sent: Thursday, December 17, 2020 12:33 PM > > Current code overestimates the value of max_outstanding_req_per_channel > for Win8 and newer hosts, since vmscsi_size_delta is set to the initial > value of sizeof(vmscsi_win8_extension) rather than zero. This may lead > to wrong decisions when using ring_avail_percent_lowater equals to zero. > The estimate of max_outstanding_req_per_channel is 'exact' for Win7 and > older hosts. A better choice, keeping the algorithm for the estimation > simple, is to err the other way around, i.e., to underestimate for Win7 > and older but to use the exact value for Win8 and newer. > > Suggested-by: Dexuan Cui > Signed-off-by: Andrea Parri (Microsoft) > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: linux-s...@vger.kernel.org Reviewed-by: Dexuan Cui
RE: static_branch_enable() does not work from a __init function?
> From: Peter Zijlstra > Sent: Wednesday, December 16, 2020 2:59 AM > ... > So I think the reason your above module doesn't work, while the one in > vmx_init() does work (for 5.10) should be fixed by the completely > untested below. > > I've no clue about 5.4 and no desire to investigate. That's what distro > people are for. > > Can you verify? > > --- > diff --git a/kernel/jump_label.c b/kernel/jump_label.c > index 015ef903ce8c..c6a39d662935 100644 > --- a/kernel/jump_label.c > +++ b/kernel/jump_label.c > @@ -793,6 +793,7 @@ int jump_label_text_reserved(void *start, void *end) > static void jump_label_update(struct static_key *key) > { > struct jump_entry *stop = __stop___jump_table; > + bool init = system_state < SYSTEM_RUNNING; > struct jump_entry *entry; > #ifdef CONFIG_MODULES > struct module *mod; > @@ -804,15 +805,16 @@ static void jump_label_update(struct static_key > *key) > > preempt_disable(); > mod = __module_address((unsigned long)key); > - if (mod) > + if (mod) { > stop = mod->jump_entries + mod->num_jump_entries; > + init = mod->state == MODULE_STATE_COMING; > + } > preempt_enable(); > #endif > entry = static_key_entries(key); > /* if there are no users, entry can be NULL */ > if (entry) > - __jump_label_update(key, entry, stop, > - system_state < SYSTEM_RUNNING); > + __jump_label_update(key, entry, stop, init); > } > > #ifdef CONFIG_STATIC_KEYS_SELFTEST Yes, this patch fixes the issue found by the test module for both v5.10 and v5.4. Thank you, Peter! Dexuan
static_branch_enable() does not work from a __init function?
Hi,

The below init_module() prints "foo: false". This is strange since static_branch_enable() is called before the static_branch_unlikely(). This strange behavior happens with both v5.10 and an old v5.4 kernel.

If I remove the "__init" marker from the init_module() function, then I get the expected output of "foo: true"! I guess here I'm missing something with Static Keys?

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(enable_foo);

int __init init_module(void)
{
	static_branch_enable(&enable_foo);

	if (static_branch_unlikely(&enable_foo))
		printk("foo: true\n");
	else
		printk("foo: false\n");

	return 0;
}

void cleanup_module(void)
{
	static_branch_disable(&enable_foo);
}

MODULE_LICENSE("GPL");

PS, I originally found: in arch/x86/kvm/vmx/vmx.c: vmx_init(), it looks like the line "static_branch_enable(&enable_evmcs);" does not take effect in a v5.4-based kernel, but does take effect in the v5.10 kernel in the same x86-64 virtual machine on Hyper-V, so I made the above test module to test static_branch_enable(), and found that static_branch_enable() in the test module does not work with both v5.10 and my v5.4 kernel, if the __init marker is used.

Thanks,
-- Dexuan
How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?
Hi all,

It looks like Linux can hibernate even if the system does not support the ACPI S4 state, as long as the system can shut down, so "cat /sys/power/state" always contains "disk", unless we specify the kernel parameter "nohibernate" or we use LOCKDOWN_HIBERNATION.

In some scenarios IMO it can still be useful if the userspace is able to detect whether the ACPI S4 state is supported, e.g. when a Linux guest runs on Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the proper support of the tool stack on the host, i.e. the guest is discouraged from trying hibernation if the state is not supported.

I know we can check the S4 state by 'dmesg':

# dmesg |grep ACPI: | grep support
[3.034134] ACPI: (supports S0 S4 S5)

But this method is unreliable because the kernel msg buffer can be filled and overwritten. Is there any better method? If not, do you think the below patch is appropriate? Thanks!

diff --git a/kernel/power/main.c b/kernel/power/main.c
index 0aefd6f57e0a..931a1526ea69 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -600,8 +601,12 @@ static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
 		s += sprintf(s,"%s ", pm_states[i]);
 #endif
-	if (hibernation_available())
-		s += sprintf(s, "disk ");
+	if (hibernation_available()) {
+		if (acpi_sleep_state_supported(ACPI_STATE_S4))
+			s += sprintf(s, "disk+ ");
+		else
+			s += sprintf(s, "disk ");
+	}
 	if (s != buf)
 		/* convert the last space to a newline */
 		*(s-1) = '\n';

Thanks,
-- Dexuan
RE: [PATCH] iommu/hyper-v: Fix panic on a host without the 15-bit APIC ID support
> From: Thomas Gleixner > Sent: Wednesday, December 2, 2020 1:56 AM > > On Tue, Dec 01 2020 at 16:45, Dexuan Cui wrote: > > The commit f36a74b9345a itself is good, but it causes a panic in a > > Linux VM that runs on a Hyper-V host that doesn't have the 15-bit > > Extended APIC ID support: > > kernel BUG at arch/x86/kernel/apic/io_apic.c:2408! > > This has nothing to do with the 15bit APIC ID support, really. > > The point is that the select() function only matches when I/O APIC ID is > 0, which is not guaranteed. That's independent of the 15bit extended > APIC ID feature. But the I/O-APIC ID is irrelevant on hyperv because > there is only one. > > > This happens because the Hyper-V ioapic_ir_domain (which is defined in > > drivers/iommu/hyperv-iommu.c) can not be found. Fix the panic by > > properly claiming the only I/O APIC emulated by Hyper-V. > > We don't fix a panic. We fix a bug in the code :) > > I'll amend the changelog. > > Thanks, > > tglx Thank you for reworking the commit log, tglx! Dexuan
[tip: x86/apic] iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
The following commit has been merged into the x86/apic branch of tip:

Commit-ID:     26ab12bb9d96133b7880141d68b5e01a8783de9d
Gitweb:        https://git.kernel.org/tip/26ab12bb9d96133b7880141d68b5e01a8783de9d
Author:        Dexuan Cui
AuthorDate:    Tue, 01 Dec 2020 16:45:10 -08:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 11:22:55 +01:00

iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()

commit a491bb19f728 ("iommu/hyper-v: Implement select() method on remapping irqdomain") restricted the irq_domain_ops::select() callback to match on I/O-APIC index 0, which was correct until the parameter was changed to carry the I/O APIC ID in commit f36a74b9345a. If the ID is not 0 then the match fails. Therefore I/O-APIC init fails to retrieve the parent irqdomain for the I/O-APIC resulting in a boot panic:

  kernel BUG at arch/x86/kernel/apic/io_apic.c:2408!

Fix it by matching the I/O-APIC independent of the ID as there is only one I/O APIC emulated by Hyper-V.

[ tglx: Amended changelog ]

Fixes: f36a74b9345a ("x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index")
Signed-off-by: Dexuan Cui
Signed-off-by: Thomas Gleixner
Reviewed-by: David Woodhouse
Link: https://lore.kernel.org/r/20201202004510.1818-1-de...@microsoft.com
---
 drivers/iommu/hyperv-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 9438daa..1d21a0b 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -105,8 +105,8 @@ static int hyperv_irq_remapping_select(struct irq_domain *d,
 				       struct irq_fwspec *fwspec,
 				       enum irq_domain_bus_token bus_token)
 {
-	/* Claim only the first (and only) I/OAPIC */
-	return x86_fwspec_is_ioapic(fwspec) && fwspec->param[0] == 0;
+	/* Claim the only I/O APIC emulated by Hyper-V */
+	return x86_fwspec_is_ioapic(fwspec);
 }

 static const struct irq_domain_ops hyperv_ir_domain_ops = {
[PATCH] iommu/hyper-v: Fix panic on a host without the 15-bit APIC ID support
The commit f36a74b9345a itself is good, but it causes a panic in a Linux VM that runs on a Hyper-V host that doesn't have the 15-bit Extended APIC ID support: kernel BUG at arch/x86/kernel/apic/io_apic.c:2408! This happens because the Hyper-V ioapic_ir_domain (which is defined in drivers/iommu/hyperv-iommu.c) can not be found. Fix the panic by properly claiming the only I/O APIC emulated by Hyper-V. Cc: David Woodhouse Cc: Vitaly Kuznetsov Fixes: f36a74b9345a ("x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index") Signed-off-by: Dexuan Cui --- drivers/iommu/hyperv-iommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) This patch is for the tip.git tree's x86/apic branch. diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c index 9438daa24fdb..1d21a0b5f724 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv-iommu.c @@ -105,8 +105,8 @@ static int hyperv_irq_remapping_select(struct irq_domain *d, struct irq_fwspec *fwspec, enum irq_domain_bus_token bus_token) { - /* Claim only the first (and only) I/OAPIC */ - return x86_fwspec_is_ioapic(fwspec) && fwspec->param[0] == 0; + /* Claim the only I/O APIC emulated by Hyper-V */ + return x86_fwspec_is_ioapic(fwspec); } static const struct irq_domain_ops hyperv_ir_domain_ops = { base-commit: d1adcfbb520c43c10fc22fcdccdd4204e014fb53 -- 2.27.0
RE: kdump always hangs in rcu_barrier() -> wait_for_completion()
> From: Paul E. McKenney
> Sent: Thursday, November 26, 2020 1:42 PM
>
> > > Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> > > something else is preventing RCU grace periods from completing, but in
> >
> > It looks like somehow the scheduling is not working here: in rcu_barrier(),
> > if I replace the wait_for_completion() with
> > wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), the
> > issue persists.
>
> Have you tried using sysreq-t to see what the various tasks are doing?

Will try it. BTW, this is a "Generation 2" VM on Hyper-V, meaning sysrq only starts to work after the Hyper-V para-virtualized keyboard driver loads... So, at this early point, sysrq is not working. :-( I'll have to hack the code and use a virtual NMI interrupt to force the sysrq handler to be called.

> Having interrupts disabled on all CPUs would have the effect of disabling
> the RCU CPU stall warnings.
> Thanx, Paul

I'm sure the interrupts are not disabled. Here the VM only has 1 virtual CPU, and when the hang issue happens the virtual serial console is still responding when I press Enter (it prints a new line) or Ctrl+C (it prints ^C).

Here the VM does not use the "legacy timers" (PIT, Local APIC timer, etc.) at all. Instead, the VM uses the Hyper-V para-virtualized timers. It looks like the Hyper-V timer never fires in the kdump kernel when the hang issue happens. I'm looking into this... I suspect this hang issue may only be specific to Hyper-V.

Thanks,
-- Dexuan
RE: kdump always hangs in rcu_barrier() -> wait_for_completion()
> From: Paul E. McKenney
> Sent: Thursday, November 26, 2020 7:47 AM
> ...
> The rcu_segcblist_n_cbs() function returns non-zero because something
> invoked call_rcu() some time previously. The ftrace facility (or just
> a printk) should help you work out where that call_rcu() is located.

call_rcu() is indeed called multiple times, but as you said, this should be normal.

> My best guess is that the underlying bug is that you are invoking
> rcu_barrier() before the RCU grace-period kthread has been created.
> This means that RCU grace periods cannot complete, which in turn means
> that if there has been even one invocation of call_rcu() since boot,
> rcu_barrier() cannot complete, which is what you are in fact seeing.
> Please note that it is perfectly legal to invoke call_rcu() very early in
> the boot process, as in even before the call to rcu_init(). Therefore,
> if this is the case, the bug is the early call to rcu_barrier(), not
> the early calls to call_rcu().
>
> To check this, at the beginning of rcu_barrier(), check the value of
> rcu_state.gp_kthread. If my guess is correct, it will be NULL.

Unluckily, it's not NULL here. :-)

> Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> something else is preventing RCU grace periods from completing, but in

It looks like somehow the scheduling is not working here: in rcu_barrier(), if I replace the wait_for_completion() with wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), the issue persists.

> that case you should see RCU CPU stall warnings. Unless of course they
> have been disabled.
> Thanx, Paul

I guess I didn't disable the warnings (I don't even know how to do that :)

grep RCU .config

# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem

CONFIG_MMU_GATHER_RCU_TABLE_FREE=y

# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=30
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y
# end of RCU Debugging

Thanks,
-- Dexuan
RE: [PATCH 1/2] x86/kexec: Use up-to-dated screen_info copy to fill boot params
> From: Dexuan Cui
> Sent: Monday, November 16, 2020 7:40 PM
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c
> > b/arch/x86/kernel/kexec-bzimage64.c
> > index 57c2ecf43134..ce831f9448e7 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct
> > boot_params *params,
> > 	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> >
> > 	/* Copying screen_info will do? */
> > -	memcpy(&params->screen_info, &boot_params.screen_info,
> > -	       sizeof(struct screen_info));
> > +	memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
> >
> > 	/* Fill in memsize later */
> > 	params->screen_info.ext_mem_k = 0;
> > --
>
> Hi Kairui,
> According to "man kexec", kdump/kexec can use 2 different syscalls to set up the
> kdump kernel:
>
> -s (--kexec-file-syscall)
>    Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.
>
> -c (--kexec-syscall)
>    Specify that the old KEXEC_LOAD syscall should be used exclusively (the default).
>
> It looks I can only reproduce the call-trace
> (https://bugzilla.redhat.com/show_bug.cgi?id=1867887#c5) with
> KEXEC_FILE_LOAD:
> I did kdump tests in Ubuntu 20.04 VM and by default the VM used the KEXEC_LOAD
> syscall and I couldn't reproduce the call-trace; after I added the "-s" parameter
> to use the KEXEC_FILE_LOAD syscall, I could reproduce the call-trace and I can
> confirm your patch can eliminate the call-trace because the "efifb" driver
> doesn't even load with your patch.
>
> Your patch is only for the KEXEC_FILE_LOAD syscall, and I'm sure it's not used in
> the code path of the KEXEC_LOAD syscall.
>
> So, in the case of the KEXEC_LOAD syscall, do you know how the *kexec*
> kernel's boot_params.screen_info.lfb_base is initialized? I haven't figured it
> out yet.
FYI: in the case of the KEXEC_LOAD syscall, I think the lfb_base of the kexec kernel is pre-setup by the kexec tool (see the function setup_linux_vesafb()):

https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/i386/x86-linux-setup.c#n126

static int setup_linux_vesafb(struct x86_linux_param_header *real_mode)
{
	struct fb_fix_screeninfo fix;
	struct fb_var_screeninfo var;
	int fd;

	fd = open("/dev/fb0", O_RDONLY);
	if (-1 == fd)
		return -1;
	if (-1 == ioctl(fd, FBIOGET_FSCREENINFO, &fix))
		goto out;
	if (-1 == ioctl(fd, FBIOGET_VSCREENINFO, &var))
		goto out;
	if (0 == strcmp(fix.id, "VESA VGA")) {
		/* VIDEO_TYPE_VLFB */
		real_mode->orig_video_isVGA = 0x23;
	} else if (0 == strcmp(fix.id, "EFI VGA")) {
		/* VIDEO_TYPE_EFI */
		real_mode->orig_video_isVGA = 0x70;
	} else if (arch_options.reuse_video_type) {
		int err;
		off_t offset = offsetof(typeof(*real_mode), orig_video_isVGA);

		/* blindly try old boot time video type */
		err = get_bootparam(&real_mode->orig_video_isVGA, offset, 1);
		if (err)
			goto out;
	} else {
		real_mode->orig_video_isVGA = 0;
		close(fd);
		return 0;
	}

When a Ubuntu 20.10 VM (kexec-tools-2.0.20) runs on Hyper-V, we should fall into the last condition, i.e. setting "real_mode->orig_video_isVGA = 0;", so the "efifb" driver does not load in the kdump kernel.

Ubuntu 20.04 (kexec-tools-2.0.18) is a little old in that it does not have Kairui's patch https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=fb5a8792e6e4ee7de7ae3e06d193ea5beaaececc , so it re-uses the VRAM location set up by the hyperv_fb driver, which is undesirable because the "efifb" driver doesn't know it's accessing an "incompatible" framebuffer -- IMO this may be just a small issue, but anyway I hope Ubuntu 20.04's kexec-tools will pick up your patch.

So, now we should cover all the combinations if we use the latest kernel and the latest kexec-tools, and the "efifb" driver in the kdump kernel doesn't load.

Thanks,
-- Dexuan
kdump always hangs in rcu_barrier() -> wait_for_completion()
Hi,

I happened to hit a kdump hang issue in a Linux VM running on some Hyper-V host. Please see the attached log: the kdump kernel always hangs, even if I configure only 1 virtual CPU to the VM.

I firstly hit the issue in RHEL 8.3's 4.18.x kernel, but later I found that the latest upstream v5.10-rc5 also has the same issue (at least the symptom is exactly the same), so I dug into v5.10-rc5 and found that the kdump kernel always hangs in kernel_init() -> mark_readonly() -> rcu_barrier() -> wait_for_completion(&rcu_state.barrier_completion).

Let's take the 1-vCPU case for example (refer to the attached log): in the below code, somehow rcu_segcblist_n_cbs() returns true, so the call smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1) increases the counter by 1, and hence later the counter is still 1 after the atomic_sub_and_test(), and the complete() is not called.

static void rcu_barrier_func(void *cpu_in)
{
	...
	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
		atomic_inc(&rcu_state.barrier_cpu_count);
	} else {
	...
}

void rcu_barrier(void)
{
	...
	atomic_set(&rcu_state.barrier_cpu_count, 2);
	...
	for_each_possible_cpu(cpu) {
		rdp = per_cpu_ptr(&rcu_data, cpu);
		...
		if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
			...
			smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1);
			...
		}
	}
	...
	if (atomic_sub_and_test(2, &rcu_state.barrier_cpu_count))
		complete(&rcu_state.barrier_completion);
	...
	wait_for_completion(&rcu_state.barrier_completion);

Sorry for my ignorance of RCU -- I'm not sure why the rcu_segcblist_n_cbs() returns 1 here. In the normal kernel, it returns 0, so the normal kernel does not hang.

Note: in the case of kdump kernel, if I remove the kernel parameter console=ttyS0 OR if I build the kernel with CONFIG_HZ=250, the issue can no longer reproduce. Currently my kernel uses CONFIG_HZ=1000 and I use console=ttyS0, so I'm able to reproduce the issue every time.
Note: the same kernel binary can not reproduce the issue when the VM runs on another Hyper-V host. It looks like there is some kind of race condition? Looking forward to your insights! I'm happy to test any patch or enable more tracing, if necessary. Thanks!

Thanks,
-- Dexuan

bad-hz-1000.log
RE: [PATCH] video: hyperv_fb: Directly use the MMIO VRAM
Hi Wei Liu,

Please do not pick up this patch, because actually MMIO VRAM can not work with fb_deferred_io. Previously I didn't test Xorg -- sorry. As soon as I tested it, I got the below warning and the Xorg program terminated immediately:

[ 28.148432] WARNING: CPU: 19 PID: 1410 at mm/vmalloc.c:383 vmalloc_to_page+0x14b/0x150
...
[ 28.192959] CPU: 19 PID: 1410 Comm: Xorg Tainted: GE 5.10.0-rc1+ #4
...
[ 28.208720] RIP: 0010:vmalloc_to_page+0x14b/0x150
...
[ 28.299231] Call Trace:
[ 28.301428]  fb_deferred_io_fault+0x3a/0xa0
[ 28.305276]  __do_fault+0x36/0x120
[ 28.308276]  handle_mm_fault+0x1144/0x1950
[ 28.311963]  exc_page_fault+0x290/0x510
[ 28.315551]  ? asm_exc_page_fault+0x8/0x30
[ 28.319186]  asm_exc_page_fault+0x1e/0x30
[ 28.322969] RIP: 0033:0x7fbeda3ec2f5

The issue is that fb_deferred_io_page() requires that the PFN be backed by a struct page, but it looks like the MMIO address is not backed by a struct page. So I have to drop this patch. Thanks Wei Hu and Michael for pointing this out!

FYI:

drivers/video/fbdev/core/fb_defio.c:

static struct page *fb_deferred_io_page(struct fb_info *info, unsigned long offs)
{
	void *screen_base = (void __force *) info->screen_base;
	struct page *page;

	if (is_vmalloc_addr(screen_base + offs))
		page = vmalloc_to_page(screen_base + offs);
	else
		page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);

	return page;
}

/* this is to find and return the vmalloc-ed fb pages */
static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
{
	unsigned long offset;
	struct page *page;
	struct fb_info *info = vmf->vma->vm_private_data;

	offset = vmf->pgoff << PAGE_SHIFT;
	if (offset >= info->fix.smem_len)
		return VM_FAULT_SIGBUS;

	page = fb_deferred_io_page(info, offset);
	if (!page)
		return VM_FAULT_SIGBUS;

Thanks,
-- Dexuan
[PATCH] video: hyperv_fb: Directly use the MMIO VRAM
Late in 2019, 2 commits (see the 2 Fixes tags) were introduced to mitigate the slow framebuffer issue. Now that we have fixed the slowness issue by correctly mapping the MMIO VRAM (see commit 5f1251a48c17 ("video: hyperv_fb: Fix the cache type when mapping the VRAM")), we can remove most of the code introduced by the 2 commits, i.e. we no longer need to allocate physical memory and use it to back up the VRAM in Generation-1 VM, and we also no longer need to allocate physical memory to back up the framebuffer in a Generation-2 VM and copy the framebuffer to the real VRAM. synthvid_deferred_io() is kept, because it's still desirable to send the SYNTHVID_DIRT message only for the exact dirty rectangle, and only when needed. Fixes: d21987d709e8 ("video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver") Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV Gen 1 VMs.") Cc: Wei Hu Cc: Boqun Feng Signed-off-by: Dexuan Cui --- This patch changes drivers/video/fbdev/Kconfig, but I hope this can still go through the Hyper-V tree https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-next because it's unlikely to cause any build issue to other fbdev drivers (that line was introduced by 3a6fb6c4255c only for hyperv_fb.c) Note: this patch is based on the Hyper-V tree's hyperv-fixes branch, but it should also apply cleanly to the branch hyperv-next if the commit 5f1251a48c17 is applied first. This patch is for v5.11 rather than v5.10. 
 drivers/video/fbdev/Kconfig     |   1 -
 drivers/video/fbdev/hyperv_fb.c | 170 ++--
 2 files changed, 9 insertions(+), 162 deletions(-)

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index 402e85450bb5..05b37fb3c6d6 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -2205,7 +2205,6 @@ config FB_HYPERV
 	select FB_CFB_COPYAREA
 	select FB_CFB_IMAGEBLIT
 	select FB_DEFERRED_IO
-	select DMA_CMA if HAVE_DMA_CONTIGUOUS && CMA
 	help
 	  This framebuffer driver supports Microsoft Hyper-V Synthetic Video.

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 58c74d2356ba..8131f4e66f98 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -31,16 +31,6 @@
  * "set-vmvideo" command. For example
  *    set-vmvideo -vmname name -horizontalresolution:1920 \
  *        -verticalresolution:1200 -resolutiontype single
- *
- * Gen 1 VMs also support direct using VM's physical memory for framebuffer.
- * It could improve the efficiency and performance for framebuffer and VM.
- * This requires to allocate contiguous physical memory from Linux kernel's
- * CMA memory allocator. To enable this, supply a kernel parameter to give
- * enough memory space to CMA allocator for framebuffer. For example:
- *    cma=130m
- * This gives 130MB memory to CMA allocator that can be allocated to
- * framebuffer. For reference, 8K resolution (7680x4320) takes about
- * 127MB memory.
  */

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

@@ -267,16 +257,8 @@ struct hvfb_par {
 	/* If true, the VSC notifies the VSP on every framebuffer change */
 	bool synchronous_fb;

-	/* If true, need to copy from deferred IO mem to framebuffer mem */
-	bool need_docopy;
-
 	struct notifier_block hvfb_panic_nb;

-	/* Memory for deferred IO and frame buffer itself */
-	unsigned char *dio_vp;
-	unsigned char *mmio_vp;
-	phys_addr_t mmio_pp;
-
 	/* Dirty rectangle, protected by delayed_refresh_lock */
 	int x1, y1, x2, y2;
 	bool delayed_refresh;

@@ -405,21 +387,6 @@ synthvid_update(struct fb_info *info, int x1, int y1, int x2, int y2)
 	return 0;
 }

-static void hvfb_docopy(struct hvfb_par *par,
-			unsigned long offset,
-			unsigned long size)
-{
-	if (!par || !par->mmio_vp || !par->dio_vp || !par->fb_ready ||
-	    size == 0 || offset >= dio_fb_size)
-		return;
-
-	if (offset + size > dio_fb_size)
-		size = dio_fb_size - offset;
-
-	memcpy(par->mmio_vp + offset, par->dio_vp + offset, size);
-}
-
-/* Deferred IO callback */
 static void synthvid_deferred_io(struct fb_info *p,
 				 struct list_head *pagelist)
 {
@@ -444,10 +411,6 @@ static void synthvid_deferred_io(struct fb_info *p,
 		y2 = end / p->fix.line_length;
 		miny = min_t(int, miny, y1);
 		maxy = max_t(int, maxy, y2);
-
-		/* Copy from dio space to mmio address */
-		if (par->fb_ready && par->need_docopy)
-			hvfb_docopy(par, start, PAGE_SIZE);
 	}

 	if (par->fb_ready && par->update)
@@ -
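As a rough illustration of why synthvid_deferred_io() is worth keeping: folding the list of dirty pages into one bounding row span lets the driver send a single SYNTHVID_DIRT message covering just the touched rectangle. Below is a minimal userspace sketch of that min/max computation; the function name and the fixed 4096-byte page size are illustrative, not the driver's exact code.

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SIZE 4096UL

/* Fold a list of dirty page offsets into one [miny, maxy] row span,
 * the way a deferred-IO callback computes the dirty rectangle before
 * reporting it in a single message. */
static void dirty_rows(const unsigned long *starts, int n,
                       unsigned long line_length, int *miny, int *maxy)
{
    int i;

    *miny = INT_MAX;
    *maxy = 0;
    for (i = 0; i < n; i++) {
        unsigned long start = starts[i];
        unsigned long end = start + PAGE_SIZE - 1;
        int y1 = (int)(start / line_length);
        int y2 = (int)(end / line_length);

        if (y1 < *miny)
            *miny = y1;
        if (y2 > *maxy)
            *maxy = y2;
    }
}
```

With this shape, touching two non-adjacent pages still produces one contiguous row range, which is exactly what a single dirty-rectangle message needs.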
[PATCH] video: hyperv_fb: Fix the cache type when mapping the VRAM
x86 Hyper-V used to essentially always overwrite the effective cache type of guest memory accesses to WB. This was problematic in cases where there is a physical device assigned to the VM, since that often requires the VM to have control over cache types. Thus, on newer Hyper-V since 2018, Hyper-V always honors the VM's cache type, but unexpectedly Linux VM users started to complain that the Linux VM's VRAM had become very slow, and it turns out the Linux VM should not map the VRAM uncacheable with ioremap(). Fix this slowness issue by using ioremap_cache().

On ARM64, ioremap_cache() is also required because the host maps the VRAM cacheable; otherwise VM Connect can't display properly with ioremap() or ioremap_wc().

With this change, the VRAM on new Hyper-V is as fast as regular RAM, so it's no longer necessary to use the hacks we added to mitigate the slowness, i.e. we no longer need to allocate physical memory and use it to back up the VRAM in a Generation-1 VM, and we also no longer need to allocate physical memory to back up the framebuffer in a Generation-2 VM and copy the framebuffer to the real VRAM. A further big change will address these for v5.11.

Fixes: 68a2d20b79b1 ("drivers/video: add Hyper-V Synthetic Video Frame Buffer Driver")
Tested-by: Boqun Feng
Signed-off-by: Dexuan Cui
---
Hi Wei Liu, can you please pick this up into the hyperv/linux.git tree's hyperv-fixes branch?
I really hope this patch can be in v5.10 since it fixes a longstanding issue:
https://github.com/LIS/lis-next/issues/655

 drivers/video/fbdev/hyperv_fb.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 5bc86f481a78..c8b0ae676809 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -1093,7 +1093,12 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 		goto err1;
 	}

-	fb_virt = ioremap(par->mem->start, screen_fb_size);
+	/*
+	 * Map the VRAM cacheable for performance. This is also required for
+	 * VM Connect to display properly for ARM64 Linux VM, as the host also
+	 * maps the VRAM cacheable.
+	 */
+	fb_virt = ioremap_cache(par->mem->start, screen_fb_size);
 	if (!fb_virt)
 		goto err2;
--
2.19.1
RE: [PATCH 1/2] x86/kexec: Use up-to-dated screen_info copy to fill boot params
> From: Kairui Song
> Sent: Wednesday, October 14, 2020 2:24 AM
> To: linux-kernel@vger.kernel.org
>
> kexec_file_load now just reuses the old boot_params.screen_info.
> But if drivers have changed the hardware state, boot_param.screen_info
> could contain invalid info.
>
> For example, the video type might no longer be VGA, or the frame buffer
> address might have changed. If the kexec'ed kernel keeps using the old
> screen_info, it may attempt to write to an invalid framebuffer
> memory region.
>
> There are two screen_info structures globally available,
> boot_params.screen_info and screen_info. The latter is a copy, and can
> be updated by drivers.
>
> So let kexec_file_load use the updated copy.
>
> Signed-off-by: Kairui Song
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 57c2ecf43134..ce831f9448e7 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
>  	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
>
>  	/* Copying screen_info will do? */
> -	memcpy(&params->screen_info, &boot_params.screen_info,
> -	       sizeof(struct screen_info));
> +	memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
>
>  	/* Fill in memsize later */
>  	params->screen_info.ext_mem_k = 0;
> --

Hi Kairui,
According to "man kexec", kdump/kexec can use 2 different syscalls to set up the kdump kernel:

       -s  (--kexec-file-syscall)
           Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.

       -c  (--kexec-syscall)
           Specify that the old KEXEC_LOAD syscall should be used exclusively (the default).
It looks like I can only reproduce the call-trace (https://bugzilla.redhat.com/show_bug.cgi?id=1867887#c5) with KEXEC_FILE_LOAD: I did kdump tests in an Ubuntu 20.04 VM, and by default the VM used the KEXEC_LOAD syscall and I couldn't reproduce the call-trace; after I added the "-s" parameter to use the KEXEC_FILE_LOAD syscall, I could reproduce the call-trace, and I can confirm your patch eliminates it, because the "efifb" driver doesn't even load with your patch.

Your patch is only for the KEXEC_FILE_LOAD syscall, and I'm sure it's not used in the code path of the KEXEC_LOAD syscall. So, in the case of the KEXEC_LOAD syscall, do you know how the *kexec* kernel's boot_params.screen_info.lfb_base is initialized? I haven't figured it out yet.

Thanks,
-- Dexuan
[tip: x86/apic] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
The following commit has been merged into the x86/apic branch of tip:

Commit-ID:     d981059e13ffa9ed03a73472e932d070323bd057
Gitweb:        https://git.kernel.org/tip/d981059e13ffa9ed03a73472e932d070323bd057
Author:        Dexuan Cui
AuthorDate:    Mon, 02 Nov 2020 17:11:36 -08:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 04 Nov 2020 11:10:52 +01:00

x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs, the CPUs can't be the destination of IOAPIC interrupts, because the IOAPIC RTE's Dest Field has only 8 bits. Currently the hacky driver drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are only routed to CPUs that don't have >255 APIC IDs. However, there is an issue with kdump, because the kdump kernel can run on any CPU, and hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which was introduced recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the support of the underlying hypervisor. The latest Hyper-V has added the support recently: with this commit, on such a Hyper-V host, the Linux VM does not use hyperv-iommu.c because hyperv_prepare_irq_remapping() returns -ENODEV; instead, the Linux kernel's generic support of the Extended Dest ID from David is used, meaning that the Linux VM is able to support up to 32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing changes with this commit: the Linux VM is still able to bring up the CPUs with >255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still can not go to such CPUs, and the kdump kernel still can not work properly on such CPUs.
[ tglx: Updated comment as suggested by David ]

Signed-off-by: Dexuan Cui
Signed-off-by: Thomas Gleixner
Acked-by: David Woodhouse
Link: https://lore.kernel.org/r/20201103011136.59108-1-de...@microsoft.com
---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++-
 arch/x86/kernel/cpu/mshyperv.c     | 29 +-
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8..6bf42ae 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS		0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES		0x4000000A

+#define HYPERV_CPUID_VIRT_STACK_INTERFACE	0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE	0x31235356 /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES	0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE	BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT		0x80000000
 #define HYPERV_CPUID_MIN			0x40000005
 #define HYPERV_CPUID_MAX			0x4000ffff

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4..f628e3d 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,38 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }

+static bool __init ms_hyperv_x2apic_available(void)
+{
+	return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: for a VM on Hyper-V, the I/O-APIC is the only device which
+ * (logically) generates MSIs directly to the system APIC irq domain.
+ * There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
+ * pci-hyperv host bridge.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+	u32 eax;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+	if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+		return false;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+	return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
 	.name			= "Microsoft Hyper-V",
 	.detect			= ms_hyperv_platform,
 	.type			= X86_HYPER_MS_HYPERV,
+	.init.x2apic_available	= ms_hyperv_x2apic_available,
+	.init.msi_ext_dest_id	= ms_hyperv_msi_ext_dest_id,
 	.init.init_platform	= ms_hyperv_init_platform,
 };
RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
> From: David Woodhouse
> Sent: Tuesday, November 3, 2020 12:03 AM
> > +/*
> > + * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
> > + * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
> > + * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
> > + *
> > + * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
> > + * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
> > + * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
> > + * the word "msi" in the name "msi_ext_dest_id", actually the callback only
> > + * affects how IOAPIC interrupts are routed.
> > + */
>
> I named it like that on purpose to make the point that the I/OAPIC is
> just a device for turning line interrupts into MSIs. Some VMMs, just
> like real hardware, really do implement their I/OAPIC emulation that
> way. It makes a lot of sense to do so if you support interrupt
> remapping.

I totally agree.

> FWIW I might have phrased your last paragraph in that comment as
>
>     Note: for a VM on Hyper-V, the I/OAPIC is the only device which
>     (logically) generates MSIs directly to the system APIC irq domain.
>     There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
>     pci-hyperv host bridge.

I agree. This version is much better.

> But don't bother to change it; I think I've made my point quite well
> enough with https://git.kernel.org/tip/tip/c/5d5a97133 :)
>
> --
> dwmw2

Hi David,
This patch has been in the x86/apic branch (with a line missing in the commit log). If possible, I hope tglx can help make this change you suggested, and add the missing line in the commit log. :-)

Thanks,
-- Dexuan
RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
> From: Dexuan Cui
> Sent: Monday, November 2, 2020 5:12 PM
> > ...

Hi tglx,
Now this patch is in the x86/apic branch, which is great! Thanks for the quick action! But the third line of the below paragraph of the commit log is missing... Sorry, I just realized I should not have prefixed that line with the ">255 APIC IDs" -- it looks like a line is ignored if it starts with the 2 chars ">>". :-(

> On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
> changes with this commit: Linux VM is still able to bring up the CPUs with
> >255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
> can not go to such CPUs, and the kdump kernel still can not work properly
> on such CPUs.
[tip: x86/apic] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
The following commit has been merged into the x86/apic branch of tip:

Commit-ID:     af2abc92c5ddf5fc5a2036bc106c4d9a80a4d5f7
Gitweb:        https://git.kernel.org/tip/af2abc92c5ddf5fc5a2036bc106c4d9a80a4d5f7
Author:        Dexuan Cui
AuthorDate:    Mon, 02 Nov 2020 17:11:36 -08:00
Committer:     Thomas Gleixner
CommitterDate: Tue, 03 Nov 2020 09:16:46 +01:00

x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs, the CPUs can't be the destination of IOAPIC interrupts, because the IOAPIC RTE's Dest Field has only 8 bits. Currently the hacky driver drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are only routed to CPUs that don't have >255 APIC IDs. However, there is an issue with kdump, because the kdump kernel can run on any CPU, and hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which was introduced recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the support of the underlying hypervisor. The latest Hyper-V has added the support recently: with this commit, on such a Hyper-V host, the Linux VM does not use hyperv-iommu.c because hyperv_prepare_irq_remapping() returns -ENODEV; instead, the Linux kernel's generic support of the Extended Dest ID from David is used, meaning that the Linux VM is able to support up to 32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing changes with this commit: Linux VM is still able to bring up the CPUs with can not go to such CPUs, and the kdump kernel still can not work properly on such CPUs.
Signed-off-by: Dexuan Cui
Signed-off-by: Thomas Gleixner
Acked-by: David Woodhouse
Link: https://lore.kernel.org/r/20201103011136.59108-1-de...@microsoft.com
---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++-
 arch/x86/kernel/cpu/mshyperv.c     | 30 +-
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8..6bf42ae 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS		0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES		0x4000000A

+#define HYPERV_CPUID_VIRT_STACK_INTERFACE	0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE	0x31235356 /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES	0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE	BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT		0x80000000
 #define HYPERV_CPUID_MIN			0x40000005
 #define HYPERV_CPUID_MAX			0x4000ffff

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4..cc4037d 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,39 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }

+static bool __init ms_hyperv_x2apic_available(void)
+{
+	return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
+ * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
+ * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
+ * the word "msi" in the name "msi_ext_dest_id", actually the callback only
+ * affects how IOAPIC interrupts are routed.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+	u32 eax;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+	if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+		return false;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+	return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
 	.name			= "Microsoft Hyper-V",
 	.detect			= ms_hyperv_platform,
 	.type			= X86_HYPER_MS_HYPERV,
+	.init.x2apic_available	= ms_hyperv_x2apic_available,
+	.init.msi_ext_dest_id	= ms_hyperv_msi_ext_dest_id,
 	.init.init_platform	= ms_hyperv_init_platform,
 };
RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
> From: Dexuan Cui > Sent: Monday, November 2, 2020 5:12 PM Sorry I forgot to mention that this patch is based on tip.git's x86/apic branch. Thanks, -- Dexuan
[PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs, the CPUs can't be the destination of IOAPIC interrupts, because the IOAPIC RTE's Dest Field has only 8 bits. Currently the hacky driver drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are only routed to CPUs that don't have >255 APIC IDs. However, there is an issue with kdump, because the kdump kernel can run on any CPU, and hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which was introduced recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the support of the underlying hypervisor. The latest Hyper-V has added the support recently: with this commit, on such a Hyper-V host, the Linux VM does not use hyperv-iommu.c because hyperv_prepare_irq_remapping() returns -ENODEV; instead, the Linux kernel's generic support of the Extended Dest ID from David is used, meaning that the Linux VM is able to support up to 32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing changes with this commit: the Linux VM is still able to bring up the CPUs with >255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still can not go to such CPUs, and the kdump kernel still can not work properly on such CPUs.
Signed-off-by: Dexuan Cui
---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++
 arch/x86/kernel/cpu/mshyperv.c     | 30 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8bba9e..6bf42aed387e 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS		0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES		0x4000000A

+#define HYPERV_CPUID_VIRT_STACK_INTERFACE	0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE	0x31235356 /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES	0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE	BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT		0x80000000
 #define HYPERV_CPUID_MIN			0x40000005
 #define HYPERV_CPUID_MAX			0x4000ffff

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4550cb..cc4037d841df 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,39 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }

+static bool __init ms_hyperv_x2apic_available(void)
+{
+	return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
+ * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
+ * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
+ * the word "msi" in the name "msi_ext_dest_id", actually the callback only
+ * affects how IOAPIC interrupts are routed.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+	u32 eax;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+	if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+		return false;
+
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+	return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
 	.name			= "Microsoft Hyper-V",
 	.detect			= ms_hyperv_platform,
 	.type			= X86_HYPER_MS_HYPERV,
+	.init.x2apic_available	= ms_hyperv_x2apic_available,
+	.init.msi_ext_dest_id	= ms_hyperv_msi_ext_dest_id,
 	.init.init_platform	= ms_hyperv_init_platform,
 };
--
2.25.1
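To make the "up to 32K CPUs" figure concrete: with the extended RTE format, the destination APIC ID is carried in 15 bits -- the legacy 8-bit Dest Field plus the 7-bit virt_destid_8_14 field (APIC ID bits 8-14). A small sketch of how the two fields compose; the helper name is made up for illustration and is not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Compose a 15-bit destination APIC ID from the legacy 8-bit Dest Field
 * and the 7-bit virt_destid_8_14 field of the extended IOAPIC RTE. */
static uint32_t compose_dest_apic_id(uint8_t dest_low8, uint8_t virt_destid_8_14)
{
    return ((uint32_t)(virt_destid_8_14 & 0x7f) << 8) | dest_low8;
}
```

With all 15 bits usable, APIC IDs 0..32767 are addressable, which is where the 32K-CPU limit in the commit log comes from.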
RE: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
> From: Thomas Gleixner
> Sent: Tuesday, October 6, 2020 11:58 AM
> > ...
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt is not
> > happening while the first interrupt does work fine.
> >
> >         int nr_irqs = 2;
> >         int i, nvec, irq;
> >
> >         nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> >                         PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
>
> Why should it return an error?

The above code returns -ENOSPC if num_possible_cpus() is also 1, and returns 0 if num_possible_cpus() is 128. So it looks like the above code is not using the API correctly, and hence gets undefined results.

> >         for (i = 0; i < nvec; i++) {
> >                 irq = pci_irq_vector(pdev, i);
> >                 err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> >         }
>
> And why do you expect that the second interrupt works?
>
> This is about managed interrupts and the spreading code has two vectors
> to which it can spread the interrupts. One is assigned to one half of
> the possible CPUs and the other one to the other half. Now you have only
> one CPU online so only the interrupt which has the online CPU in the
> assigned affinity mask is started up.
>
> That's how managed interrupts work. If you don't want managed interrupts
> then don't use them.
>
> Thanks,
>
>         tglx

Thanks for the clarification! It looks like with PCI_IRQ_AFFINITY the kernel guarantees that the allocated interrupts are 1:1 bound to CPUs, and that userspace is unable to change the affinities. This is very useful for supporting per-CPU I/O queues.

Thanks,
-- Dexuan
irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
Hi all,
I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7 and I have CONFIG_NR_CPUS=256.

The Hyper-V host (version 17763-10.0-1-0.1457) provides a guest firmware which always reports 128 Local APIC entries in the ACPI MADT table. Here only the first Local APIC entry's "Processor Enabled" is 1, since this Linux VM is configured to have only 1 CPU. This means: in the Linux kernel, "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0), while "cpu_possible_mask" has 128 CPUs, and "nr_cpu_ids" is 128.

I pass through an MSI-X-capable PCI device to the Linux VM (which has only 1 virtual CPU), and the below code does *not* report any error (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq() returns 0), but the code does not work: the second MSI-X interrupt is not happening while the first interrupt does work fine.

        int nr_irqs = 2;
        int i, nvec, irq;

        nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);

        for (i = 0; i < nvec; i++) {
                irq = pci_irq_vector(pdev, i);
                err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
        }

It turns out that pci_alloc_irq_vectors_affinity() -> ... -> irq_create_affinity_masks() allocates an improper affinity for the second interrupt. The below printk() shows that the second interrupt's affinity is 1-64, but only CPU0 is present in the system! As a result, later, request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) is empty (i.e. >= nr_cpu_ids), and irq_startup() *silently* fails (i.e. "return 0;"), since __irq_startup() is only called for IRQ_STARTUP_MANAGED and IRQ_STARTUP_NORMAL.
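The abort condition described above can be sketched as a toy userspace model (illustrative code, not the kernel's): with only CPU0 online, a managed vector whose affinity mask contains no online CPU is never started, and nothing reports the failure. The masks used here match the "0,65-127" and "1-64" affinities this VM ends up with.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 128

/* Models __irq_startup_managed(): a managed vector is started only if its
 * affinity mask intersects the online mask; otherwise startup is aborted
 * (IRQ_STARTUP_ABORT) and irq_startup() silently returns 0. */
static bool vector_can_start(const bool online[NR_CPUS], const bool mask[NR_CPUS])
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (online[cpu] && mask[cpu])
            return true;
    return false;
}
```

In this model, the vector spread over CPUs 1-64 can never start while only CPU0 is online, which is exactly the silent failure the second MSI-X interrupt hits.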
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -484,6 +484,9 @@ struct irq_affinity_desc *
 	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
 		masks[i].is_managed = 1;

+	for (i = 0; i < nvecs; i++)
+		printk("i=%d, affi = %*pbl\n", i,
+		       cpumask_pr_args(&masks[i].mask));
 	return masks;
 }

[   43.770477] i=0, affi = 0,65-127
[   43.770484] i=1, affi = 1-64

Though here the issue happens to a Linux VM on Hyper-V, I think the same issue can also happen on a physical machine, if the physical machine also uses a lot of static MADT entries, of which only the entries of the present CPUs are marked "Processor Enabled == 1".

I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() -> irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should use cpu_present_mask rather than cpu_possible_mask, so here irq_calc_affinity_vectors() would return 1, and __pci_enable_msix_range() would immediately return -ENOSPC to avoid a *silent* failure. However, git-log shows that this 2018 commit intentionally changed the cpu_present_mask to cpu_possible_mask:

84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")

so I'm not sure whether (and how?) we should address the *silent* failure.

BTW, here I use a single-CPU VM to simplify the discussion. Actually, if the VM has n CPUs, with the above usage of pci_alloc_irq_vectors_affinity() (which might seem incorrect, but my point is that it's really not good to have a silent failure, which makes it a lot more difficult to figure out what goes wrong), it looks like only the first n MSI-X interrupts can work, and the (n+1)'th MSI-X interrupt can not work due to the allocated improper affinity.

According to my tests, if we need n+1 MSI-X interrupts in such a VM that has n CPUs, it looks like we have 2 options (the second should be better):

1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
        pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);

2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd, which tells the API that we don't care about the first interrupt's affinity:

        struct irq_affinity affd = {
                .pre_vectors = 1,
                ...
        };

        pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

PS, irq_create_affinity_masks() is complicated. Let me know if you're interested to know how it allocates the invalid affinity "1-64" for the second MSI-X interrupt.

PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM, so the issue described above can not reproduce there.

Thanks,
-- Dexuan
[PATCH v3] PCI: hv: Fix hibernation in case interrupts are not re-created
pci_restore_msi_state() directly writes the MSI/MSI-X related registers via MMIO. On a physical machine, this works perfectly; for a Linux VM running on a hypervisor, which typically enables IOMMU interrupt remapping, the hypervisor usually should trap and emulate the MMIO accesses in order to re-create the necessary interrupt remapping table entries in the IOMMU, otherwise the interrupts can not work in the VM after hibernation.

Hyper-V is different from other hypervisors in that it does not trap and emulate the MMIO accesses; instead it uses a para-virtualized method, which requires the VM to call hv_compose_msi_msg() to notify the hypervisor of the info that would be passed to the hypervisor in the case of the trap-and-emulate method. This is not an issue for a lot of PCI device drivers, which destroy and re-create the interrupts across hibernation, so hv_compose_msi_msg() is called automatically. However, some PCI device drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(), otherwise the PCI device drivers can no longer receive interrupts after the VM resumes from hibernation.

Hyper-V is also different in that chip->irq_unmask() may fail in a Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask() can not fail because unmasking an MSI/MSI-X register just means an MMIO write): during hibernation, when a CPU is offlined, the kernel tries to move the interrupt to the remaining CPUs that haven't been offlined yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails because the vmbus channel has been closed: here the early "return" in hv_irq_unmask() means that pci_msi_unmask_irq() is not called, i.e. desc->masked remains "true", so later after hibernation, the MSI interrupt always remains masked, which is incorrect.
Refer to cpu_disable_common() -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():

static bool migrate_one_irq(struct irq_desc *desc)
{
	...
	if (maskchip && chip->irq_mask)
		chip->irq_mask(d);
	...
	err = irq_do_set_affinity(d, affinity, false);
	...
	if (maskchip && chip->irq_unmask)
		chip->irq_unmask(d);

Fix the issue by calling pci_msi_unmask_irq() unconditionally in hv_irq_unmask(). Also suppress the error message for hibernation, because the hypercall failure during hibernation does not matter (at this time all the devices have been frozen).

Note: the correct affinity info is still updated into the irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM resumes, hv_pci_restore_msi_state() is able to correctly restore the interrupt with the correct affinity.

Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Reviewed-by: Jake Oshins
Signed-off-by: Dexuan Cui
---
Changes in v2:
	Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
	Added Jake's Reviewed-by.

Changes in v3:
	Improved the commit log.
	Improved the comments.
	Improved the change in hv_irq_unmask(): removed the early "return"
	and call pci_msi_unmask_irq() unconditionally.

 drivers/pci/controller/pci-hyperv.c | 50 +++--
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..a9df492fbffa 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1276,11 +1276,25 @@ static void hv_irq_unmask(struct irq_data *data)
 exit_unlock:
 	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);

-	if (res) {
+	/*
+	 * During hibernation, when a CPU is offlined, the kernel tries
+	 * to move the interrupt to the remaining CPUs that haven't
+	 * been offlined yet. In this case, the below hv_do_hypercall()
+	 * always fails since the vmbus channel has been closed:
+	 * refer to cpu_disable_common() -> fixup_irqs() ->
+	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+	 *
+	 * Suppress the error message for hibernation because the failure
+	 * during hibernation does not matter (at this time all the devices
+	 * have been frozen). Note: the correct affinity info is still updated
+	 * into the irqdata data structure in migrate_one_irq() ->
+	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
+	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
+	 * the interrupt with the correct affinity.
+	 */
+	if (res && hbus->state != hv_pcibus_removing)
 		dev_err(&hbus->hdev->device,
RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created
> From: Lorenzo Pieralisi > Sent: Thursday, October 1, 2020 3:13 AM > > ... > > I mean this is a Hyper-V specific problem, so IMO we should fix the > > pci-hyperv driver rather than change the PCI device drivers, which > > work perfectly on a physical machine and on other hypervisors. > > Also it can be difficult or impossible to ask the authors of the > > aforementioned PCI device drivers to destroy and re-create > > MSI/MSI-X across hibernation, especially for the out-of-tree driver(s). > > Good, so why did you mention PCI drivers in the commit log if they > are not related to the problem you are fixing ? I mentioned the names of the PCI device drivers because IMO people want to know how the issue can be reproduced (i.e. which PCI devices are affected and which are not), so they know how to test this patch. I'll remove the names of the unaffected PCI device drivers from the commit log, and only keep the name of the Nvidia GPU drivers (which are so far the only drivers I have identified that are affected, when a Linux VM runs on Hyper-V and hibernates). > > > Regardless, this commit log does not provide the information that > > > it should. > > > > Hi Lozenzo, I'm happy to add more info. Can you please let me know > > what extra info I should provide? > > s/Lozenzo/Lorenzo Sorry! Will fix the typo. > The info you describe properly below, namely what the _actual_ problem > is. I will send v3 with the below info. > > Here if hv_irq_unmask does not call pci_msi_unmask_irq(), the > > desc->masked remains "true", so later after hibernation, the MSI > > interrupt line always remains masked, which is incorrect. > > > > Here the silent failure of hv_irq_unmask() does not matter since all the > > non-boot CPUs are being offlined (meaning all the devices have been > > frozen). 
Note: the correct affinity info is still updated into the > > irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() -> > > hv_set_affinity(), so when the VM resumes, hv_pci_resume() -> > > hv_pci_restore_msi_state() is able to correctly restore the irqs with > > the correct affinity. > > > > I hope the explanation can help clarify things. I understand this is > > not as natural as the case where Linux runs on a physical machine, but > > due to the unique PCI pass-through implementation of Hyper-V, IMO this > > is the only viable fix for the problem here. BTW, this patch is only > > confined to the pci-hyperv driver and I believe it cannot cause any > > regression. > > Understood, write this in the commit log and I won't nag you any further. Ok. I treat it as an opportunity to practise and improve my writing :-) > Side note: this issue is there because the hypcall failure triggers > an early return from hv_irq_unmask(). Yes. > Is that early return really correct ? Good question. IMO it's incorrect, because hv_irq_unmask() is called when the interrupt is activated for the first time, and when the interrupt's affinity is being changed. In these cases, we may as well call pci_msi_unmask_irq() unconditionally, even if the hypercall fails. BTW, AFAIK, in practice the hypercall only fails in 2 cases: 1. The device is removed when the Linux VM has not finished the device's initialization. 2. In hibernation, the device has been disabled while the generic hibernation code tries to migrate the interrupt, as I explained. In the 2 cases, the hypercall returns the same error code HV_STATUS_INVALID_PARAMETER(0x5). > Another possibility is just logging the error and let > hv_irq_unmask() continue and call pci_msi_unmask_irq() in the exit > path. This is a good idea. I'll make this change in v3. > Is there a hypcall return value that you can use to detect fatal vs > non-fatal (ie hibernation) hypcall failures ? Unluckily IMO there is not. 
The spec (v6.0b)'s section 10.5.4 (page 106) https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs does define some return values, but IMO they're not applicable here. > I was confused by reading the patch since it seemed that you call > pci_msi_unmask_irq() _only_ while hibernating, which was certainly > a bug. > > Thank you for explaining. > > Lorenzo Thanks for reviewing! I'll post v3. Looking forward to your new comments! Thanks, -- Dexuan
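The v3 behavior agreed on in the thread above — log the hypercall failure only when it is unexpected, and clear the software mask bit on every path — can be sketched as a tiny userspace model. This is a hypothetical simplification for illustration (the struct, enum, and function names here are invented); the real code lives in hv_irq_unmask() and uses hv_do_hypercall() and pci_msi_unmask_irq():

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the v3 hv_irq_unmask() logic. */
enum bus_state { BUS_INSTALLED, BUS_REMOVING };

struct irq_model {
	bool masked;          /* models desc->masked */
	enum bus_state state; /* models hbus->state */
};

/* Stub for the retarget-interrupt hypercall: it fails once the
 * vmbus channel has been closed (e.g. during hibernation). */
static int do_hypercall(const struct irq_model *m)
{
	return m->state == BUS_REMOVING ? -1 : 0;
}

static void irq_unmask(struct irq_model *m)
{
	int res = do_hypercall(m);

	/* Log only when the failure is unexpected; during removal
	 * (hibernation) the failure is benign. */
	if (res && m->state != BUS_REMOVING)
		fprintf(stderr, "unmask hypercall failed: %d\n", res);

	/* Unconditionally clear the mask, mirroring the fix that calls
	 * pci_msi_unmask_irq() on every path instead of returning early. */
	m->masked = false;
}
```

The key point of the model: even when do_hypercall() fails, `masked` is cleared, so the interrupt does not stay masked after resume.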
RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created
> From: Lorenzo Pieralisi > Sent: Monday, September 28, 2020 3:43 AM > > [+MarcZ - this patch needs IRQ maintainers vetting] Sure. Hi MarkZ, please also review the patch. Thanks! > On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote: > > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X > > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to > > create the IOMMU Interrupt Remapping Table Entries. This is not an issue > > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), > > which destroy and re-create the interrupts across hibernation, so > > hv_compose_msi_msg() is called automatically. However, some other > > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create > > the interrupts across hibernation, so hv_pci_resume() has to call > > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer > > receive MSI/MSI-X interrupts after hibernation. > > This looks like drivers bugs and I don't think the HV controller > driver is where you should fix them. IMHO this is not a PCI device driver bug, because I think a PCI device driver is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation, otherwise we would not have pci_restore_msi_state() in pci_restore_state(). The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such a PCI device driver that re-uses the MSI-X interrupts across hibernation. The out-of-tree proprietary Nvidia GPU driver also does the same thing. It looks some other in-tree PCI device drivers also do the same thing, though I don't remember their names offhand. IMO it's much better to change the pci-hyperv driver once and for all, than to change every such existing (and future?) PCI device driver. pci_restore_msi_state() directly writes the MSI/MSI-X related registers in __pci_write_msi_msg() and msix_mask_irq(). 
On a physical machine, this works perfectly, but for a Linux VM running on a hypervisor, which typically enables IOMMU interrupt remapping, the hypervisor usually should trap and emulate the write accesses to the MSI/MSI-X registers, so the hypervisor is able to create the necessary interrupt remapping table entries in the IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different from other hypervisors in that it does not trap and emulate the write accesses, and instead it uses a para-virtualized method, which requires the VM to call hv_compose_msi_msg() to notify the hypervisor of the info that would be passed to the hypervisor in the case of the trap-and-emulate method. I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv driver rather than change the PCI device drivers, which work perfectly on a physical machine and on other hypervisors. Also it can be difficult or impossible to ask the authors of the aforementioned PCI device drivers to destroy and re-create MSI/MSI-X across hibernation, especially for the out-of-tree driver(s). > Regardless, this commit log does not provide the information that > it should. Hi Lozenzo, I'm happy to add more info. Can you please let me know what extra info I should provide? > > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support") > > Signed-off-by: Dexuan Cui > > Reviewed-by: Jake Oshins > > > > --- > > > > Changes in v2: > > Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael! > > Added Jake's Reviewed-by. 
> > > > drivers/pci/controller/pci-hyperv.c | 44 + > > 1 file changed, 44 insertions(+) > > > > diff --git a/drivers/pci/controller/pci-hyperv.c > b/drivers/pci/controller/pci-hyperv.c > > index fc4c3a15e570..dd21afb5d62b 100644 > > --- a/drivers/pci/controller/pci-hyperv.c > > +++ b/drivers/pci/controller/pci-hyperv.c > > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data) > > pbus = pdev->bus; > > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata); > > > > + if (hbus->state == hv_pcibus_removing) { > > + /* > > +* During hibernation, when a CPU is offlined, the kernel tries > > +* to move the interrupt to the remaining CPUs that haven't > > +* been offlined yet. In this case, the below hv_do_hypercall() > > +* always fails since the vmbus channel has been closed, so we > > +* should not call the hypercall, but we still need > > +* pci_msi_unmask_irq() to reset the mask bit in desc->masked: > > +* see cpu_disable_common() -> fixup_irqs() -> > > +* irq_migrate_all_off_this_cpu() -> migrate_one_irq(). > > +
RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created
> From: Dexuan Cui > Sent: Tuesday, September 8, 2020 4:18 PM > > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers, > and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU > Interrupt Remapping Table Entries. This is not an issue for a lot of > PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which > destroy and re-create the interrupts across hibernation, so > hv_compose_msi_msg() is called automatically. However, some other PCI > device drivers (e.g. the Nvidia driver) may not destroy and re-create > the interrupts across hibernation, so hv_pci_resume() has to call > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer > receive MSI/MSI-X interrupts after hibernation. > > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support") > Signed-off-by: Dexuan Cui > Reviewed-by: Jake Oshins > > --- > > Changes in v2: > Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael! > Added Jake's Reviewed-by. > > drivers/pci/controller/pci-hyperv.c | 44 + > 1 file changed, 44 insertions(+) Hi Lorenzo, Bjorn, Can you please take a look at this patch? I hope it still could have a chance to be in 5.9. :-) Thanks, -- Dexuan
RE: [PATCH 1/1] Drivers: hv: vmbus: Add timeout to vmbus_wait_for_unload
> From: linux-hyperv-ow...@vger.kernel.org > On Behalf Of Michael Kelley > Sent: Sunday, September 13, 2020 12:47 PM > > vmbus_wait_for_unload() looks for a CHANNELMSG_UNLOAD_RESPONSE > message > coming from Hyper-V. But if the message isn't found for some reason, > the panic path gets hung forever. Add a timeout of 10 seconds to prevent > this. > > Fixes: 415719160de3 ("Drivers: hv: vmbus: avoid scheduling in interrupt > context in vmbus_initiate_unload()") > Signed-off-by: Michael Kelley Reviewed-by: Dexuan Cui
[PATCH net 1/2] hv_netvsc: Switch the data path at the right time during hibernation
When netvsc_resume() is called, the mlx5 VF NIC has not been resumed yet, so in the future the host might silently fail the call netvsc_vf_changed() -> netvsc_switch_datapath() there, even if the call works now.

Call netvsc_vf_changed() in the NETDEV_CHANGE event handler: at that time the mlx5 VF NIC has been resumed.

Fixes: 19162fd4063a ("hv_netvsc: Fix hibernation for mlx5 VF driver")
Signed-off-by: Dexuan Cui
---
 drivers/net/hyperv/netvsc_drv.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 81c5c70b616a..4a25886e2346 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2619,7 +2619,6 @@ static int netvsc_resume(struct hv_device *dev)
 	struct net_device *net = hv_get_drvdata(dev);
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device_info *device_info;
-	struct net_device *vf_netdev;
 	int ret;
 
 	rtnl_lock();
@@ -2632,15 +2631,6 @@ static int netvsc_resume(struct hv_device *dev)
 	netvsc_devinfo_put(device_info);
 	net_device_ctx->saved_netvsc_dev_info = NULL;
 
-	/* A NIC driver (e.g. mlx5) may keep the VF network interface across
-	 * hibernation, but here the data path is implicitly switched to the
-	 * netvsc NIC since the vmbus channel is closed and re-opened, so
-	 * netvsc_vf_changed() must be used to switch the data path to the VF.
-	 */
-	vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
-	if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
-		ret = -EINVAL;
-
 	rtnl_unlock();
 
 	return ret;
@@ -2701,6 +2691,7 @@ static int netvsc_netdev_event(struct notifier_block *this,
 		return netvsc_unregister_vf(event_dev);
 	case NETDEV_UP:
 	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
 		return netvsc_vf_changed(event_dev);
 	default:
 		return NOTIFY_DONE;
-- 
2.19.1
[PATCH net 2/2] hv_netvsc: Cache the current data path to avoid duplicate call and message
The previous change "hv_netvsc: Switch the data path at the right time during hibernation" adds the call of netvsc_vf_changed() upon NETDEV_CHANGE, so it's necessary to avoid the duplicate call and message when the VF is brought UP or DOWN. Signed-off-by: Dexuan Cui --- drivers/net/hyperv/hyperv_net.h | 3 +++ drivers/net/hyperv/netvsc_drv.c | 21 - 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 2181d4538ab7..ff33f27cdcd3 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -974,6 +974,9 @@ struct net_device_context { /* Serial number of the VF to team with */ u32 vf_serial; + /* Is the current data path through the VF NIC? */ + bool data_path_is_vf; + /* Used to temporarily save the config info across hibernation */ struct netvsc_device_info *saved_netvsc_dev_info; }; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 4a25886e2346..b7db3766f5b9 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -2366,7 +2366,16 @@ static int netvsc_register_vf(struct net_device *vf_netdev) return NOTIFY_OK; } -/* VF up/down change detected, schedule to change data path */ +/* Change the data path when VF UP/DOWN/CHANGE are detected. + * + * Typically a UP or DOWN event is followed by a CHANGE event, so + * net_device_ctx->data_path_is_vf is used to cache the current data path + * to avoid the duplicate call of netvsc_switch_datapath() and the duplicate + * message. + * + * During hibernation, if a VF NIC driver (e.g. mlx5) preserves the network + * interface, there is only the CHANGE event and no UP or DOWN event. 
+ */ static int netvsc_vf_changed(struct net_device *vf_netdev) { struct net_device_context *net_device_ctx; @@ -2383,6 +2392,10 @@ static int netvsc_vf_changed(struct net_device *vf_netdev) if (!netvsc_dev) return NOTIFY_DONE; + if (net_device_ctx->data_path_is_vf == vf_is_up) + return NOTIFY_OK; + net_device_ctx->data_path_is_vf = vf_is_up; + netvsc_switch_datapath(ndev, vf_is_up); netdev_info(ndev, "Data path switched %s VF: %s\n", vf_is_up ? "to" : "from", vf_netdev->name); @@ -2624,6 +2637,12 @@ static int netvsc_resume(struct hv_device *dev) rtnl_lock(); net_device_ctx = netdev_priv(net); + + /* Reset the data path to the netvsc NIC before re-opening the vmbus +* channel. Later netvsc_netdev_event() will switch the data path to +* the VF upon the UP or CHANGE event. +*/ + net_device_ctx->data_path_is_vf = false; device_info = net_device_ctx->saved_netvsc_dev_info; ret = netvsc_attach(net, device_info); -- 2.19.1
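The caching logic this patch adds can be modeled in a few lines of userspace C. This is a hypothetical sketch for illustration (the struct and function names are invented); the real code stores the flag in struct net_device_context and performs the switch via netvsc_switch_datapath():

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the data_path_is_vf cache: a request that
 * matches the cached state is skipped, so an UP event followed by a
 * CHANGE event performs only one real switch and logs one message. */
struct netvsc_model {
	bool data_path_is_vf; /* models net_device_ctx->data_path_is_vf */
	int switch_count;     /* how many real switches were performed */
};

static void vf_changed(struct netvsc_model *m, bool vf_is_up)
{
	if (m->data_path_is_vf == vf_is_up)
		return; /* duplicate event: nothing to do */

	m->data_path_is_vf = vf_is_up;
	m->switch_count++;
	printf("Data path switched %s VF\n", vf_is_up ? "to" : "from");
}
```

In this model, calling vf_changed() twice with the same state (as happens when a CHANGE event follows an UP event) performs only one switch, which is exactly the duplicate-message suppression the commit message describes.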
[PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created
Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers, and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU Interrupt Remapping Table Entries. This is not an issue for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which destroy and re-create the interrupts across hibernation, so hv_compose_msi_msg() is called automatically. However, some other PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create the interrupts across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(), otherwise the PCI device drivers can no longer receive MSI/MSI-X interrupts after hibernation. Fixes: ac82fc832708 ("PCI: hv: Add hibernation support") Signed-off-by: Dexuan Cui Reviewed-by: Jake Oshins --- Changes in v2: Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael! Added Jake's Reviewed-by. drivers/pci/controller/pci-hyperv.c | 44 + 1 file changed, 44 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index fc4c3a15e570..dd21afb5d62b 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data) pbus = pdev->bus; hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata); + if (hbus->state == hv_pcibus_removing) { + /* +* During hibernation, when a CPU is offlined, the kernel tries +* to move the interrupt to the remaining CPUs that haven't +* been offlined yet. In this case, the below hv_do_hypercall() +* always fails since the vmbus channel has been closed, so we +* should not call the hypercall, but we still need +* pci_msi_unmask_irq() to reset the mask bit in desc->masked: +* see cpu_disable_common() -> fixup_irqs() -> +* irq_migrate_all_off_this_cpu() -> migrate_one_irq(). 
+*/ + pci_msi_unmask_irq(data); + return; + } + spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags); params = &hbus->retarget_msi_interrupt_params; @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev) return 0; } +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg) +{ + struct msi_desc *entry; + struct irq_data *irq_data; + + for_each_pci_msi_entry(entry, pdev) { + irq_data = irq_get_irq_data(entry->irq); + if (WARN_ON_ONCE(!irq_data)) + return -EINVAL; + + hv_compose_msi_msg(irq_data, &entry->msg); + } + + return 0; +} + +/* + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg() + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and + * emulate the accesses, we have to call hv_compose_msi_msg() to ask + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries. + */ +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus) +{ + pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL); +} + static int hv_pci_resume(struct hv_device *hdev) { struct hv_pcibus_device *hbus = hv_get_drvdata(hdev); @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev) prepopulate_bars(hbus); + hv_pci_restore_msi_state(hbus); + hbus->state = hv_pcibus_installed; return 0; out: -- 2.19.1
RE: [PATCH net v2] hv_netvsc: Fix hibernation for mlx5 VF driver
> From: Michael Kelley > Sent: Tuesday, September 8, 2020 1:49 PM > > @@ -2635,6 +2632,15 @@ static int netvsc_resume(struct hv_device *dev) > > netvsc_devinfo_put(device_info); > > net_device_ctx->saved_netvsc_dev_info = NULL; > > > > + /* A NIC driver (e.g. mlx5) may keep the VF network interface across > > +* hibernation, but here the data path is implicitly switched to the > > +* netvsc NIC since the vmbus channel is closed and re-opened, so > > +* netvsc_vf_changed() must be used to switch the data path to the VF. > > +*/ > > + vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev); > > + if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK) > > + ret = -EINVAL; > > + > > I'm a little late looking at this code. But a question: Is it possible for > netvsc_resume() to be called before the VF driver's resume function > is called? Yes, actually this is what happens 100% of the time. :-) Upon suspend: 1. the mlx5 driver suspends the VF NIC. 2. the pci-hyperv suspends the VF vmbus device, including closing the channel. 3. hv_netvsc suspends the netvsc vmbus device, including closing the channel. Note: between 1) and 3), the data path is still the mlx5 VF, but obviously the VF NIC is not working. IMO this is not an issue in practice, since typically we don't really expect this to work once the suspending starts. Upon resume: 1. hv_netvsc resumes the netvsc vmbus device, including opening the channel. At this time, the data path should be the netvsc NIC since we close and re-open the netvsc vmbus channel, and I believe the default data path is netvsc. With this patch, the data path is switched to the VF NIC in netvsc_resume() because "netif_running(vf_netdev)" is true for the mlx5 VF NIC (CX-4), though the VF NIC device is not resumed back yet. According to my test, I believe this switching succeeds. 
Note: when the host receives the VM's NVSP_MSG4_TYPE_SWITCH_DATA_PATH request, it looks like the host doesn't check if the VF vmbus device and the VF PCI device are really "activated"[1], and it looks like the host simply programs the FPGA GFT for the newly-requested data path, and the host doesn't send a response message to the VM, indicating if the switching is a success or a failure. So, at this time, any incoming Ethernet packets (except the broadcast/multicast and TCP SYN packets, which always go through the netvsc NIC on Azure) cannot be received by the VF NIC, which has not been resumed yet. IMO this is not an issue in practice, since typically we don't really expect this to work before the resuming is fully finished. BTW, at this time the userspace is not thawed yet, so no application can transmit packets. 2. pci-hyperv resumes the VF vmbus device, including opening the channel. 3. the mlx5 driver resumes the VF NIC, and now everything is back to normal. [1] In the future, if the host starts to check if the VF vmbus/PCI devices are activated upon the receipt of the VM's NVSP_MSG4_TYPE_SWITCH_DATA_PATH request, and refuses to switch the data path if they're not activated, then we'll be in trouble, but even if that happens, this patch itself doesn't make the situation worse. So ideally we need a mechanism to switch the data path to the mlx5 VF NIC *after* the mlx5 NIC is resumed. Unluckily it looks like there is not a standard notification mechanism for this since the mlx5 driver preserves the VF network interface. I'll discuss with the Mellanox developer who implemented mlx5 hibernation support, and probably mlx5 should also destroy/re-create the VF network interface across hibernation, just as the mlx4 driver does. > If so, is it possible for netvsc_vf_changed() to find that the VF > is not up, and hence to switch the data path away from the VF instead of > to the VF? 
> > Michael When we are in netvsc_resume(), in my test "netif_running(vf_netdev)" is always true for the mlx5 VF NIC (CX-4), so netvsc_vf_changed() should switch the data path to the VF. static inline bool netif_running(const struct net_device *dev) { return test_bit(__LINK_STATE_START, &dev->state); } The flag __LINK_STATE_START is only cleared when the NIC is brought "DOWN", and that case is already automatically handled by netvsc_netdev_event(). For mlx4 (CX-3), net_device_ctx->vf_netdev is NULL in netvsc_resume(), so the CX-3 VF NIC is not affected by this patch. Thanks, -- Dexuan
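The netif_running() behavior described above can be modeled with a plain bit flag. This is a hypothetical userspace sketch (the names here are invented); the real kernel uses test_bit() and clear_bit() on the __LINK_STATE_START bit of dev->state, and the point is that the bit survives a suspend/resume cycle because only an administrative DOWN clears it:

```c
#include <stdbool.h>

/* Hypothetical model of the __LINK_STATE_START flag. */
#define LINK_STATE_START (1UL << 0)

struct netdev_model {
	unsigned long state; /* models dev->state */
};

/* Models netif_running(): just tests the bit. */
static bool netif_running_model(const struct netdev_model *dev)
{
	return dev->state & LINK_STATE_START;
}

/* Models bringing the interface administratively DOWN, the only
 * operation in this model that clears the bit. Hibernation does
 * not touch dev->state when the driver keeps the interface. */
static void netif_down_model(struct netdev_model *dev)
{
	dev->state &= ~LINK_STATE_START;
}
```

This matches the observation in the reply: for a driver like mlx5 that preserves the VF interface across hibernation, netif_running() is still true in netvsc_resume(), so netvsc_vf_changed() switches the data path to the VF.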
RE: [PATCH] PCI: hv: Fix hibernation in case interrupts are not re-created
> From: Michael Kelley > Sent: Tuesday, September 8, 2020 2:16 PM > > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data) > > pbus = pdev->bus; > > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata); > > > > + if (hbus->state == hv_pcibus_removing) { > > + /* > > +* During hibernatin, when a CPU is offlined, the kernel tries > > s/hibernatin/hibernation/ Thanks! I'll post a v2 shortly with this typo fixed, and with Jake's Reviewed-by. Thanks, -- Dexuan
[PATCH net v2] hv_netvsc: Fix hibernation for mlx5 VF driver
mlx5_suspend()/resume() keep the network interface, so during hibernation netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence netvsc_resume() should call netvsc_vf_changed() to switch the data path back to the VF after hibernation. Note: after we close and re-open the vmbus channel of the netvsc NIC in netvsc_suspend() and netvsc_resume(), the data path is implicitly switched to the netvsc NIC. Similarly, netvsc_suspend() should not call netvsc_unregister_vf(), otherwise the VF can no longer be used after hibernation. For mlx4, since the VF network interface is explicitly destroyed and re-created during hibernation (see mlx4_suspend()/resume()), hv_netvsc already explicitly switches the data path from and to the VF automatically via netvsc_register_vf() and netvsc_unregister_vf(), so mlx4 doesn't need this fix. Note: mlx4 can still work with the fix because in netvsc_suspend()/resume() ndev_ctx->vf_netdev is NULL for mlx4. Fixes: 0efeea5fb153 ("hv_netvsc: Add the support of hibernation") Signed-off-by: Dexuan Cui --- Changes in v2 (Thanks Jakub Kicinski ): Added comments in the changelog and the code about the implicit data path switching to the netvsc when we close/re-open the vmbus channels. Used reverse xmas tree ordering in netvsc_remove(). 
drivers/net/hyperv/netvsc_drv.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 64b0a74c1523..81c5c70b616a 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -2587,8 +2587,8 @@ static int netvsc_remove(struct hv_device *dev) static int netvsc_suspend(struct hv_device *dev) { struct net_device_context *ndev_ctx; - struct net_device *vf_netdev, *net; struct netvsc_device *nvdev; + struct net_device *net; int ret; net = hv_get_drvdata(dev); @@ -2604,10 +2604,6 @@ static int netvsc_suspend(struct hv_device *dev) goto out; } - vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev); - if (vf_netdev) - netvsc_unregister_vf(vf_netdev); - /* Save the current config info */ ndev_ctx->saved_netvsc_dev_info = netvsc_devinfo_get(nvdev); @@ -2623,6 +2619,7 @@ static int netvsc_resume(struct hv_device *dev) struct net_device *net = hv_get_drvdata(dev); struct net_device_context *net_device_ctx; struct netvsc_device_info *device_info; + struct net_device *vf_netdev; int ret; rtnl_lock(); @@ -2635,6 +2632,15 @@ static int netvsc_resume(struct hv_device *dev) netvsc_devinfo_put(device_info); net_device_ctx->saved_netvsc_dev_info = NULL; + /* A NIC driver (e.g. mlx5) may keep the VF network interface across +* hibernation, but here the data path is implicitly switched to the +* netvsc NIC since the vmbus channel is closed and re-opened, so +* netvsc_vf_changed() must be used to switch the data path to the VF. +*/ + vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev); + if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK) + ret = -EINVAL; + rtnl_unlock(); return ret; -- 2.19.1
RE: [RESEND][PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)
> From: Thomas Gleixner > Sent: Friday, July 17, 2020 6:03 AM > [...] Hi, I'm very sorry for this extremely late reply -- I was sidetracked by something else and I just had a chance to revisit the issue. Thank you tglx for the comments and suggestions, which I think are reasonable. I realized this patch is problematic. The good news is that the LAPIC emulation bug has been fixed in the latest version of Hyper-V and now kdump can work reliably. I think the hypervisor fix would be backported to old Hyper-V versions, so hopefully this won't be an issue over time. Thanks, -- Dexuan
RE: [PATCH net] hv_netvsc: Fix hibernation for mlx5 VF driver
> From: Jakub Kicinski > Sent: Saturday, September 5, 2020 4:27 PM > [...] > On Fri, 4 Sep 2020 19:52:18 -0700 Dexuan Cui wrote: > > mlx5_suspend()/resume() keep the network interface, so during hibernation > > netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence > > netvsc_resume() should call netvsc_vf_changed() to switch the data path > > back to the VF after hibernation. > > Does suspending the system automatically switch back to the synthetic > datapath? Yes. For mlx4, since the VF network interface is explicitly destroyed and re-created during hibernation (i.e. suspend + resume), hv_netvsc explicitly switches the data path from and to the VF. For mlx5, the VF network interface persists across hibernation, so there is no explicit switch-over, but after we close and re-open the vmbus channel of the netvsc NIC in netvsc_suspend() and netvsc_resume(), the data path is implicitly switched to the netvsc NIC, and with this patch netvsc_resume() -> netvsc_vf_changed() switches the data path back to the mlx5 NIC. > Please clarify this in the commit message and/or add a code > comment. I will add a comment in the commit message and the code. > > @@ -2587,7 +2587,7 @@ static int netvsc_remove(struct hv_device *dev) > > static int netvsc_suspend(struct hv_device *dev) > > { > > struct net_device_context *ndev_ctx; > > - struct net_device *vf_netdev, *net; > > + struct net_device *net; > > struct netvsc_device *nvdev; > > int ret; > > Please keep reverse xmas tree variable ordering. Will do. > > @@ -2635,6 +2632,10 @@ static int netvsc_resume(struct hv_device *dev) > > netvsc_devinfo_put(device_info); > > net_device_ctx->saved_netvsc_dev_info = NULL; > > > > + vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev); > > + if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK) > > + ret = -EINVAL; > > Should you perhaps remove the VF in case of the failure? 
IMO this failure actually should not happen since we're resuming the netvsc NIC, so we're sure we have a valid pointer to the netvsc net device, and netvsc_vf_changed() should be able to find the netvsc pointer and return NOTIFY_OK. In case of a failure, something really bad must be happening, and I'm not sure if it's safe to simply remove the VF, so I just return -EINVAL for simplicity, since I believe the failure should not happen in practice. I would rather keep the code as-is, but I'm OK to add a WARN_ON(1) if you think that's necessary. Thanks, -- Dexuan
[PATCH] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()
After we Stop and later Start a VM that uses Accelerated Networking (NIC SR-IOV), currently the VF vmbus device's Instance GUID can change, so after vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find the original vmbus channel of the VF, and hence we can't complete() vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(), and the VM hangs in vmbus_bus_resume() forever. Fix the issue by adding a timeout, so the resuming can still succeed, and the saved state is not lost, and according to my test, the user can disable Accelerated Networking and then will be able to SSH into the VM for further recovery. Also prevent the VM in question from suspending again. The host will be fixed so in future the Instance GUID will stay the same across hibernation. Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels") Signed-off-by: Dexuan Cui --- drivers/hv/vmbus_drv.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 910b6e90866c..946d0aba101f 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -2382,7 +2382,10 @@ static int vmbus_bus_suspend(struct device *dev) if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0) wait_for_completion(&vmbus_connection.ready_for_suspend_event); - WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0); + if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) { + pr_err("Can not suspend due to a previous failed resuming\n"); + return -EBUSY; + } mutex_lock(&vmbus_connection.channel_mutex); @@ -2456,7 +2459,9 @@ static int vmbus_bus_resume(struct device *dev) vmbus_request_offers(); - wait_for_completion(&vmbus_connection.ready_for_resume_event); + if (wait_for_completion_timeout( + &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0) + pr_err("Some vmbus device is missing after suspending?\n"); /* Reset the event for the next suspend. 
*/ reinit_completion(&vmbus_connection.ready_for_suspend_event); -- 2.19.1
[PATCH] PCI: hv: Fix hibernation in case interrupts are not re-created
Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers, and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU Interrupt Remapping Table Entries. This is not an issue for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which destroy and re-create the interrupts across hibernation, so hv_compose_msi_msg() is called automatically. However, some other PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create the interrupts across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(), otherwise the PCI device drivers can no longer receive MSI/MSI-X interrupts after hibernation. Fixes: ac82fc832708 ("PCI: hv: Add hibernation support") Cc: Jake Oshins Signed-off-by: Dexuan Cui --- drivers/pci/controller/pci-hyperv.c | 44 + 1 file changed, 44 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index fc4c3a15e570..abefff9a20e1 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data) pbus = pdev->bus; hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata); + if (hbus->state == hv_pcibus_removing) { + /* +* During hibernation, when a CPU is offlined, the kernel tries +* to move the interrupt to the remaining CPUs that haven't +* been offlined yet. In this case, the below hv_do_hypercall() +* always fails since the vmbus channel has been closed, so we +* should not call the hypercall, but we still need +* pci_msi_unmask_irq() to reset the mask bit in desc->masked: +* see cpu_disable_common() -> fixup_irqs() -> +* irq_migrate_all_off_this_cpu() -> migrate_one_irq(). 
+*/ + pci_msi_unmask_irq(data); + return; + } + spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags); params = &hbus->retarget_msi_interrupt_params; @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev) return 0; } +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg) +{ + struct msi_desc *entry; + struct irq_data *irq_data; + + for_each_pci_msi_entry(entry, pdev) { + irq_data = irq_get_irq_data(entry->irq); + if (WARN_ON_ONCE(!irq_data)) + return -EINVAL; + + hv_compose_msi_msg(irq_data, &entry->msg); + } + + return 0; +} + +/* + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg() + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and + * emulate the accesses, we have to call hv_compose_msi_msg() to ask + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries. + */ +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus) +{ + pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL); +} + static int hv_pci_resume(struct hv_device *hdev) { struct hv_pcibus_device *hbus = hv_get_drvdata(hdev); @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev) prepopulate_bars(hbus); + hv_pci_restore_msi_state(hbus); + hbus->state = hv_pcibus_installed; return 0; out: -- 2.19.1
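[Editor's note] The core of hv_pci_restore_msi_state() above is a walk over every MSI entry, re-running the compose step that Hyper-V would otherwise never see. A self-contained sketch of that walk (the structs and compose function are stand-ins, not the kernel's types):

```c
#include <stddef.h>

struct msi_entry {
	unsigned int irq;		/* 0 models missing irq_data */
	unsigned int composed_msg;	/* result of the compose step */
};

/* stand-in for hv_compose_msi_msg(): rebuild the routing data,
 * since a plain register re-write is invisible to the hypervisor */
static void compose_msg(struct msi_entry *e)
{
	e->composed_msg = 0x1000u + e->irq;
}

/* stand-in for the pci_walk_bus() + for_each_pci_msi_entry() walk:
 * bail out if an entry has no irq_data (the WARN_ON_ONCE path),
 * otherwise recompose every message */
int restore_msi_state(struct msi_entry *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (entries[i].irq == 0)
			return -1;
		compose_msg(&entries[i]);
	}
	return 0;
}
```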
[PATCH net] hv_netvsc: Fix hibernation for mlx5 VF driver
mlx5_suspend()/resume() keep the network interface, so during hibernation netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence netvsc_resume() should call netvsc_vf_changed() to switch the data path back to the VF after hibernation. Similarly, netvsc_suspend() should not call netvsc_unregister_vf(). BTW, mlx4_suspend()/resume() are different in that they destroy and re-create the network device, so netvsc_register_vf() and netvsc_unregister_vf() are automatically called. Note: mlx4 can also work with the changes here because in netvsc_suspend()/resume() ndev_ctx->vf_netdev is NULL for mlx4. Fixes: 0efeea5fb153 ("hv_netvsc: Add the support of hibernation") Signed-off-by: Dexuan Cui --- drivers/net/hyperv/netvsc_drv.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 64b0a74c1523..f896059a9588 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -2587,7 +2587,7 @@ static int netvsc_remove(struct hv_device *dev) static int netvsc_suspend(struct hv_device *dev) { struct net_device_context *ndev_ctx; - struct net_device *vf_netdev, *net; + struct net_device *net; struct netvsc_device *nvdev; int ret; @@ -2604,10 +2604,6 @@ static int netvsc_suspend(struct hv_device *dev) goto out; } - vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev); - if (vf_netdev) - netvsc_unregister_vf(vf_netdev); - /* Save the current config info */ ndev_ctx->saved_netvsc_dev_info = netvsc_devinfo_get(nvdev); @@ -2623,6 +2619,7 @@ static int netvsc_resume(struct hv_device *dev) struct net_device *net = hv_get_drvdata(dev); struct net_device_context *net_device_ctx; struct netvsc_device_info *device_info; + struct net_device *vf_netdev; int ret; rtnl_lock(); @@ -2635,6 +2632,10 @@ static int netvsc_resume(struct hv_device *dev) netvsc_devinfo_put(device_info); net_device_ctx->saved_netvsc_dev_info = NULL; + vf_netdev = 
rtnl_dereference(net_device_ctx->vf_netdev); + if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK) + ret = -EINVAL; + rtnl_unlock(); return ret; -- 2.19.1
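[Editor's note] The resume-path decision the patch adds can be summarized in a few lines: if a VF net device survived hibernation (the mlx5 case), switch the data path back to it; if the VF pointer is NULL (the mlx4 case, where the device is re-created later), there is nothing to do. A hypothetical sketch — the names are stand-ins, not the netvsc API:

```c
#include <stdbool.h>
#include <stddef.h>

#define NOTIFY_OK 1

struct vf_netdev { bool link_up; };

/* stand-in for netvsc_vf_changed(): succeeds only when the data
 * path can actually be switched to the VF */
static int vf_changed(struct vf_netdev *vf)
{
	return vf->link_up ? NOTIFY_OK : 0;
}

int resume_fixup_datapath(struct vf_netdev *vf)
{
	/* NULL vf: mlx4-style driver, netvsc_register_vf() will run
	 * later when the device is re-created -- nothing to do here */
	if (vf && vf_changed(vf) != NOTIFY_OK)
		return -22;	/* -EINVAL, mirroring the patch */
	return 0;
}
```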
[RESEND][PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)
parse_apic() allows the user to try a different APIC driver than the default one that's automatically chosen. It works for X86-32, but doesn't work for X86-64 because it was removed in 2009 for X86-64 by commit 7b38725318f4 ("x86: remove subarchitecture support code"), whose changelog doesn't explicitly describe the removal for X86-64. The patch adds back the functionality for X86-64. The intent is mainly to work around an APIC emulation bug in Hyper-V in the case of kdump: currently Hyper-V does not honor the disabled state of the local APICs, so all the IOAPIC-based interrupts may not be delivered to the correct virtual CPU, if the logical-mode APIC driver is used (the kdump kernel usually uses the logical-mode APIC driver, since typically only 1 CPU is active). Luckily the kdump issue can be worked around by forcing the kdump kernel to use physical mode, before the fix to Hyper-V becomes widely available. The current algorithm of choosing an APIC driver is: 1. The global pointer "struct apic *apic" has a default value, i.e "apic_default" on X86-32, and "apic_flat" on X86-64. 2. If the early_param "apic=" is specified, parse_apic() is called and the pointer "apic" is changed if a matching APIC driver is found. 3. default_acpi_madt_oem_check() calls the acpi_madt_oem_check() method of all APIC drivers, which may override the "apic" pointer. 4. default_setup_apic_routing() may override the "apic" pointer, e.g. by calling the probe() method of all APIC drivers. Note: refer to the order of the APIC drivers specified in arch/x86/kernel/apic/Makefile. The patch is safe because if the apic= early param is not specified, the current algorithm of choosing an APIC driver is unchanged; when the param is specified (e.g. on X86-64, "apic=physical flat"), the kernel still tries to find a "more suitable" APIC driver in the above step 3 and 4: e.g. 
if the BIOS/firmware requires that apic_x2apic_phys should be used, the above step 4 will override the APIC driver to apic_x2apic_phys, even if an early_param "apic=physical flat" is specified. On Hyper-V, when a Linux VM has <= 8 virtual CPUs, if we use "apic=physical flat", sending IPIs to multiple vCPUs is still fast because Linux VM uses the para-virtualized IPI hypercalls: see hv_apic_init(). The patch adds the __init tag for flat_acpi_madt_oem_check() and physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1": flat_acpi_madt_oem_check() accesses cmdline_apic, which has a __initdata tag. Fixes: 7b38725318f4 ("x86: remove subarchitecture support code") Signed-off-by: Dexuan Cui --- Changes in v2 (5/28/2020): Updated Documentation/admin-guide/kernel-parameters.txt. [Randy Dunlap] Changed apic_set_verbosity(). Enhanced the changelog. Changes in v3 (5/31/2020): Added the __init tag for flat_acpi_madt_oem_check() and physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1". (Thanks to kbuild test robot ). Updated the changelog for the __init tag. Today is 6/26/2020 and this is just a RESEND of v3, which was posted on 5/31 (https://lkml.org/lkml/2020/5/31/198). .../admin-guide/kernel-parameters.txt | 11 +-- arch/x86/kernel/apic/apic.c | 11 +++ arch/x86/kernel/apic/apic_flat_64.c | 31 +-- 3 files changed, 40 insertions(+), 13 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7bc83f3d9bdf..c4503fff9348 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -341,10 +341,15 @@ Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - For X86-32, this can also be used to specify an APIC - driver name. + This can also be used to specify an APIC driver name. 
Format: apic=driver_name - Examples: apic=bigsmp + Examples: + On X86-32: apic=bigsmp + On X86-64: "apic=physical flat" + Note: the available driver names depend on the + architecture and the kernel config; the setting may + be overridden by the acpi_madt_oem_check() and probe() + methods of other APIC drivers. apic_extnmi=[APIC,X86] External NMI delivery setting Format: { bsp (default) | all | none } diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index e53dda210cd7..6f7d75b6358b 100644 --- a
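[Editor's note] The four-step driver-selection order described in the changelog above (default -> "apic=" early param -> acpi_madt_oem_check() -> probe()) can be sketched as a self-contained program. The driver names and claim logic here are invented for illustration; only the override order mirrors the kernel:

```c
#include <string.h>
#include <stddef.h>

struct apic_driver {
	const char *name;
	int (*madt_check)(void);	/* step 3: returns 1 to claim */
	int (*probe)(void);		/* step 4: returns 1 to claim */
};

int x2apic_available;	/* pretend firmware state, settable by callers */

static int no_claim(void) { return 0; }
static int x2_probe(void) { return x2apic_available; }

static struct apic_driver drv_flat        = { "flat",          no_claim, no_claim };
static struct apic_driver drv_physflat    = { "physical flat", no_claim, no_claim };
static struct apic_driver drv_x2apic_phys = { "x2apic_phys",   no_claim, x2_probe };

static struct apic_driver *drivers[] = {
	&drv_x2apic_phys, &drv_physflat, &drv_flat,
};
#define NDRIVERS (sizeof(drivers) / sizeof(drivers[0]))

const char *select_apic(const char *cmdline_param)
{
	struct apic_driver *apic = &drv_flat;	/* step 1: default */
	size_t i;

	if (cmdline_param)			/* step 2: apic= param */
		for (i = 0; i < NDRIVERS; i++)
			if (!strcmp(drivers[i]->name, cmdline_param))
				apic = drivers[i];

	for (i = 0; i < NDRIVERS; i++)		/* step 3: MADT check */
		if (drivers[i]->madt_check()) {
			apic = drivers[i];
			break;
		}

	for (i = 0; i < NDRIVERS; i++)		/* step 4: probe() */
		if (drivers[i]->probe()) {
			apic = drivers[i];
			break;
		}

	return apic->name;
}
```

This mirrors the safety argument in the changelog: "apic=physical flat" wins only while no later stage claims the system; once a driver like apic_x2apic_phys probes successfully, step 4 overrides the early param.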
RE: hv_hypercall_pg page permissios
> From: linux-hyperv-ow...@vger.kernel.org > On Behalf Of Dexuan Cui > Sent: Monday, June 15, 2020 10:42 AM > > > > > > Hi hch, > > > The patch is merged into the mainline recently, but unluckily we noticed > > > a warning with CONFIG_DEBUG_WX=y > > > > > > Should we revert this patch, or figure out a way to ask the DEBUG_WX > > > code to ignore this page? > > > > Are you sure it is hv_hypercall_pg? > Yes, 100% sure. I printed the value of hv_hypercall_pg and it matched the > address in the warning line " x86/mm: Found insecure W+X mapping at > address". I did this experiment:
1. export vmalloc_exec and ptdump_walk_pgd_level_checkwx.
2. write a test module that calls them.
3. It turns out that every call of vmalloc_exec() triggers such a warning.
vmalloc_exec() uses PAGE_KERNEL_EXEC, which is defined as (__PP|__RW| 0|___A| 0|___D| 0|___G) It looks like the logic in note_page() is: for_each_RW_page, if the NX bit is unset, then report the page as an insecure W+X mapping. IMO this explains the warning? Thanks, -- Dexuan
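[Editor's note] The note_page() rule described in the mail above — report a mapping as insecure W+X when it is writable and its NX bit is clear — is a one-line predicate. A sketch with illustrative flag bits (not the real x86 page-table layout):

```c
#define FLAG_RW 0x002u	/* writable */
#define FLAG_NX 0x800u	/* no-execute */

/* DEBUG_WX-style check: writable AND executable (NX clear) is the
 * insecure combination */
int is_insecure_wx(unsigned int flags)
{
	return (flags & FLAG_RW) && !(flags & FLAG_NX);
}
```

This is why every vmalloc_exec() allocation trips the warning: PAGE_KERNEL_EXEC-like flags are writable with NX clear, whereas a PAGE_KERNEL_RX-style mapping (read-only, executable) or an ordinary PAGE_KERNEL mapping (writable, NX set) passes the check.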
RE: hv_hypercall_pg page permissios
> From: Vitaly Kuznetsov > Sent: Monday, June 15, 2020 1:35 AM > Dexuan Cui writes: > > >> From: linux-hyperv-ow...@vger.kernel.org > >> On Behalf Of Andy Lutomirski > >> Sent: Tuesday, April 7, 2020 2:01 PM > >> > On Apr 7, 2020, at 12:38 AM, Christoph Hellwig wrote: > >> > > >> > On Tue, Apr 07, 2020 at 09:28:01AM +0200, Vitaly Kuznetsov wrote: > >> >> Christoph Hellwig writes: > >> >> > >> >>> Hi all, > >> >>> > >> >>> The x86 Hyper-V hypercall page (hv_hypercall_pg) is the only allocation > >> >>> in the kernel using __vmalloc with exectutable persmissions, and the > >> >>> only user of PAGE_KERNEL_RX. Is there any good reason it needs to > >> >>> be readable? Otherwise we could use vmalloc_exec and kill off > >> >>> PAGE_KERNEL_RX. Note that before 372b1e91343e6 ("drivers: hv: > Turn > >> off > >> >>> write permission on the hypercall page") it was even mapped writable.. > >> >> > >> >> [There is nothing secret in the hypercall page, by reading it you can > >> >> figure out if you're running on Intel or AMD (VMCALL/VMMCALL) but it's > >> >> likely not the only possible way :-)] > >> >> > >> >> I see no reason for hv_hypercall_pg to remain readable. I just > >> >> smoke-tested > >> > > >> > Thanks, I have the same in my WIP tree, but just wanted to confirm this > >> > makes sense. > >> > >> Just to make sure we’re all on the same page: x86 doesn’t normally have > an > >> execute-only mode. Executable memory in the kernel is readable unless you > >> are using fancy hypervisor-based XO support. > > > > Hi hch, > > The patch is merged into the mainine recently, but unluckily we noticed > > a warning with CONFIG_DEBUG_WX=y (it looks typically this config is defined > > by default in Linux distros, at least in Ubuntu 18.04's > > /boot/config-4.18.0-11-generic). > > > > Should we revert this patch, or figure out a way to ask the DEBUG_WX code > to > > ignore this page? > > > > Are you sure it is hv_hypercall_pg? Yes, 100% sure. 
I printed the value of hv_hypercall_pg and it matched the address in the warning line " x86/mm: Found insecure W+X mapping at address". > AFAIU it shouldn't be W+X as we > are allocating it with vmalloc_exec(). In other words, if you revert > 78bb17f76edc, does the issue go away? > > Vitaly Yes, the warning goes away if I revert 78bb17f76edc ("x86/hyperv: use vmalloc_exec for the hypercall page") and 88dca4ca5a93 ("mm: remove the pgprot argument to __vmalloc"). (I have to revert the second as well with some manual adjustments, since __vmalloc() has 2 parameters now.) Thanks, Dexuan
RE: [PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)
> From: Dexuan Cui > Sent: Sunday, May 31, 2020 9:49 AM > To: t...@linutronix.de; mi...@redhat.com; rdun...@infradead.org; > b...@alien8.de; h...@zytor.com; x...@kernel.org; pet...@infradead.org; > alli...@lohutok.net; alexios.zav...@intel.com; gre...@linuxfoundation.org; > Dexuan Cui ; na...@vmware.com; Michael Kelley > ; Long Li > Cc: linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org > Subject: [PATCH v3] x86/apic/flat64: Add back the early_param("apic", > parse_apic) > > parse_apic() allows the user to try a different APIC driver than the > default one that's automatically chosen. It works for X86-32, but > doesn't work for X86-64 because it was removed in 2009 for X86-64 by > commit 7b38725318f4 ("x86: remove subarchitecture support code"), > whose changelog doesn't explicitly describe the removal for X86-64. > > The patch adds back the functionality for X86-64. The intent is mainly > to work around an APIC emulation bug in Hyper-V in the case of kdump: > currently Hyper-V does not honor the disabled state of the local APICs, > so all the IOAPIC-based interrupts may not be delivered to the correct > virtual CPU, if the logical-mode APIC driver is used (the kdump > kernel usually uses the logical-mode APIC driver, since typically > only 1 CPU is active). Luckily the kdump issue can be worked around by > forcing the kdump kernel to use physical mode, before the fix to Hyper-V > becomes widely available. > > The current algorithm of choosing an APIC driver is: > > 1. The global pointer "struct apic *apic" has a default value, i.e > "apic_default" on X86-32, and "apic_flat" on X86-64. > > 2. If the early_param "apic=" is specified, parse_apic() is called and > the pointer "apic" is changed if a matching APIC driver is found. > > 3. default_acpi_madt_oem_check() calls the acpi_madt_oem_check() method > of all APIC drivers, which may override the "apic" pointer. > > 4. default_setup_apic_routing() may override the "apic" pointer, e.g. 
> by calling the probe() method of all APIC drivers. Note: refer to the > order of the APIC drivers specified in arch/x86/kernel/apic/Makefile. > > The patch is safe because if the apic= early param is not specified, > the current algorithm of choosing an APIC driver is unchanged; when the > param is specified (e.g. on X86-64, "apic=physical flat"), the kernel > still tries to find a "more suitable" APIC driver in the above step 3 and > 4: e.g. if the BIOS/firmware requires that apic_x2apic_phys should be used, > the above step 4 will override the APIC driver to apic_x2apic_phys, even > if an early_param "apic=physical flat" is specified. > > On Hyper-V, when a Linux VM has <= 8 virtual CPUs, if we use > "apic=physical flat", sending IPIs to multiple vCPUs is still fast because > Linux VM uses the para-virtualized IPI hypercalls: see hv_apic_init(). > > The patch adds the __init tag for flat_acpi_madt_oem_check() and > physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1": > flat_acpi_madt_oem_check() accesses cmdline_apic, which has a __initdata > tag. > > Fixes: 7b38725318f4 ("x86: remove subarchitecture support code") > Signed-off-by: Dexuan Cui > --- > > Changes in v2: > Updated Documentation/admin-guide/kernel-parameters.txt. [Randy > Dunlap] > Changed apic_set_verbosity(). > Enhanced the changelog. > > Changes in v3: > Added the __init tag for flat_acpi_madt_oem_check() and > physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1". > (Thanks to kbuild test robot ). > > Updated the changelog for the __init tag. 
> > .../admin-guide/kernel-parameters.txt | 11 +-- > arch/x86/kernel/apic/apic.c | 11 +++ > arch/x86/kernel/apic/apic_flat_64.c | 31 +-- > 3 files changed, 40 insertions(+), 13 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 7bc83f3d9bdf..c4503fff9348 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -341,10 +341,15 @@ > Format: { quiet (default) | verbose | debug } > Change the amount of debugging information output > when initialising the APIC and IO-APIC components. > - For X86-32, this can also be used to specify an APIC > - driver name. > + This can also be used to specify an APIC driver nam
RE: hv_hypercall_pg page permissios
> From: linux-hyperv-ow...@vger.kernel.org > On Behalf Of Andy Lutomirski > Sent: Tuesday, April 7, 2020 2:01 PM > To: Christoph Hellwig > Cc: vkuznets ; x...@kernel.org; > linux-hyp...@vger.kernel.org; linux-kernel@vger.kernel.org; KY Srinivasan > ; Stephen Hemminger ; > Andy Lutomirski ; Peter Zijlstra > Subject: Re: hv_hypercall_pg page permissios > > > > On Apr 7, 2020, at 12:38 AM, Christoph Hellwig wrote: > > > > On Tue, Apr 07, 2020 at 09:28:01AM +0200, Vitaly Kuznetsov wrote: > >> Christoph Hellwig writes: > >> > >>> Hi all, > >>> > >>> The x86 Hyper-V hypercall page (hv_hypercall_pg) is the only allocation > >>> in the kernel using __vmalloc with executable permissions, and the > >>> only user of PAGE_KERNEL_RX. Is there any good reason it needs to > >>> be readable? Otherwise we could use vmalloc_exec and kill off > >>> PAGE_KERNEL_RX. Note that before 372b1e91343e6 ("drivers: hv: Turn > off > >>> write permission on the hypercall page") it was even mapped writable.. > >> > >> [There is nothing secret in the hypercall page, by reading it you can > >> figure out if you're running on Intel or AMD (VMCALL/VMMCALL) but it's > >> likely not the only possible way :-)] > >> > >> I see no reason for hv_hypercall_pg to remain readable. I just > >> smoke-tested > > > > Thanks, I have the same in my WIP tree, but just wanted to confirm this > > makes sense. > > Just to make sure we’re all on the same page: x86 doesn’t normally have an > execute-only mode. Executable memory in the kernel is readable unless you > are using fancy hypervisor-based XO support. Hi hch, The patch is merged into the mainline recently, but unluckily we noticed a warning with CONFIG_DEBUG_WX=y (it looks like this config is typically enabled by default in Linux distros, at least in Ubuntu 18.04's /boot/config-4.18.0-11-generic). Should we revert this patch, or figure out a way to ask the DEBUG_WX code to ignore this page? 
[ 19.387536] debug: unmapping init [mem 0x82713000-0x82886fff]
[ 19.431766] Write protecting the kernel read-only data: 18432k
[ 19.438662] debug: unmapping init [mem 0x81c02000-0x81df]
[ 19.446830] debug: unmapping init [mem 0x821d6000-0x821f]
[ 19.522368] [ cut here ]
[ 19.527495] x86/mm: Found insecure W+X mapping at address 0xc9012000
[ 19.535066] WARNING: CPU: 26 PID: 1 at arch/x86/mm/dump_pagetables.c:248 note_page+0x639/0x690
[ 19.539038] Modules linked in:
[ 19.539038] CPU: 26 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #1
[ 19.539038] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 19.539038] RIP: 0010:note_page+0x639/0x690
[ 19.539038] Code: fe ff ff 31 c0 e9 a0 fe ff ff 80 3d 39 d1 31 01 00 0f 85 76 fa ff ff 48 c7 c7 98 55 0a 82 c6 05 25 d1 31 01 01 e8 f7 c9 00 00 <0f> 0b e9 5c fa ff ff 48 83 c0 18 48 c7 45 68 00 00 00 00 48 89 45
[ 19.539038] RSP: :c90003137cb0 EFLAGS: 00010282
[ 19.539038] RAX: RBX: RCX: 0007
[ 19.539038] RDX: RSI: RDI: 810fa9c4
[ 19.539038] RBP: c90003137ea0 R08: R09:
[ 19.539038] R10: 0001 R11: R12: c9013000
[ 19.539038] R13: R14: c91ff000 R15:
[ 19.539038] FS: () GS:8884dad0() knlGS:
[ 19.539038] CS: 0010 DS: ES: CR0: 80050033
[ 19.539038] CR2: CR3: 02210001 CR4: 003606e0
[ 19.539038] DR0: DR1: DR2:
[ 19.539038] DR3: DR6: fffe0ff0 DR7: 0400
[ 19.539038] Call Trace:
[ 19.539038] ptdump_pte_entry+0x39/0x40
[ 19.539038] __walk_page_range+0x5b7/0x960
[ 19.539038] walk_page_range_novma+0x7e/0xd0
[ 19.539038] ptdump_walk_pgd+0x53/0x90
[ 19.539038] ptdump_walk_pgd_level_core+0xdf/0x110
[ 19.539038] ? ptdump_walk_pgd_level_debugfs+0x40/0x40
[ 19.539038] ? hugetlb_get_unmapped_area+0x2f0/0x2f0
[ 19.703692] ? rest_init+0x24d/0x24d
[ 19.703692] ? rest_init+0x24d/0x24d
[ 19.703692] kernel_init+0x2c/0x113
[ 19.703692] ret_from_fork+0x24/0x30
[ 19.703692] irq event stamp: 2840666
[ 19.703692] hardirqs last enabled at (2840665): [] console_unlock+0x444/0x5b0
[ 19.703692] hardirqs last disabled at (2840666): [] trace_hardirqs_off_thunk+0x1a/0x1c
[ 19.703692] softirqs last enabled at (2840662): [] __do_softirq+0x366/0x490
[ 19.703692] softirqs last disabled at (2840655): [] irq_exit+0xe8/0x100
[ 19.703692] ---[ end trace 99ca90806a8e657c ]---
[ 19.786235] x86/mm: Checked W+X mappings: FAILED, 1 W+X pages found.
[ 19.793298] rodata_test: all tests were successful
[ 19.798508] x86/mm: Checking user space page tables
[ 19.818007] x86/mm: Checked W+X mappin
RE: [PATCH] scsi: storvsc: Remove memset before memory freeing in storvsc_suspend()
> From: Denis Efremov > Sent: Friday, June 5, 2020 1:00 AM > To: Dexuan Cui ; Michael Kelley > > Cc: Denis Efremov ; James E . J . Bottomley > ; Martin K . Petersen ; > linux-hyp...@vger.kernel.org; Linux SCSI List ; > Linux Kernel Mailing List > Subject: [PATCH] scsi: storvsc: Remove memset before memory freeing in > storvsc_suspend() > > Remove memset with 0 for stor_device->stor_chns in storvsc_suspend() > before the call to kfree() as the memory contains no sensitive information. > > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation") > Suggested-by: Dexuan Cui > Signed-off-by: Denis Efremov > --- > drivers/scsi/storvsc_drv.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index 072ed8728657..2d90cddd8ac2 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -2035,9 +2035,6 @@ static int storvsc_suspend(struct hv_device > *hv_dev) > > vmbus_close(hv_dev->channel); > > - memset(stor_device->stor_chns, 0, > -num_possible_cpus() * sizeof(void *)); > - > kfree(stor_device->stor_chns); > stor_device->stor_chns = NULL; > > -- Denis, thank you for fixing this! Reviewed-by: Dexuan Cui
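[Editor's note] The rationale above — zeroing a buffer before freeing it matters only for sensitive data — can be illustrated with a small userspace sketch (names hypothetical; in the kernel, the sensitive path is what kzfree() provides):

```c
#include <stdlib.h>

/* Scrub through a volatile pointer so the stores are not eliminated
 * as dead writes preceding free() */
void scrub(void *buf, size_t len)
{
	volatile unsigned char *p = buf;

	while (len--)
		*p++ = 0;
}

void free_buffer(void *buf, size_t len, int holds_secrets)
{
	if (holds_secrets)
		scrub(buf, len);	/* the kzfree()-style path */
	free(buf);			/* ordinary data, like stor_chns:
					 * a plain free is enough */
}
```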
RE: [PATCH] scsi: storvsc: Use kzfree() in storvsc_suspend()
> From: Dexuan Cui > Sent: Thursday, June 4, 2020 2:50 PM > > > > Can you please make a v2 patch for it and Cc my corporate email "decui" > > > (in > To)? > > > > Yes, of course. Could I add "Suggested-by"? > > > > Thanks, > > Denis Sure. Please also add a tag: Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation") Thanks, Dexuan
RE: [PATCH] scsi: storvsc: Use kzfree() in storvsc_suspend()
> From: Denis Efremov > Sent: Thursday, June 4, 2020 2:43 PM > > > > Hi Denis, > > When I added the function storvsc_suspend() several months ago, somehow > > I forgot to remove the unnecessary memset(). Sorry! > > > > The buffer is recreated in storvsc_resume() -> storvsc_connect_to_vsp() -> > > storvsc_channel_init() -> stor_device->stor_chns = kcalloc(...), so I > > believe > > the memset() can be safely removed. > > I'm not sure that I understand your description. As for me, memset with 0 > before > memory freeing is required only for sensitive information and it's completely > unrelated to memory zeroing during allocation with kzalloc/kcalloc. > If it's not a sensitive information then memset could be safely removed. There is no sensitive info in the buffer here. > > Can you please make a v2 patch for it and Cc my corporate email "decui" (in > To)? > > Yes, of course. Could I add "Suggested-by"? > > Thanks, > Denis Sure. Thanks, -- Dexuan