[PATCH] x86/hyperv: suppress "PCI: Fatal: No config space access function found"
A Generatin-2 Linux VM on Hyper-V doesn't have the legacy PCI bus, and users always see the scary warning, which is actually harmless. The patch is made to suppress the warning. Signed-off-by: Dexuan Cui Cc: K. Y. Srinivasan Cc: Haiyang Zhang Cc: Stephen Hemminger --- arch/x86/hyperv/hv_init.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 20c876c..7abb09e 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -17,6 +17,7 @@ * */ +#include #include #include #include @@ -253,6 +254,22 @@ static int hv_cpu_die(unsigned int cpu) return 0; } +static int __init hv_pci_init(void) +{ + int gen2vm = efi_enabled(EFI_BOOT); + + /* +* For Generation-2 VM, we exit from pci_arch_init() by returning 0. +* The purpose is to suppress the harmless warning: +* "PCI: Fatal: No config space access function found" +*/ + if (gen2vm) + return 0; + + /* For Generation-1 VM, we'll proceed in pci_arch_init(). */ + return 1; +} + /* * This function is to be invoked early in the boot sequence after the * hypervisor has been detected. @@ -329,6 +346,8 @@ void __init hyperv_init(void) hv_apic_init(); + x86_init.pci.arch_init = hv_pci_init; + /* * Register Hyper-V specific clocksource. */ -- 2.7.4 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
> From: devel On Behalf Of > KY Srinivasan > Sent: Tuesday, October 16, 2018 23:02 > > > --- a/drivers/hv/hv_kvp.c > > > +++ b/drivers/hv/hv_kvp.c > > > @@ -353,6 +353,9 @@ static void process_ib_ipinfo(void *in_msg, void > > *out_msg, int op) > > > > > > out->body.kvp_ip_val.dhcp_enabled = in- > > >kvp_ip_val.dhcp_enabled; > > > > > > + __attribute__ ((fallthrough)); > > > > The comment should be sufficient for this, right? I haven't seen many > > uses of this attribute before, how common is it? > > Yes; a common should be sufficient. You all are right. Right now, I realized the gcc warning can also be suppressed by a simple line of comment: /* fallthrough */ It looks gcc treats this comment specially. If I add something more in the comment, like /* add fallthrough */ , the warning can not be suppressed. :-) I'll do a new version for KY. Thank you all! -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
> From: devel On Behalf Of > Greg KH > Sent: Tuesday, October 16, 2018 22:07 > > ... > > + case KVP_OP_GET: > > + message->body.kvp_get.data.key_size = > > + utf16s_to_utf8s( > > + (wchar_t *)in_msg->body.kvp_get.data.key, > > + in_msg->body.kvp_get.data.key_size, > > + UTF16_LITTLE_ENDIAN, > > + message->body.kvp_get.data.key, > > + HV_KVP_EXCHANGE_MAX_KEY_SIZE - 1) + 1; > > Worst indentation ever :( > > Yeah, I know it follows the others above it, but you should reconsider > it in the future... > greg k-h Agreed. We should consider improving this in future. -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 5/5] Tools: hv: kvp: Fix a warning of buffer overflow with gcc 8.0.1
> From: devel On Behalf Of > Greg KH > Sent: Tuesday, October 16, 2018 22:07 > > On Wed, Oct 17, 2018 at 03:14:06AM +, k...@linuxonhyperv.com wrote: > > From: Dexuan Cui > > > > The patch fixes: > > > > hv_kvp_daemon.c: In function 'kvp_set_ip_info': > > hv_kvp_daemon.c:1305:2: note: 'snprintf' output between 41 and 4136 bytes > > into a destination of size 4096 > > ... > > --- a/tools/hv/hv_kvp_daemon.c > > +++ b/tools/hv/hv_kvp_daemon.c > > @@ -1176,7 +1176,7 @@ static int kvp_set_ip_info(char *if_name, struct > hv_kvp_ipaddr_value *new_val) > > int error = 0; > > char if_file[PATH_MAX]; > > FILE *file; > > - char cmd[PATH_MAX]; > > + char cmd[PATH_MAX * 2]; > > Overkill? Well, I found the -Wformat-truncation warning can be suppressed by checking the return value of snprintf, so I'm going to do a new patch for KY to resend like this: @@ -1178,6 +1178,7 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) FILE *file; char cmd[PATH_MAX]; char *mac_addr; + int str_len; /* * Set the configuration for the specified interface with @@ -1301,8 +1302,19 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) * invoke the external script to do its magic. */ - snprintf(cmd, sizeof(cmd), KVP_SCRIPTS_PATH "%s %s", -"hv_set_ifconfig", if_file); + str_len = snprintf(cmd, sizeof(cmd), KVP_SCRIPTS_PATH "%s %s", +"hv_set_ifconfig", if_file); + + /* +* This is a little overcautious, but it's necessary to suppress some +* false warnings from gcc 8.0.1. +*/ + if (str_len <= 0 || (unsigned int)str_len >= sizeof(cmd)) { + syslog(LOG_ERR, "Cmd '%s' (len=%d) may be too long", + cmd, str_len); + return HV_E_FAIL; + } Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH V2 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
> From: Michael Kelley > Sent: Wednesday, October 24, 2018 08:38 > From: k...@linuxonhyperv.com Sent: Wednesday, > October 17, 2018 10:10 PM > > From: Dexuan Cui > > > > In kvp_send_key(), we do need call process_ib_ipinfo() if > > message->kvp_hdr.operation is KVP_OP_GET_IP_INFO, because it turns out > > the userland hv_kvp_daemon needs the info of operation, adapter_id and > > addr_family. With the incorrect fc62c3b1977d, the host can't get the > > VM's IP via KVP. > > > > And, fc62c3b1977d added a "break;", but actually forgot to initialize > > the key_size/value in the case of KVP_OP_SET, so the default key_size of > > 0 is passed to the kvp daemon, and the pool files > > /var/lib/hyperv/.kvp_pool_* can't be updated. > > > > This patch effectively rolls back the previous fc62c3b1977d, and > > correctly fixes the "this statement may fall through" warnings. > > > > This patch is tested on WS 2012 R2 and 2016. > > > > Fixes: fc62c3b1977d ("Drivers: hv: kvp: Fix two "this statement may fall > through" warnings") > > Signed-off-by: Dexuan Cui > > Cc: K. Y. Srinivasan > > Cc: Haiyang Zhang > > Cc: Stephen Hemminger > > Cc: > > Signed-off-by: K. Y. Srinivasan > > --- > > drivers/hv/hv_kvp.c | 26 ++ > > 1 file changed, 22 insertions(+), 4 deletions(-) > > > Reviewed-by: Michael Kelley Hi Greg, Can you please take a look at this patch? We need it to fix a regression introduced by me in: Fixes: fc62c3b1977d ("Drivers: hv: kvp: Fix two "this statement may fall through" warnings") The faulty patch is being meged into the old stable kernels... So we need to take this patch ASAP. Thanks! -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH V2 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
> From: gre...@linuxfoundation.org > Sent: Thursday, November 1, 2018 11:57 > To: Dexuan Cui > > On Wed, Oct 31, 2018 at 11:23:54PM +0000, Dexuan Cui wrote: > > > From: Michael Kelley > > > Sent: Wednesday, October 24, 2018 08:38 > > > From: k...@linuxonhyperv.com Sent: > Wednesday, > > > October 17, 2018 10:10 PM > > > > From: Dexuan Cui > > > > > > > > In kvp_send_key(), we do need call process_ib_ipinfo() if > > > > message->kvp_hdr.operation is KVP_OP_GET_IP_INFO, because it turns > out > > > > the userland hv_kvp_daemon needs the info of operation, adapter_id > and > > > > addr_family. With the incorrect fc62c3b1977d, the host can't get the > > > > VM's IP via KVP. > > > > > > > > And, fc62c3b1977d added a "break;", but actually forgot to initialize > > > > the key_size/value in the case of KVP_OP_SET, so the default key_size of > > > > 0 is passed to the kvp daemon, and the pool files > > > > /var/lib/hyperv/.kvp_pool_* can't be updated. > > > > > > > > This patch effectively rolls back the previous fc62c3b1977d, and > > > > correctly fixes the "this statement may fall through" warnings. > > > > > > > > This patch is tested on WS 2012 R2 and 2016. > > > > > > > > Fixes: fc62c3b1977d ("Drivers: hv: kvp: Fix two "this statement may fall > > > through" warnings") > > > > Signed-off-by: Dexuan Cui > > > > Cc: K. Y. Srinivasan > > > > Cc: Haiyang Zhang > > > > Cc: Stephen Hemminger > > > > Cc: > > > > Signed-off-by: K. Y. Srinivasan > > > > --- > > > > drivers/hv/hv_kvp.c | 26 ++ > > > > 1 file changed, 22 insertions(+), 4 deletions(-) > > > > > > > Reviewed-by: Michael Kelley > > > > Hi Greg, > > Can you please take a look at this patch? > > Nope, I'm not the hv maintainer, they need to look at this and ack it, > not me :) > > greg k-h Hi Greg, KY has added his Signed-off-by in the mail. I'll ask the other HV maintainers to take a look as well. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH V2 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
> From: gre...@linuxfoundation.org > Sent: Thursday, November 1, 2018 21:54 > To: Dexuan Cui > Cc: Michael Kelley ; KY Srinivasan > ; linux-ker...@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; Stephen Hemminger ; > vkuznets ; Sasha Levin > ; Haiyang Zhang ; > sta...@vger.kernel.org > Subject: Re: [PATCH V2 3/5] Drivers: hv: kvp: Fix the recent regression caused > by incorrect clean-up > > On Thu, Nov 01, 2018 at 07:22:28PM +, Dexuan Cui wrote: > > > From: gre...@linuxfoundation.org > > > Sent: Thursday, November 1, 2018 11:57 > > > To: Dexuan Cui > > > > > > On Wed, Oct 31, 2018 at 11:23:54PM +, Dexuan Cui wrote: > > > > > From: Michael Kelley > > > > > Sent: Wednesday, October 24, 2018 08:38 > > > > > From: k...@linuxonhyperv.com Sent: > > > Wednesday, > > > > > October 17, 2018 10:10 PM > > > > > > From: Dexuan Cui > > > > > > > > > > > > In kvp_send_key(), we do need call process_ib_ipinfo() if > > > > > > message->kvp_hdr.operation is KVP_OP_GET_IP_INFO, because it > turns > > > out > > > > > > the userland hv_kvp_daemon needs the info of operation, adapter_id > > > and > > > > > > addr_family. With the incorrect fc62c3b1977d, the host can't get the > > > > > > VM's IP via KVP. > > > > > > > > > > > > And, fc62c3b1977d added a "break;", but actually forgot to > > > > > > initialize > > > > > > the key_size/value in the case of KVP_OP_SET, so the default > > > > > > key_size > of > > > > > > 0 is passed to the kvp daemon, and the pool files > > > > > > /var/lib/hyperv/.kvp_pool_* can't be updated. > > > > > > > > > > > > This patch effectively rolls back the previous fc62c3b1977d, and > > > > > > correctly fixes the "this statement may fall through" warnings. > > > > > > > > > > > > This patch is tested on WS 2012 R2 and 2016. > > > > > > > > > > > > Fixes: fc62c3b1977d ("Drivers: hv: kvp: Fix two "this statement may > fall > > > > > through" warnings") > > > > > > Signed-off-by: Dexuan Cui > > > > > > Cc: K. Y. Srinivasan > > > > > > Cc: Haiyang Zhang > > > > > > Cc: Stephen Hemminger > > > > > > Cc: > > > > > > Signed-off-by: K. Y. Srinivasan > > > > > > --- > > > > > > drivers/hv/hv_kvp.c | 26 ++ > > > > > > 1 file changed, 22 insertions(+), 4 deletions(-) > > > > > > > > > > > Reviewed-by: Michael Kelley > > > > > > > > Hi Greg, > > > > Can you please take a look at this patch? > > > > > > Nope, I'm not the hv maintainer, they need to look at this and ack it, > > > not me :) > > > > > > greg k-h > > > > Hi Greg, > > KY has added his Signed-off-by in the mail. > > > > I'll ask the other HV maintainers to take a look as well. > > Ok, then I'll look at it after 4.20-rc1 is out, nothing I can do until > then anyway... > > thanks, > > greg k-h Hi Greg, Can you please take a look at the patch now? The patch has received Reviewed-by: Michael Kelley Signed-off-by: Haiyang Zhang Signed-off-by: K. Y. Srinivasan Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 2/2] Drivers: hv: vmbus: offload the handling of channels to two workqueues
> From: devel On Behalf Of > Greg KH > Sent: Monday, November 26, 2018 11:35 AM > As Sasha pointed out, this patch does not even apply :( Sorry, I'll rebase to char-misc's char-misc-testing branch, which has had one of the patches, i.e. 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel()") Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] [repost] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
vmbus_process_offer() mustn't call channel->sc_creation_callback() directly for sub-channels, because sc_creation_callback() -> vmbus_open() may never get the host's response to the OPEN_CHANNEL message (the host may rescind a channel at any time, e.g. in the case of hot removing a NIC), and vmbus_onoffer_rescind() may not wake up the vmbus_open() as it's blocked due to a non-zero vmbus_connection.offer_in_progress, and finally we have a deadlock. The above is also true for primary channels, if the related device drivers use sync probing mode by default. And, usually the handling of primary channels and sub-channels can depend on each other, so we should offload them to different workqueues to avoid possible deadlock, e.g. in sync-probing mode, NIC1's netvsc_subchan_work() can race with NIC2's netvsc_probe() -> rtnl_lock(), and causes deadlock: the former gets the rtnl_lock and waits for all the sub-channels to appear, but the latter can't get the rtnl_lock and this blocks the handling of sub-channels. The patch can fix the multiple-NIC deadlock described above for v3.x kernels (e.g. RHEL 7.x) which don't support async-probing of devices, and v4.4, v4.9, v4.14 and v4.18 which support async-probing but don't enable async-probing for Hyper-V drivers (yet). The patch can also fix the hang issue in sub-channel's handling described above for all versions of kernels, including v4.19 and v4.20-rc4. So, actually the patch should be applied to all the existing kernels, not only the kernels that have 8195b1396ec8. Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug") Cc: sta...@vger.kernel.org Cc: Stephen Hemminger Cc: K. Y. Srinivasan Cc: Haiyang Zhang Signed-off-by: Dexuan Cui Signed-off-by: K. Y. Srinivasan --- There is no change in this repost. I just rebased this patch to today's char-misc's char-misc-next branch. Previously KY posted the patch with his Signed-off-by (which is kept in this repost), but there was a conflict issue. Note: the patch can't be cleanly applied to char-misc's char-misc-linus branch -- to do that, we need to cherry-pick the supporting patch first: 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel()") drivers/hv/channel_mgmt.c | 188 +++--- drivers/hv/connection.c | 24 +- drivers/hv/hyperv_vmbus.h | 7 ++ include/linux/hyperv.h| 7 ++ 4 files changed, 161 insertions(+), 65 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 82e6736..d016890 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -434,60 +434,16 @@ void vmbus_free_channels(void) } } -/* - * vmbus_process_offer - Process the offer by creating a channel/device - * associated with this offer - */ -static void vmbus_process_offer(struct vmbus_channel *newchannel) +/* Note: the function can run concurrently for primary/sub channels. */ +static void vmbus_add_channel_work(struct work_struct *work) { - struct vmbus_channel *channel; - bool fnew = true; + struct vmbus_channel *newchannel = + container_of(work, struct vmbus_channel, add_channel_work); + struct vmbus_channel *primary_channel = newchannel->primary_channel; unsigned long flags; u16 dev_type; int ret; - /* Make sure this is a new offer */ - mutex_lock(&vmbus_connection.channel_mutex); - - /* -* Now that we have acquired the channel_mutex, -* we can release the potentially racing rescind thread. -*/ - atomic_dec(&vmbus_connection.offer_in_progress); - - list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) { - if (!uuid_le_cmp(channel->offermsg.offer.if_type, - newchannel->offermsg.offer.if_type) && - !uuid_le_cmp(channel->offermsg.offer.if_instance, - newchannel->offermsg.offer.if_instance)) { - fnew = false; - break; - } - } - - if (fnew) - list_add_tail(&newchannel->listentry, - &vmbus_connection.chn_list); - - mutex_unlock(&vmbus_connection.channel_mutex); - - if (!fnew) { - /* -* Check to see if this is a sub-channel. -*/ - if (newchannel->offermsg.offer.sub_channel_index != 0) { - /* -* Process the sub-channel. -*/ - newchannel->primary_channel = channel; - spin_lock_irqsave(&channel->lock, flags); - list_add_tail(&newchannel->sc_list, &channel->sc_list); - spin_unlo
RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
> From: gre...@linuxfoundation.org > Sent: Wednesday, November 28, 2018 11:45 PM > > > > There is no change in this repost. I just rebased this patch to today's > > char-misc's char-misc-next branch. Previously KY posted the patch with his > > Signed-off-by (which is kept in this repost), but there was a conflict > > issue. > > > > Note: the patch can't be cleanly applied to char-misc's char-misc-linus > > branch > -- > > to do that, we need to cherry-pick the supporting patch first: > > 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API > vmbus_get_outgoing_channel()") > > That is not going to work for the obvious reason that this dependant > patch is not going to be merged into 4.20-final. It looks the dependent patch (4d3c5c69191f) is going to miss the v4.20 release. This is not a big issue, as the dependent patch isn't really important. > So, what do you expect us to do here? The only way this can be accepted > is to have it go into my -next branch, which means it will show up in > 4.21-rc1, is that ok? Is there any chance for this patch ("Drivers: hv: vmbus: Offload the handling ...") to go into v4.20? If yes, I can quickly do a rebase to char-misc's char-misc-linus branch, because actually the conflict can be very easily fixed. And I can help to fix any conflict when the dependent patch is backported to v4.20.1. If no, I think this patch and the dependent patch can both go into v4.21, and they can be both backported to v4.20.1 in future. > But then, if that happens, it will fail to apply to any stable tree for > 4.20 and older, like you are asking it to be done for. > > So what do you expect me to do here with this? > > totally confused, > > greg k-h I hope my above reply made me clear. Sorry, I'm not really know how exactly the releasing procedure works... Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
> From: KY Srinivasan > Sent: Friday, November 30, 2018 9:31 AM > > From: Dexuan Cui > > Sent: Thursday, November 29, 2018 12:17 AM > > To: gre...@linuxfoundation.org > > Cc: KY Srinivasan ; Haiyang Zhang > > ; Stephen Hemminger > > ; linux-ker...@vger.kernel.org; > > de...@linuxdriverproject.org; a...@canonical.com; vkuznets > > ; o...@aepfle.de; jasow...@redhat.com; Michael > > Kelley > > Subject: RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of > > channels to two workqueues > > > > > From: gre...@linuxfoundation.org > > > Sent: Wednesday, November 28, 2018 11:45 PM > > > > > > > > There is no change in this repost. I just rebased this patch to today's > > > > char-misc's char-misc-next branch. Previously KY posted the patch with > > his > > > > Signed-off-by (which is kept in this repost), but there was a conflict > > > > issue. > > > > > > > > Note: the patch can't be cleanly applied to char-misc's char-misc-linus > > branch > > > -- > > > > to do that, we need to cherry-pick the supporting patch first: > > > > 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API > > > vmbus_get_outgoing_channel()") > > > > > > That is not going to work for the obvious reason that this dependant > > > patch is not going to be merged into 4.20-final. > > > > It looks the dependent patch (4d3c5c69191f) is going to miss the v4.20 > > release. > > This is not a big issue, as the dependent patch isn't really important. > > > > > So, what do you expect us to do here? The only way this can be accepted > > > is to have it go into my -next branch, which means it will show up in > > > 4.21-rc1, is that ok? > > > > Is there any chance for this patch ("Drivers: hv: vmbus: Offload the > > handling > > ...") to > > go into v4.20? > > > > If yes, I can quickly do a rebase to char-misc's char-misc-linus branch, > > because actually the conflict can be very easily fixed. And I can help to > > fix any > > conflict when the dependent patch is backported to v4.20.1. > > This patch fixes an important bug while the patch this depends on is not > critical. > I suggest we revert the patch that this patch depends on > and we can submit a new version of this patch that can go in now - into 4.20 > release. > > K. Y I agree. Hi Greg, Please let us know what we can do to try to push this important fix into v4.20. Actually it's straightforward, though it looks big. And, we ave done a full testing with the patch. Thanks, --Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
> From: gre...@linuxfoundation.org > Sent: Saturday, December 1, 2018 11:34 PM > To: Dexuan Cui > Cc: KY Srinivasan ; Haiyang Zhang > ; Stephen Hemminger > ; linux-ker...@vger.kernel.org; > de...@linuxdriverproject.org; a...@canonical.com; vkuznets > ; o...@aepfle.de; jasow...@redhat.com; Michael > Kelley > Subject: Re: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of > channels to two workqueues > > On Fri, Nov 30, 2018 at 06:09:54PM +, Dexuan Cui wrote: > > > From: KY Srinivasan > > > Sent: Friday, November 30, 2018 9:31 AM > > > > From: Dexuan Cui > > > > Sent: Thursday, November 29, 2018 12:17 AM > > > > To: gre...@linuxfoundation.org > > > > Cc: KY Srinivasan ; Haiyang Zhang > > > > ; Stephen Hemminger > > > > ; linux-ker...@vger.kernel.org; > > > > de...@linuxdriverproject.org; a...@canonical.com; vkuznets > > > > ; o...@aepfle.de; jasow...@redhat.com; > Michael > > > > Kelley > > > > Subject: RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling > > > > of > > > > channels to two workqueues > > > > > > > > > From: gre...@linuxfoundation.org > > > > > Sent: Wednesday, November 28, 2018 11:45 PM > > > > > > > > > > > > There is no change in this repost. I just rebased this patch to > > > > > > today's > > > > > > char-misc's char-misc-next branch. Previously KY posted the patch > with > > > > his > > > > > > Signed-off-by (which is kept in this repost), but there was a > > > > > > conflict > issue. > > > > > > > > > > > > Note: the patch can't be cleanly applied to char-misc's > char-misc-linus > > > > branch > > > > > -- > > > > > > to do that, we need to cherry-pick the supporting patch first: > > > > > > 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API > > > > > vmbus_get_outgoing_channel()") > > > > > > > > > > That is not going to work for the obvious reason that this dependant > > > > > patch is not going to be merged into 4.20-final. > > > > > > > > It looks the dependent patch (4d3c5c69191f) is going to miss the v4.20 > > > > release. > > > > This is not a big issue, as the dependent patch isn't really important. > > > > > > > > > So, what do you expect us to do here? The only way this can be > accepted > > > > > is to have it go into my -next branch, which means it will show up in > > > > > 4.21-rc1, is that ok? > > > > > > > > Is there any chance for this patch ("Drivers: hv: vmbus: Offload the > handling > > > > ...") to > > > > go into v4.20? > > > > > > > > If yes, I can quickly do a rebase to char-misc's char-misc-linus branch, > > > > because actually the conflict can be very easily fixed. And I can help > > > > to > fix any > > > > conflict when the dependent patch is backported to v4.20.1. > > > > > > This patch fixes an important bug while the patch this depends on is not > > > critical. > > > I suggest we revert the patch that this patch depends on > > > and we can submit a new version of this patch that can go in now - into > 4.20 > > > release. > > > > > > K. Y > > > > I agree. > > > > Hi Greg, > > Please let us know what we can do to try to push this important fix into > v4.20. > > > > Actually it's straightforward, though it looks big. And, we ave done a full > testing > > with the patch. > > Ok, you all need to figure this out on your own, sorry. Please give me > a patch that I can actually apply to the tree if you need it merged into > 4.20-final. This shouldn't be tough, you all have been doing this long > enough by now... > > I have no bandwidth to do this myself for you, > > greg k-h Hi Greg, Please use the attached patch: I rebased the patch to today's char-misc's char-misc-linus branch. It can also cleanly apply to Linus's master branch today. Please let me know in case you need me to re-post the patch in a new mail (rather than the attachment here) or you need me to rebase the patch to a different tree/branch. Thanks, Dexuan 0001-Drivers-hv-vmbus-Offload-the-handling-of-channels-to.patch Description: 0001-Drivers-hv-vmbus-Offload-the-handling-of-channels-to.patch ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
> From: gre...@linuxfoundation.org > Sent: Sunday, December 2, 2018 7:33 AM > To: Dexuan Cui > Cc: KY Srinivasan ; Haiyang Zhang > ; Stephen Hemminger > ; linux-ker...@vger.kernel.org; > de...@linuxdriverproject.org; a...@canonical.com; vkuznets > ; o...@aepfle.de; jasow...@redhat.com; Michael > Kelley > Subject: Re: [PATCH] [repost] Drivers: hv: vmbus: Offload the handling of > channels to two workqueues > > On Sun, Dec 02, 2018 at 08:47:03AM +, Dexuan Cui wrote: > > Hi Greg, > > Please use the attached patch: I rebased the patch to today's char-misc's > > char-misc-linus branch. It can also cleanly apply to Linus's master branch > > today. > > I can't use an attached patch, you know better. Please fix up and > resend properly. > > greg k-h Ok, let me re-send it right now. BTW, Linus's tree was updated to v4.20-rc5 98 minutes ago. This patch can still cleanly apply there, and on your char-misc-linus branch. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] [REPOST for the char-misc-linus branch] Drivers: hv: vmbus: Offload the handling of channels to two workqueues
vmbus_process_offer() mustn't call channel->sc_creation_callback() directly for sub-channels, because sc_creation_callback() -> vmbus_open() may never get the host's response to the OPEN_CHANNEL message (the host may rescind a channel at any time, e.g. in the case of hot removing a NIC), and vmbus_onoffer_rescind() may not wake up the vmbus_open() as it's blocked due to a non-zero vmbus_connection.offer_in_progress, and finally we have a deadlock. The above is also true for primary channels, if the related device drivers use sync probing mode by default. And, usually the handling of primary channels and sub-channels can depend on each other, so we should offload them to different workqueues to avoid possible deadlock, e.g. in sync-probing mode, NIC1's netvsc_subchan_work() can race with NIC2's netvsc_probe() -> rtnl_lock(), and causes deadlock: the former gets the rtnl_lock and waits for all the sub-channels to appear, but the latter can't get the rtnl_lock and this blocks the handling of sub-channels. The patch can fix the multiple-NIC deadlock described above for v3.x kernels (e.g. RHEL 7.x) which don't support async-probing of devices, and v4.4, v4.9, v4.14 and v4.18 which support async-probing but don't enable async-probing for Hyper-V drivers (yet). The patch can also fix the hang issue in sub-channel's handling described above for all versions of kernels, including v4.19 and v4.20-rc4. So actually the patch should be applied to all the existing kernels, not only the kernels that have 8195b1396ec8. Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug") Cc: sta...@vger.kernel.org Cc: Stephen Hemminger Cc: K. Y. Srinivasan Cc: Haiyang Zhang Signed-off-by: Dexuan Cui Signed-off-by: K. Y. Srinivasan --- [Dec. 2, 2018] Dexuan: I rebased this patch to today's char-misc's char-misc-linus branch, whose top commit is: d8f190ee836a ("Merge branch 'akpm' (patches from Andrew)") Previously KY posted the patch with his Signed-off-by (which is kept in this repost), but there was a conflict issue. We have done a full testing with the patch. We hope the important patch can be in the final v4.20. Thanks! drivers/hv/channel_mgmt.c | 189 ++ drivers/hv/connection.c | 24 +- drivers/hv/hyperv_vmbus.h | 7 ++ include/linux/hyperv.h| 7 ++ 4 files changed, 161 insertions(+), 66 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 6277597d..edd34c1 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -435,61 +435,16 @@ void vmbus_free_channels(void) } } -/* - * vmbus_process_offer - Process the offer by creating a channel/device - * associated with this offer - */ -static void vmbus_process_offer(struct vmbus_channel *newchannel) +/* Note: the function can run concurrently for primary/sub channels. */ +static void vmbus_add_channel_work(struct work_struct *work) { - struct vmbus_channel *channel; - bool fnew = true; + struct vmbus_channel *newchannel = + container_of(work, struct vmbus_channel, add_channel_work); + struct vmbus_channel *primary_channel = newchannel->primary_channel; unsigned long flags; u16 dev_type; int ret; - /* Make sure this is a new offer */ - mutex_lock(&vmbus_connection.channel_mutex); - - /* -* Now that we have acquired the channel_mutex, -* we can release the potentially racing rescind thread. -*/ - atomic_dec(&vmbus_connection.offer_in_progress); - - list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) { - if (!uuid_le_cmp(channel->offermsg.offer.if_type, - newchannel->offermsg.offer.if_type) && - !uuid_le_cmp(channel->offermsg.offer.if_instance, - newchannel->offermsg.offer.if_instance)) { - fnew = false; - break; - } - } - - if (fnew) - list_add_tail(&newchannel->listentry, - &vmbus_connection.chn_list); - - mutex_unlock(&vmbus_connection.channel_mutex); - - if (!fnew) { - /* -* Check to see if this is a sub-channel. -*/ - if (newchannel->offermsg.offer.sub_channel_index != 0) { - /* -* Process the sub-channel. -*/ - newchannel->primary_channel = channel; - spin_lock_irqsave(&channel->lock, flags); - list_add_tail(&newchannel->sc_list, &channel->sc_list); - channel->num_sc++; - spin_unlock_irqre
RE: [PATCH] vmbus: fix subchannel removal
> From: Stephen Hemminger > Sent: Friday, December 7, 2018 10:59 AM > To: KY Srinivasan ; Haiyang Zhang > ; Dexuan Cui ; > mga...@redhat.com > Cc: de...@linuxdriverproject.org; Stephen Hemminger > > Subject: [PATCH] vmbus: fix subchannel removal > > The changes to split ring allocation from open/close, broke > the cleanup of subchannels. This resulted in problems using > uio on network devices because the subchannel was left behind > when the network device was unbound. > > The cause was in the disconnect logic which used list splice > to move the subchannel list into a local variable. This won't > work because the subchannel list is needed later during the > process of the rescind messages (relid2channel). > > The fix is to just leave the subchannel list in place > which is what the original code did. The list is cleaned > up later when the host rescind is processed. > > Fixes: ae6935ed7d42 ("vmbus: split ring buffer allocation from open") > Signed-off-by: Stephen Hemminger > --- > drivers/hv/channel.c | 10 +- > 1 file changed, 1 insertion(+), 9 deletions(-) > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > index fe00b12e4417..bea4c9850247 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -701,20 +701,12 @@ static int vmbus_close_internal(struct > vmbus_channel *channel) > int vmbus_disconnect_ring(struct vmbus_channel *channel) > { > struct vmbus_channel *cur_channel, *tmp; > - unsigned long flags; > - LIST_HEAD(list); > int ret; > > if (channel->primary_channel != NULL) > return -EINVAL; > > - /* Snapshot the list of subchannels */ > - spin_lock_irqsave(&channel->lock, flags); > - list_splice_init(&channel->sc_list, &list); > - channel->num_sc = 0; > - spin_unlock_irqrestore(&channel->lock, flags); > - > - list_for_each_entry_safe(cur_channel, tmp, &list, sc_list) { > + list_for_each_entry_safe(cur_channel, tmp, &channel->sc_list, sc_list) { > if (cur_channel->rescind) > wait_for_completion(&cur_channel->rescind_event); > Reviewed-by: Dexuan Cui 4.20-rc6 is out now. I hope this fix still has a chance to go in v4.20-final. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Before 98f4c651762c, we returned zeros for unopened channels. With 98f4c651762c, we started to return random on-stack values. We'd better return -EINVAL instead. Fixes: 98f4c651762c ("hv: move ringbuffer bus attributes to dev_groups") Cc: sta...@vger.kernel.org Cc: K. Y. Srinivasan Cc: Haiyang Zhang Cc: Stephen Hemminger Signed-off-by: Dexuan Cui --- drivers/hv/vmbus_drv.c | 20 1 file changed, 20 insertions(+) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 283d184..d0ff656 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -316,6 +316,8 @@ static ssize_t out_intr_mask_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); return sprintf(buf, "%d\n", outbound.current_interrupt_mask); } @@ -329,6 +331,8 @@ static ssize_t out_read_index_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); return sprintf(buf, "%d\n", outbound.current_read_index); } @@ -343,6 +347,8 @@ static ssize_t out_write_index_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); return sprintf(buf, "%d\n", outbound.current_write_index); } @@ -357,6 +363,8 @@ static ssize_t out_read_bytes_avail_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); return sprintf(buf, "%d\n", outbound.bytes_avail_toread); } @@ -371,6 +379,8 @@ static ssize_t out_write_bytes_avail_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); return sprintf(buf, "%d\n", outbound.bytes_avail_towrite); } @@ -384,6 +394,8 @@ static ssize_t in_intr_mask_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->inbound, &inbound); return sprintf(buf, "%d\n", inbound.current_interrupt_mask); } @@ -397,6 +409,8 @@ static ssize_t in_read_index_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->inbound, &inbound); return sprintf(buf, "%d\n", inbound.current_read_index); } @@ -410,6 +424,8 @@ static ssize_t in_write_index_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->inbound, &inbound); return sprintf(buf, "%d\n", inbound.current_write_index); } @@ -424,6 +440,8 @@ static ssize_t in_read_bytes_avail_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->inbound, &inbound); return sprintf(buf, "%d\n", inbound.bytes_avail_toread); } @@ -438,6 +456,8 @@ static ssize_t in_write_bytes_avail_show(struct device *dev, if (!hv_dev->channel) return -ENODEV; + if (hv_dev->channel->state != CHANNEL_OPENED_STATE) + return -EINVAL; hv_ringbuffer_get_debuginfo(&hv_dev->channel->inbound, &inbound); return sprintf(buf, "%d\n", inbound.bytes_avail_towrite); } -- 2.7.4 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
> From: Stephen Hemminger > On Thu, 13 Dec 2018 16:35:43 +0000 > Dexuan Cui wrote: > > > Before 98f4c651762c, we returned zeros for unopened channels. > > With 98f4c651762c, we started to return random on-stack values. > > > > We'd better return -EINVAL instead. > > The concept looks fine, but maybe it would be simpler to move it into > hv_ringbuffer_get_debuginfo and have it return an error code. > > Since so much of the code is repeated, I would probably make a > macro which generates the code as well. > > Something like this: Thanks, Stephen! Now the patch has been in char-misc's char-misc-linus branch, so IMO we may as well leave it as is (considering the code here is unlikely to be frqeuencly changed), and we have a smaller patch this way. :-) But, yes, I agree with you that generally we should make a common function to avoid duplicate code. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
> From: Stephen Hemminger > Sent: Monday, December 17, 2018 10:17 AM > To: Dexuan Cui > > On Mon, 17 Dec 2018 18:00:29 +0000 > Dexuan Cui wrote: > > > > From: Stephen Hemminger > > > On Thu, 13 Dec 2018 16:35:43 + > > > Dexuan Cui wrote: > > > > > > > Before 98f4c651762c, we returned zeros for unopened channels. > > > > With 98f4c651762c, we started to return random on-stack values. > > > > > > > > We'd better return -EINVAL instead. > > > > > > The concept looks fine, but maybe it would be simpler to move it into > > > hv_ringbuffer_get_debuginfo and have it return an error code. > > > > > > Since so much of the code is repeated, I would probably make a > > > macro which generates the code as well. > > > > > > Something like this: > > > > Thanks, Stephen! Now the patch has been in char-misc's char-misc-linus > > branch, so IMO we may as well leave it as is (considering the code here is > > unlikely to be frqeuencly changed), and we have a smaller patch this way. > > :-) > > > > But, yes, I agree with you that generally we should make a common > > function to avoid duplicate code. > > > > Thanks, > > -- Dexuan > > The old code was risky because it would silently return stack garbage. > Having an error check in get_debuginfo would eliminate that. OK, then let me make another patch based on the latest char-misc-linus. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
> From: devel On Behalf Of > Dexuan Cui > Sent: Monday, December 17, 2018 10:31 AM > > From: Stephen Hemminger > > > > The old code was risky because it would silently return stack garbage. > > Having an error check in get_debuginfo would eliminate that. > > OK, then let me make another patch based on the latest char-misc-linus. > > -- Dexuan Hi Stephen, your patch can apply cleanly. Let me rebase your patch to char-misc-linus, do a test, and then post it with your Signed-off-by and mine: I assume you're Ok with this. Please let me know in case it's not. :-) Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] Drivers: hv: vmbus: Check for ring when getting debug info
fc96df16a1ce is good and can already fix the "return stack garbage" issue, but let's also improve hv_ringbuffer_get_debuginfo(), which would silently return stack garbage, if people forget to check channel->state or ring_info->ring_buffer, when using the function in the future. Having an error check in the function would eliminate the potential risk. Add a Fixes tag to indicate the patch depdendency. Fixes: fc96df16a1ce ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels") Cc: sta...@vger.kernel.org Cc: K. Y. Srinivasan Cc: Haiyang Zhang Signed-off-by: Stephen Hemminger Signed-off-by: Dexuan Cui --- *NOTE*: the patch is based on char-misc's char-misc-linus branch. drivers/hv/ring_buffer.c | 31 - drivers/hv/vmbus_drv.c | 91 include/linux/hyperv.h | 5 +-- 3 files changed, 79 insertions(+), 48 deletions(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 64d0c85..1f1a55e 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -164,26 +164,25 @@ hv_get_ringbuffer_availbytes(const struct hv_ring_buffer_info *rbi, } /* Get various debug metrics for the specified ring buffer. */ -void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info, -struct hv_ring_buffer_debug_info *debug_info) +int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info, + struct hv_ring_buffer_debug_info *debug_info) { u32 bytes_avail_towrite; u32 bytes_avail_toread; - if (ring_info->ring_buffer) { - hv_get_ringbuffer_availbytes(ring_info, - &bytes_avail_toread, - &bytes_avail_towrite); - - debug_info->bytes_avail_toread = bytes_avail_toread; - debug_info->bytes_avail_towrite = bytes_avail_towrite; - debug_info->current_read_index = - ring_info->ring_buffer->read_index; - debug_info->current_write_index = - ring_info->ring_buffer->write_index; - debug_info->current_interrupt_mask = - ring_info->ring_buffer->interrupt_mask; - } + if (!ring_info->ring_buffer) + return -EINVAL; + + hv_get_ringbuffer_availbytes(ring_info, +&bytes_avail_toread, +&bytes_avail_towrite); + debug_info->bytes_avail_toread = bytes_avail_toread; + debug_info->bytes_avail_towrite = bytes_avail_towrite; + debug_info->current_read_index = ring_info->ring_buffer->read_index; + debug_info->current_write_index = ring_info->ring_buffer->write_index; + debug_info->current_interrupt_mask + = ring_info->ring_buffer->interrupt_mask; + return 0; } EXPORT_SYMBOL_GPL(hv_ringbuffer_get_debuginfo); diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index d0ff656..403fee0 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -313,12 +313,16 @@ static ssize_t out_intr_mask_show(struct device *dev, { struct hv_device *hv_dev = device_to_hv_device(dev); struct hv_ring_buffer_debug_info outbound; + int ret; if (!hv_dev->channel) return -ENODEV; - if (hv_dev->channel->state != CHANNEL_OPENED_STATE) - return -EINVAL; - hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); + + ret = hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, + &outbound); + if (ret < 0) + return ret; + return sprintf(buf, "%d\n", outbound.current_interrupt_mask); } static DEVICE_ATTR_RO(out_intr_mask); @@ -328,12 +332,15 @@ static ssize_t out_read_index_show(struct device *dev, { struct hv_device *hv_dev = device_to_hv_device(dev); struct hv_ring_buffer_debug_info outbound; + int ret; if (!hv_dev->channel) return -ENODEV; - if (hv_dev->channel->state != CHANNEL_OPENED_STATE) - return -EINVAL; - hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, &outbound); + + ret = hv_ringbuffer_get_debuginfo(&hv_dev->channel->outbound, + &outbound); + if (ret < 0) + return ret; return sprintf(buf, "%d\n", outbound.current_read_index); } static DEVICE_ATTR_RO(out_read_index); @@ -344,12 +351,15 @@ static ssize_t out_write_index_show(struct device *dev, { struct hv_device *hv_dev = device_to_hv_device(dev)
[PATCH][re-post] vmbus: fix subchannel removal
The changes to split ring allocation from open/close, broke the cleanup of subchannels. This resulted in problems using uio on network devices because the subchannel was left behind when the network device was unbound. The cause was in the disconnect logic which used list splice to move the subchannel list into a local variable. This won't work because the subchannel list is needed later during the process of the rescind messages (relid2channel). The fix is to just leave the subchannel list in place which is what the original code did. The list is cleaned up later when the host rescind is processed. Without the fix, we have a lot of "hang" issues in netvsc when we try to change the NIC's MTU, set the number of channels, etc. Fixes: ae6935ed7d42 ("vmbus: split ring buffer allocation from open") Cc: sta...@vger.kernel.org Signed-off-by: Stephen Hemminger Signed-off-by: Dexuan Cui --- The patch was firstly posted on Dec-7 2018: https://www.spinics.net/lists/linux-driver-devel/msg120802.html but it looks it's neglected. Now let me rebase it to v5.0-rc1: the line "channel->num_sc = 0;" in the original patch must be removed due to 4d3c5c69191f ("Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel()") When the patch is backported to v4.20.1, there will be a conflict because v4.20.1 doesn't have 4d3c5c69191f. I suggest we cherry-pick 4d3c5c69191f into v4.20.1, before we backport this patch. drivers/hv/channel.c | 9 + 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index ce0ba20..bea4c98 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -701,19 +701,12 @@ static int vmbus_close_internal(struct vmbus_channel *channel) int vmbus_disconnect_ring(struct vmbus_channel *channel) { struct vmbus_channel *cur_channel, *tmp; - unsigned long flags; - LIST_HEAD(list); int ret; if (channel->primary_channel != NULL) return -EINVAL; - /* Snapshot the list of subchannels */ - spin_lock_irqsave(&channel->lock, flags); - list_splice_init(&channel->sc_list, &list); - spin_unlock_irqrestore(&channel->lock, flags); - - list_for_each_entry_safe(cur_channel, tmp, &list, sc_list) { + list_for_each_entry_safe(cur_channel, tmp, &channel->sc_list, sc_list) { if (cur_channel->rescind) wait_for_completion(&cur_channel->rescind_event); -- 2.7.4 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] Drivers: hv: vmbus: Expose counters for interrupts and full conditions
> From: Kimberly Brown > Sent: Friday, January 4, 2019 8:35 PM ... > +What: > /sys/bus/vmbus/devices//channels//intr_in_full > +Date: January 2019 > +KernelVersion: 4.21 There is no 4.21 version: see https://lkml.org/lkml/2019/1/6/178 :-) Thanks! -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] Drivers: hv: vmbus: Expose counters for interrupts and full conditions
> From: Michael Kelley > Sent: Saturday, January 5, 2019 1:01 PM > > From: Kimberly Brown Sent: Friday, January 4, > > 2019 8:35 PM > > > > static inline void set_channel_pending_send_size(struct vmbus_channel *c, > > u32 size) > > { > > + if (size) { > > + ++c->out_full_total; > > + > > + if (!c->out_full_flag) { > > + ++c->out_full_first; > > + c->out_full_flag = true; > > + } > > + } else { > > + c->out_full_flag = false; > > + } > > + > > c->outbound.ring_buffer->pending_send_sz = size; > > } > > > > I think there may be an atomicity problem with the above code. I looked > in the hv_sock code, and didn't see any locks being held when > set_channel_pending_send_size() is called. The original code doesn't need > a lock because it is just storing a single value into pending_send_sz. > > In the similar code in hv_ringbuffer_write(), the ring buffer spin lock is > held > while the counts are incremented and the out_full_flag is maintained, so all > is > good there. But some locking may be needed here. Dexuan knows the > hv_sock > code best and can comment on whether there is any higher level > synchronization > that prevents multiple threads from running the above code on the same > channel. > Even if there is such higher level synchronization, this code probably > shouldn't > depend on it for correctness. > > Michael Yes, there is indeed a higher level per-"sock" lock, e.g. in the code path vsock_stream_sendmsg() -> vsock_stream_has_space() -> transport->stream_has_space() -> hvs_stream_has_space() -> hvs_set_channel_pending_send_size(), we acquire the lock by lock_sock(sk) at the beginning of vsock_stream_sendmsg(), and call release_sock(sk) at the end of the function. So we don't have an atomicity issue here for hv_sock, which is the only user of set_channel_pending_send_size(), so far. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v3] Drivers: hv: vmbus: Expose counters for interrupts and full conditions
> From: Kimberly Brown > Sent: Wednesday, January 16, 2019 8:38 PM > To: Michael Kelley ; Long Li > ; Sasha Levin ; > Dexuan Cui > Cc: KY Srinivasan ; Haiyang Zhang > ; Stephen Hemminger > ; de...@linuxdriverproject.org; > linux-ker...@vger.kernel.org > Subject: [PATCH v3] Drivers: hv: vmbus: Expose counters for interrupts and > full > conditions > > Counter values for per-channel interrupts and ring buffer full > conditions are useful for investigating performance. > > Expose counters in sysfs for 2 types of guest to host interrupts: > 1) Interrupts caused by the channel's outbound ring buffer transitioning > from empty to not empty > 2) Interrupts caused by the channel's inbound ring buffer transitioning > from full to not full while a packet is waiting for enough buffer space to > become available > > Expose 2 counters in sysfs for the number of times that write operations > encountered a full outbound ring buffer: > 1) The total number of write operations that encountered a full > condition > 2) The number of write operations that were the first to encounter a > full condition > > I tested this patch by confirming that the sysfs files were created and > observing the counter values. The values seemed to increase by a > reasonable amount when the Hyper-v related drivers were in use. > > Signed-off-by: Kimberly Brown > --- > Changes in v3: > - Used the outbound ring buffer spinlock to protect the the full >condition counters in set_channel_pending_send_size() > - Corrected the KernelVersion values for the new entries in >Documentation/ABI/stable/sysfs-bus-vmbus > > Changes in v2: > - Added mailing lists to the cc list > - Removed the host to guest interrupt counters proposed in v1 because >they were not accurate > - Added full condition counters for the channel's outbound ring buffer > > Documentation/ABI/stable/sysfs-bus-vmbus | 33 > drivers/hv/ring_buffer.c | 14 - > drivers/hv/vmbus_drv.c | 32 > include/linux/hyperv.h | 38 > > 4 files changed, 116 insertions(+), 1 deletion(-) > > diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus > b/Documentation/ABI/stable/sysfs-bus-vmbus > index 3fed8fdb873d..a0304c563467 100644 > --- a/Documentation/ABI/stable/sysfs-bus-vmbus > +++ b/Documentation/ABI/stable/sysfs-bus-vmbus > @@ -146,3 +146,36 @@ KernelVersion: 4.16 > Contact: Stephen Hemminger > Description: Binary file created by uio_hv_generic for ring buffer > Users: Userspace drivers > + > +What: > /sys/bus/vmbus/devices//channels//intr_in_full > +Date: January 2019 > +KernelVersion: 5.0 > +Contact:Michael Kelley > +Description:Number of guest to host interrupts caused by the inbound > ring > + buffer transitioning from full to not full while a packet is > + waiting for buffer space to become available > +Users: Debugging tools > + > +What: > /sys/bus/vmbus/devices//channels//intr_out_empty > +Date: January 2019 > +KernelVersion: 5.0 > +Contact:Michael Kelley > +Description:Number of guest to host interrupts caused by the outbound > ring > + buffer transitioning from empty to not empty > +Users: Debugging tools > + > +What: > /sys/bus/vmbus/devices//channels//out_full_first > +Date: January 2019 > +KernelVersion: 5.0 > +Contact:Michael Kelley > +Description:Number of write operations that were the first to encounter > an > + outbound ring buffer full condition > +Users: Debugging tools > + > +What: > /sys/bus/vmbus/devices//channels//out_full_total > +Date: January 2019 > +KernelVersion: 5.0 > +Contact:Michael Kelley > +Description:Total number of write operations that encountered an > outbound > + ring buffer full condition > +Users: Debugging tools > diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c > index 1f1a55e07733..9e8b31ccc142 100644 > --- a/drivers/hv/ring_buffer.c > +++ b/drivers/hv/ring_buffer.c > @@ -74,8 +74,10 @@ static void hv_signal_on_write(u32 old_write, struct > vmbus_channel *channel) >* This is the only case we need to signal when the >* ring transitions from being empty to non-empty. >*/ > - if (old_write == READ_ONCE(rbi->ring_buffer->read_index)) > + if (old_write == READ_ONCE(rbi->ring_buffer->read_index)) { > + ++channel->intr_out_empty; > vmbus_set
RE: [PATCH] Drivers: hv: vmbus: Add mutex lock to channel show functions
> From: Kimberly Brown > Sent: Monday, January 21, 2019 6:08 PM > Subject: [PATCH] Drivers: hv: vmbus: Add mutex lock to channel show functions > > The channel level "_show" functions are vulnerable to race conditions. > Add a mutex lock and unlock around the call to the channel level "_show" > functions in vmbus_chan_attr_show(). > > This problem was discussed here: > https://lkml.org/lkml/2018/10/18/830 > > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -1414,6 +1414,7 @@ static ssize_t vmbus_chan_attr_show(struct kobject > *kobj, > = container_of(attr, struct vmbus_chan_attribute, attr); > const struct vmbus_channel *chan > = container_of(kobj, struct vmbus_channel, kobj); > + ssize_t ret; > > if (!attribute->show) > return -EIO; > @@ -1421,7 +1422,10 @@ static ssize_t vmbus_chan_attr_show(struct > kobject *kobj, > if (chan->state != CHANNEL_OPENED_STATE) > return -EINVAL; > > - return attribute->show(chan, buf); > + mutex_lock(&vmbus_connection.channel_mutex); > + ret = attribute->show(chan, buf); > + mutex_unlock(&vmbus_connection.channel_mutex); > + return ret; > } It looks this patch is already able to fix the original issue: 6712cc9c2211 ("vmbus: don't return values for uninitalized channels"), as chan->state can't be CHANNEL_OPENED_STATE when channel->ringbuffer_page is not set up yet, or has been freed. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Add mutex lock to channel show functions
> From: Kimberly Brown > Sent: Monday, January 21, 2019 10:43 PM > > > @@ -1421,7 +1422,10 @@ static ssize_t vmbus_chan_attr_show(struct > > > kobject *kobj, > > > if (chan->state != CHANNEL_OPENED_STATE) > > > return -EINVAL; > > > > > > - return attribute->show(chan, buf); > > > + mutex_lock(&vmbus_connection.channel_mutex); > > > + ret = attribute->show(chan, buf); > > > + mutex_unlock(&vmbus_connection.channel_mutex); > > > + return ret; > > > } > > > > It looks this patch is already able to fix the original issue: > > 6712cc9c2211 ("vmbus: don't return values for uninitalized channels"), > > as chan->state can't be CHANNEL_OPENED_STATE when > > channel->ringbuffer_page is not set up yet, or has been freed. > > -- Dexuan > > I think that patch 6712cc9c2211 fixes the problem when the channel is > not set up yet, but it doesn't fix the problem when the channel is being > closed Yeah, now I see your point. > The channel could be freed after the check that "chan->state" is > CHANNEL_OPENED_STATE, while the "attribute->show()" function is running. IMO the channel struct itself can't be freed while the "attribute->show()" function is running, because I suppose the sysfs interface should have an extra reference to channel->kobj (see vmbus_add_channel_kobj()), so before the sysfs files are removed, the channel struct itself can't disappear. (I didn't check the related code very carefully, so I could be wrong. :-) ) But as you pointed, at least for sub-channels, channel->ringbuffer_page can indeed disappear in vmbus_close() -> ... -> vmbus_free_ring(), and the "attribute->show()" could crash when the race happens. Adding channel_mutex here seems to be able to fix the race for sub-channels, as the same channel_mutex is used in vmbus_disconnect_ring(). For a primary channel, vmbus_close() -> vmbus_free_ring() can still free channel->ringbuffer_page without notifying the "attribute->show()". We may also need to acquire/release the channel_mutex in vmbus_close()? > Actually, there should be checks that "chan" is not null and that > "chan->state" is CHANNEL_OPENED_STATE within the locked section. I'll > need to fix that. I suppose "chan" can not be NULL here (see the above). Checking "chan->state" may not help to completely fix the race, because there is no locking/synchronization code in vmbus_close_internal() when we test and change "channel->state". I guess we may need to check if channel->ringbuffer_page is NULL in the "attribute->show()". PS, to prove that a race condition does exist and can really cause a panic or something, I usually add some msleep() delays in different paths so that I can reproduce the crash every time I take a special action, e.g. when I read the sysfs files of a NIC, I try to remove hv_netvsc driver. This way, I can prove a patch can indeed help, at least it can fix the crash which would happen without the patch. :-) -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] nfit: add Hyper-V NVDIMM DSM command set to white list
Add the Hyper-V _DSM command set to the white list of NVDIMM command sets. This command set is documented at http://www.uefi.org/RFIC_LIST (see the link to "Virtual NVDIMM 0x1901" on the page). Signed-off-by: Dexuan Cui --- I'm going to change the user-space utility "ndctl" to support Hyper-V Virtual NVDIMM. This kernel patch is required first. drivers/acpi/nfit/core.c | 5 - drivers/acpi/nfit/nfit.h | 6 +- include/uapi/linux/ndctl.h | 1 + 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 011d3db19c80..fb48cb17a519 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -1840,7 +1840,7 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, dev_set_drvdata(&adev_dimm->dev, nfit_mem); /* -* Until standardization materializes we need to consider 4 +* Until standardization materializes we need to consider 5 * different command sets. Note, that checking for function0 (bit0) * tells us if any commands are reachable through this GUID. */ @@ -1865,6 +1865,8 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, dsm_mask &= ~(1 << 8); } else if (nfit_mem->family == NVDIMM_FAMILY_MSFT) { dsm_mask = 0x; + } else if (nfit_mem->family == NVDIMM_FAMILY_HYPERV) { + dsm_mask = 0x1f; } else { dev_dbg(dev, "unknown dimm command family\n"); nfit_mem->family = -1; @@ -3707,6 +3709,7 @@ static __init int nfit_init(void) guid_parse(UUID_NFIT_DIMM_N_HPE1, &nfit_uuid[NFIT_DEV_DIMM_N_HPE1]); guid_parse(UUID_NFIT_DIMM_N_HPE2, &nfit_uuid[NFIT_DEV_DIMM_N_HPE2]); guid_parse(UUID_NFIT_DIMM_N_MSFT, &nfit_uuid[NFIT_DEV_DIMM_N_MSFT]); + guid_parse(UUID_NFIT_DIMM_N_HYPERV, &nfit_uuid[NFIT_DEV_DIMM_N_HYPERV]); nfit_wq = create_singlethread_workqueue("nfit"); if (!nfit_wq) diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h index 33691aecfcee..9194022f29f3 100644 --- a/drivers/acpi/nfit/nfit.h +++ b/drivers/acpi/nfit/nfit.h @@ -34,11 +34,14 @@ /* https://msdn.microsoft.com/library/windows/hardware/mt604741 */ #define UUID_NFIT_DIMM_N_MSFT "1ee68b36-d4bd-4a1a-9a16-4f8e53d46e05" +/* http://www.uefi.org/RFIC_LIST */ +#define UUID_NFIT_DIMM_N_HYPERV "5746c5f2-a9a2-4264-ad0e-e4ddc9e09e80" + #define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \ | ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \ | ACPI_NFIT_MEM_NOT_ARMED | ACPI_NFIT_MEM_MAP_FAILED) -#define NVDIMM_FAMILY_MAX NVDIMM_FAMILY_MSFT +#define NVDIMM_FAMILY_MAX NVDIMM_FAMILY_HYPERV #define NVDIMM_STANDARD_CMDMASK \ (1 << ND_CMD_SMART | 1 << ND_CMD_SMART_THRESHOLD | 1 << ND_CMD_DIMM_FLAGS \ @@ -94,6 +97,7 @@ enum nfit_uuids { NFIT_DEV_DIMM_N_HPE1 = NVDIMM_FAMILY_HPE1, NFIT_DEV_DIMM_N_HPE2 = NVDIMM_FAMILY_HPE2, NFIT_DEV_DIMM_N_MSFT = NVDIMM_FAMILY_MSFT, + NFIT_DEV_DIMM_N_HYPERV = NVDIMM_FAMILY_HYPERV, NFIT_SPA_VOLATILE, NFIT_SPA_PM, NFIT_SPA_DCR, diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h index f57c9e434d2d..de5d90212409 100644 --- a/include/uapi/linux/ndctl.h +++ b/include/uapi/linux/ndctl.h @@ -243,6 +243,7 @@ struct nd_cmd_pkg { #define NVDIMM_FAMILY_HPE1 1 #define NVDIMM_FAMILY_HPE2 2 #define NVDIMM_FAMILY_MSFT 3 +#define NVDIMM_FAMILY_HYPERV 4 #define ND_IOCTL_CALL _IOWR(ND_IOCTL, ND_CMD_CALL,\ struct nd_cmd_pkg) -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Monday, January 28, 2019 1:01 PM > > Hi Dexuan, > Looks good. Just one update request and a note below... > > On Wed, Jan 23, 2019 at 12:51 PM Dexuan Cui wrote: > > ... > > --- a/drivers/acpi/nfit/core.c > > +++ b/drivers/acpi/nfit/core.c > > @@ -1840,7 +1840,7 @@ static int acpi_nfit_add_dimm(struct > acpi_nfit_desc *acpi_desc, > > dev_set_drvdata(&adev_dimm->dev, nfit_mem); > > > > /* > > -* Until standardization materializes we need to consider 4 > > +* Until standardization materializes we need to consider 5 > > * different command sets. Note, that checking for function0 > (bit0) > > * tells us if any commands are reachable through this GUID. > > */ > > This comment is stale. This "HYPERV" family is the first example of > the "right" way to define a new NVDIMM command set. Lets update it to > mention the process and considerations for submitting new command sets > to UEFI (http://www.uefi.org/RFIC_LIST). The fact that there's now a > defined process is a step forward from when this comment was initially > written. Also, the fact that the process encourages "adopt" vs > "supersede" addresses the main concern about vendor-specific > command-set proliferation. I made the below simple change. Is this enough? Please suggest the proper wording/sentence, as I'm relatively new to NVDIMM, and I don't really know the history of the standardization process. --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -1732,8 +1732,10 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, dev_set_drvdata(&adev_dimm->dev, nfit_mem); /* -* Until standardization materializes we need to consider 4 -* different command sets. Note, that checking for function0 (bit0) +* New command sets should be submitted to UEFI +* http://www.uefi.org/RFIC_LIST. +* +* Note, that checking for function0 (bit0) * tells us if any commands are reachable through this GUID. */ for (i = 0; i <= NVDIMM_FAMILY_MAX; i++) > > @@ -1865,6 +1865,8 @@ static int acpi_nfit_add_dimm(struct > acpi_nfit_desc *acpi_desc, > > dsm_mask &= ~(1 << 8); > > } else if (nfit_mem->family == NVDIMM_FAMILY_MSFT) { > > dsm_mask = 0x; > > + } else if (nfit_mem->family == NVDIMM_FAMILY_HYPERV) { > > + dsm_mask = 0x1f; > > Just a note that starting with commit 5e9e38d0db1d "acpi/nfit: Block > function zero DSMs", bit0 in this mask will be cleared to ensure that > only the kernel has the ability to probe for supported DSM functions. Thanks for the note! Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Monday, January 28, 2019 1:55 PM > > On Mon, Jan 28, 2019 at 1:40 PM Dexuan Cui wrote: > > > I made the below simple change. Is this enough? Please suggest the proper > > wording/sentence, as I'm relatively new to NVDIMM, and I don't really know > the history of the standardization process. > > How about something a bit more relevant for the code in question: > > There are 4 "legacy" NVDIMM command sets > (NVDIMM_FAMILY_{INTEL,MSFT,HPE1,HPE2}) that were created before an EFI > working group was established to constrain this proliferation. The > nfit driver probes for the supported command set by GUID. Note, If > you're a platform developer looking to add a new command set to this > probe consider using an existing set, or otherwise seek approval to > publish the command set at > http://www.uefi.org/RFIC_LIST Looks perfect! Let me use this, rebase the patch to https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/tag/?h=libnvdimm-fixes-5.0-rc4 and post a v2 later today. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
Add the Hyper-V _DSM command set to the white list of NVDIMM command sets. This command set is documented at http://www.uefi.org/RFIC_LIST (see "Virtual NVDIMM 0x1901"). Thanks Dan Williams for writing the comment change. Signed-off-by: Dexuan Cui Reviewed-by: Michael Kelley --- Changes in v2: Updated the comment and changelog (Thanks, Dan!) Rebased to the tag libnvdimm-fixes-5.0-rc4 of the nvdimm tree. drivers/acpi/nfit/core.c | 17 ++--- drivers/acpi/nfit/nfit.h | 6 +- include/uapi/linux/ndctl.h | 1 + 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index e18ade5d74e9..a9270c99be72 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -1861,9 +1861,17 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, dev_set_drvdata(&adev_dimm->dev, nfit_mem); /* -* Until standardization materializes we need to consider 4 -* different command sets. Note, that checking for function0 (bit0) -* tells us if any commands are reachable through this GUID. +* There are 4 "legacy" NVDIMM command sets +* (NVDIMM_FAMILY_{INTEL,MSFT,HPE1,HPE2}) that were created before +* an EFI working group was established to constrain this +* proliferation. The nfit driver probes for the supported command +* set by GUID. Note, if you're a platform developer looking to add +* a new command set to this probe, consider using an existing set, +* or otherwise seek approval to publish the command set at +* http://www.uefi.org/RFIC_LIST. +* +* Note, that checking for function0 (bit0) tells us if any commands +* are reachable through this GUID. */ for (i = 0; i <= NVDIMM_FAMILY_MAX; i++) if (acpi_check_dsm(adev_dimm->handle, to_nfit_uuid(i), 1, 1)) @@ -1886,6 +1894,8 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, dsm_mask &= ~(1 << 8); } else if (nfit_mem->family == NVDIMM_FAMILY_MSFT) { dsm_mask = 0x; + } else if (nfit_mem->family == NVDIMM_FAMILY_HYPERV) { + dsm_mask = 0x1f; } else { dev_dbg(dev, "unknown dimm command family\n"); nfit_mem->family = -1; @@ -3729,6 +3739,7 @@ static __init int nfit_init(void) guid_parse(UUID_NFIT_DIMM_N_HPE1, &nfit_uuid[NFIT_DEV_DIMM_N_HPE1]); guid_parse(UUID_NFIT_DIMM_N_HPE2, &nfit_uuid[NFIT_DEV_DIMM_N_HPE2]); guid_parse(UUID_NFIT_DIMM_N_MSFT, &nfit_uuid[NFIT_DEV_DIMM_N_MSFT]); + guid_parse(UUID_NFIT_DIMM_N_HYPERV, &nfit_uuid[NFIT_DEV_DIMM_N_HYPERV]); nfit_wq = create_singlethread_workqueue("nfit"); if (!nfit_wq) diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h index 33691aecfcee..4de167b4f76f 100644 --- a/drivers/acpi/nfit/nfit.h +++ b/drivers/acpi/nfit/nfit.h @@ -34,11 +34,14 @@ /* https://msdn.microsoft.com/library/windows/hardware/mt604741 */ #define UUID_NFIT_DIMM_N_MSFT "1ee68b36-d4bd-4a1a-9a16-4f8e53d46e05" +/* http://www.uefi.org/RFIC_LIST (see "Virtual NVDIMM 0x1901") */ +#define UUID_NFIT_DIMM_N_HYPERV "5746c5f2-a9a2-4264-ad0e-e4ddc9e09e80" + #define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \ | ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \ | ACPI_NFIT_MEM_NOT_ARMED | ACPI_NFIT_MEM_MAP_FAILED) -#define NVDIMM_FAMILY_MAX NVDIMM_FAMILY_MSFT +#define NVDIMM_FAMILY_MAX NVDIMM_FAMILY_HYPERV #define NVDIMM_STANDARD_CMDMASK \ (1 << ND_CMD_SMART | 1 << ND_CMD_SMART_THRESHOLD | 1 << ND_CMD_DIMM_FLAGS \ @@ -94,6 +97,7 @@ enum nfit_uuids { NFIT_DEV_DIMM_N_HPE1 = NVDIMM_FAMILY_HPE1, NFIT_DEV_DIMM_N_HPE2 = NVDIMM_FAMILY_HPE2, NFIT_DEV_DIMM_N_MSFT = NVDIMM_FAMILY_MSFT, + NFIT_DEV_DIMM_N_HYPERV = NVDIMM_FAMILY_HYPERV, NFIT_SPA_VOLATILE, NFIT_SPA_PM, NFIT_SPA_DCR, diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h index f57c9e434d2d..de5d90212409 100644 --- a/include/uapi/linux/ndctl.h +++ b/include/uapi/linux/ndctl.h @@ -243,6 +243,7 @@ struct nd_cmd_pkg { #define NVDIMM_FAMILY_HPE1 1 #define NVDIMM_FAMILY_HPE2 2 #define NVDIMM_FAMILY_MSFT 3 +#define NVDIMM_FAMILY_HYPERV 4 #define ND_IOCTL_CALL _IOWR(ND_IOCTL, ND_CMD_CALL,\ struct nd_cmd_pkg) -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Add mutex lock to channel show functions
> From: Kimberly Brown > > ... > > But as you pointed, at least for sub-channels, channel->ringbuffer_page > > can indeed disappear in vmbus_close() -> ... -> vmbus_free_ring(), and > > the "attribute->show()" could crash when the race happens. > > Adding channel_mutex here seems to be able to fix the race for > > sub-channels, as the same channel_mutex is used in > vmbus_disconnect_ring(). > > > > For a primary channel, vmbus_close() -> vmbus_free_ring() can still > > free channel->ringbuffer_page without notifying the "attribute->show()". > > We may also need to acquire/release the channel_mutex in vmbus_close()? > > > > > Actually, there should be checks that "chan" is not null and that > > > "chan->state" is CHANNEL_OPENED_STATE within the locked section. I'll > > > need to fix that. > > I suppose "chan" can not be NULL here (see the above). > > > > Checking "chan->state" may not help to completely fix the race, because > > there is no locking/synchronization code in > > vmbus_close_internal() when we test and change "channel->state". > > > > The calls to vmbus_close_internal() for the subchannels and the primary > channel are protected with channel_mutex in vmbus_disconnect_ring(). > This prevents "channel->state" from changing while "attribute->show()" is > running. Ah, I think you're right. > > I guess we may need to check if channel->ringbuffer_page is NULL in > > the "attribute->show()". > > > > For the primary channel, vmbus_free_ring() is called after the > return from vmbus_disconnect_ring(). Therefore, the primary channel's > state is changed before "channel->ringbuffer_page" is set to NULL. > Checking the channel state should be sufficient to prevent the ring > buffers from being freed while "attribute->show()" is running. The > ring buffers can't be freed until the channel's state is changed, and > the channel state change is protected by the mutex. I think you're right (I noticed in a previous mail you mentioned you would improve your patch to check "chan->state" with the mutax held). > I think checking that "channel->ringbuffer_page" is not NULL would > also work, but, as you stated, we would need to aquire/release > channel_mutex in vmbus_close(). Then it looks unnecessary to check "channel->ringbuffer_page". > > PS, to prove that a race condition does exist and can really cause a panic > > or > > something, I usually add some msleep() delays in different paths so that I > > can reproduce the crash every time I take a special action, e.g. when I read > > the sysfs files of a NIC, I try to remove hv_netvsc driver. This way, I can > > prove > > a patch can indeed help, at least it can fix the crash which would happen > > without the patch. :-) > > > > Thanks! I was able to free the ring buffers while "attribute->show()" > was running, which caused a null pointer dereference bug. As expected, > the mutex lock fixed it. Awesome! -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] nfit: acpi_nfit_ctl(): check out_obj->type in the right place
In the case of ND_CMD_CALL, we should also check out_obj->type. The patch uses out_obj->type, which is a short alias to out_obj->package.type. Fixes: 31eca76ba2fc ("nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism") Cc: Signed-off-by: Dexuan Cui --- drivers/acpi/nfit/core.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 0a49c57334cc..1fa378435dd2 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -554,6 +554,13 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm, return -EINVAL; } + if (out_obj->type != ACPI_TYPE_BUFFER) { + dev_dbg(dev, "%s unexpected output object type cmd: %s type: %d\n", + dimm_name, cmd_name, out_obj->type); + rc = -EINVAL; + goto out; + } + if (call_pkg) { call_pkg->nd_fw_size = out_obj->buffer.length; memcpy(call_pkg->nd_payload + call_pkg->nd_size_in, @@ -572,13 +579,6 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm, return 0; } - if (out_obj->package.type != ACPI_TYPE_BUFFER) { - dev_dbg(dev, "%s unexpected output object type cmd: %s type: %d\n", - dimm_name, cmd_name, out_obj->type); - rc = -EINVAL; - goto out; - } - dev_dbg(dev, "%s cmd: %s output length: %d\n", dimm_name, cmd_name, out_obj->buffer.length); print_hex_dump_debug(cmd_name, DUMP_PREFIX_OFFSET, 4, 4, -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] nfit: Collect shutdown status for NVDIMM_FAMILY_HYPERV
See http://www.uefi.org/RFIC_LIST ("Virtual NVDIMM 0x1901"): "Get Unsafe Shutdown Count (Function Index 2)". Let's expose the info to the userspace (e.g. ntctl) via sysfs. Signed-off-by: Dexuan Cui --- drivers/acpi/nfit/core.c | 51 ++ drivers/acpi/nfit/hyperv.h | 26 +++ 2 files changed, 77 insertions(+) create mode 100644 drivers/acpi/nfit/hyperv.h diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index d5a164b6833a..d504da34ce34 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -25,6 +25,7 @@ #include #include #include "intel.h" +#include "hyperv.h" #include "nfit.h" /* @@ -1802,6 +1803,54 @@ __weak void nfit_intel_shutdown_status(struct nfit_mem *nfit_mem) } } +__weak void nfit_hyperv_shutdown_status(struct nfit_mem *nfit_mem) +{ + struct device *dev = &nfit_mem->adev->dev; + struct nd_hyperv_shutdown_status status; + union acpi_object in_buf = { + .buffer.type = ACPI_TYPE_BUFFER, + .buffer.length = 0, + }; + union acpi_object in_obj = { + .package.type = ACPI_TYPE_PACKAGE, + .package.count = 1, + .package.elements = &in_buf, + }; + const u8 func = ND_HYPERV_GET_UNSAFE_SHUTDOWN_COUNT; + const guid_t *guid = to_nfit_uuid(nfit_mem->family); + u8 revid = nfit_dsm_revid(nfit_mem->family, func); + struct acpi_device *adev = nfit_mem->adev; + acpi_handle handle = adev->handle; + union acpi_object *out_obj; + + if ((nfit_mem->dsm_mask & BIT(func)) == 0) + return; + + out_obj = acpi_evaluate_dsm(handle, guid, revid, func, &in_obj); + if (!out_obj || out_obj->type != ACPI_TYPE_BUFFER +|| out_obj->buffer.length < sizeof(status)) { + dev_dbg(dev->parent, + "%s: failed to get Unsafe Shutdown Count\n", + dev_name(dev)); + ACPI_FREE(out_obj); + return; + } + + memcpy(&status, out_obj->buffer.pointer, sizeof(status)); + ACPI_FREE(out_obj); + + if (status.general_err != 0) { + dev_dbg(dev->parent, + "%s: failed to get Unsafe Shutdown Count: err=0x%x\n", + dev_name(dev), status.status); + return; + } + + set_bit(NFIT_MEM_DIRTY_COUNT, &nfit_mem->flags); + nfit_mem->dirty_shutdown = status.shutdown_count; +} + + static void populate_shutdown_status(struct nfit_mem *nfit_mem) { /* @@ -1811,6 +1860,8 @@ static void populate_shutdown_status(struct nfit_mem *nfit_mem) */ if (nfit_mem->family == NVDIMM_FAMILY_INTEL) nfit_intel_shutdown_status(nfit_mem); + else if (nfit_mem->family == NVDIMM_FAMILY_HYPERV) + nfit_hyperv_shutdown_status(nfit_mem); } static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc, diff --git a/drivers/acpi/nfit/hyperv.h b/drivers/acpi/nfit/hyperv.h new file mode 100644 index ..c4a25b8624cc --- /dev/null +++ b/drivers/acpi/nfit/hyperv.h @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright(c) 2019 Microsoft Corporation. All rights reserved. + * Hyper-V specific definitions for _DSM of Hyper-V Virtual NVDIMM + * + * See http://www.uefi.org/RFIC_LIST (Virtual NVDIMM 0x1901) + */ +#ifndef _NFIT_HYPERV_H_ +#define _NFIT_HYPERV_H_ + +#define ND_HYPERV_GET_UNSAFE_SHUTDOWN_COUNT 2 + +struct nd_hyperv_shutdown_status { + union { + u32 status; + struct { + u16 general_err; + u8 func_err; + u8 vendor_err; + }; + }; + u32 shutdown_count; +} __packed; + + +#endif -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] nfit: Collect shutdown status for NVDIMM_FAMILY_HYPERV
> From: Greg KH > Sent: Wednesday, January 30, 2019 11:38 AM > > On Wed, Jan 30, 2019 at 06:48:40PM +0000, Dexuan Cui wrote: > > > > Let's expose the info to the userspace (e.g. ntctl) via sysfs. > > If you add a new sysfs file, you need to add a new Documentation/ABI/ > update as well :( It's an existing sysfs node, which was originally added by Dan in Sep 2018: /sys/bus/nd/devices/nmem0/nfit/dirty_shutdown. Before this patch, the node doesn't appear when Linux runs on Hyper-V, and with this patch, the node can appear now. However, indeed, the node and the other related nodes have not been documented in Documentation/ABI/testing/sysfs-bus-nfit yet. I suppose Dan would add that? Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] nfit: Collect shutdown status for NVDIMM_FAMILY_HYPERV
> From: Linux-nvdimm On Behalf Of > Dexuan Cui > Sent: Wednesday, January 30, 2019 12:03 PM > To: Greg KH > Cc: Josh Poulson ; linux-nvd...@lists.01.org; > Haiyang Zhang ; > driverdev-devel@linuxdriverproject.org; Rafael J. Wysocki > ; linux-ker...@vger.kernel.org; Michael Kelley > ; Sasha Levin ; > linux-a...@vger.kernel.org; Ross Zwisler ; Stephen > Hemminger ; Len Brown > Subject: RE: [PATCH] nfit: Collect shutdown status for > NVDIMM_FAMILY_HYPERV > > > From: Greg KH > > Sent: Wednesday, January 30, 2019 11:38 AM > > > > On Wed, Jan 30, 2019 at 06:48:40PM +, Dexuan Cui wrote: > > > > > > Let's expose the info to the userspace (e.g. ntctl) via sysfs. > > > > If you add a new sysfs file, you need to add a new Documentation/ABI/ > > update as well :( > > It's an existing sysfs node, which was originally added by Dan in Sep 2018: > /sys/bus/nd/devices/nmem0/nfit/dirty_shutdown. > > Before this patch, the node doesn't appear when Linux runs on Hyper-V, > and with this patch, the node can appear now. > > However, indeed, the node and the other related nodes have not been > documented in Documentation/ABI/testing/sysfs-bus-nfit yet. > I suppose Dan would add that? Actually the other nodes have been documented, and only the "dirty_shutdown" is missed. It should be straightfoward. Let me make a patch for this. Thanks for the reminder, Greg! Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] nfit: Document sysfs interface dirty_shutdown
The new sysfs node was added in Sep 2018 in: commit 0ead11181fe0 ("acpi, nfit: Collect shutdown status") Now let's document it. Signed-off-by: Dexuan Cui --- Documentation/ABI/testing/sysfs-bus-nfit | 8 1 file changed, 8 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit index a1cb44dcb908..bd6cde8c0ea9 100644 --- a/Documentation/ABI/testing/sysfs-bus-nfit +++ b/Documentation/ABI/testing/sysfs-bus-nfit @@ -153,6 +153,14 @@ Description: subsystem controller vendor. +What: /sys/bus/nd/devices/nmemX/nfit/dirty_shutdown +Date: Sep, 2018 +KernelVersion: v4.20 +Contact: linux-nvd...@lists.01.org +Description: + (RO) Unsafe Shutdown Count for the NVDIMM device. + + What: /sys/bus/nd/devices/ndbusX/nfit/revision Date: Jun, 2015 KernelVersion: v4.2 -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: Add mutex lock to channel show functions
> From: Kimberly Brown > Sent: Thursday, January 31, 2019 9:47 AM > ... > 2) Prevent a deadlock that can occur between the proposed mutex_lock() > call in the vmbus_chan_attr_show() function and the sysfs/kernfs functions. Hi Kim, Can you please share more details about the deadlock? It's unclear to me how the deadlock happens. > I've identified two possible solutions to the deadlock: > > 1) Add a new mutex to the vmbus_channel struct which protects changes to > "channel->state". Use this new mutex in vmbus_chan_attr_show() instead of > "vmbus_connection.channel_mutex". > > 2) Use "vmbus_connection.channel_mutex" in vmbus_chan_attr_show() as > originally proposed, and acquire it with mutex_trylock(). If the mutex > cannot be acquired, return -EINVAL. It looks more like a workaround. IMO we should try to find a real fix. :-) > I'm leaning towards (2), using mutex_trylock(). > "vmbus_connection.channel_mutex" > appears to be used only when a channel is being opened or closed, so > vmbus_chan_attr_show() should be able to acquire the mutex when the > channel is in use. > > If anyone has suggestions, please let me know. > > Thanks, > Kim Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Friday, February 1, 2019 9:29 AM > > Hi Dan, > > Unluckily it looks this commit causes a regression ... > > With the patch, "ndctl list" shows nothing, and /dev/pmem0 can't appear. > > If I revert the patch, it will be back to normal. > > > > I attached the config/logs. In the bad case, "dmesg" shows a line > > [5.259017] nd_pmem namespace0.0: 0x, too small > must be at least 0x1000 > > Any idea why this happens? I'm digging into the details and I appreciate > > your > insights. > > Looks like it is working as expected. I was working on linux-next tree's next-20190107 and this patch did "work fine" there. The "regression" happens on djbw/nvdimm.git tree's libnvdimm-pending branch because we have this recent commit (Jan 19): 11189c1089da ("acpi/nfit: Fix command-supported detection"), which makes such a change in acpi_nfit_ctl(): - if (!test_bit(cmd, &cmd_mask) || !test_bit(func, &dsm_mask)) + if (cmd == ND_CMD_CALL && !test_bit(func, &dsm_mask)) + return -ENOTTY; + else if (!test_bit(cmd, &cmd_mask)) return -ENOTTY; So previously ND_CMD_GET_CONFIG_DATA fails with -ENOTTY and we're good. Now the command succeeds, but it looks the returned data is inavlid, and I see the "regression". > The regression you are seeing is the fact that the patch enables the kernel to > enable nvdimm-namespace-label reads. Yes. > Those reads find a namespace index block > and a label. Unfortunately the label has the LOCAL flag set and Linux > explicitly ignores pmem namespace labels with that bit set. The reason Can you please point out the function that ignores the flag? I checked where NSLABEL_FLAG_LOCAL is used, but it looks I can't find a related function. > for that is due to the fact that the original definition of the LOCAL > bit from v1.1 of the namespace label implementation [1] explicitly > limited the LOCAL flag to "block aperture" regions. If you clear that > LOCAL flag I expect it will work. To my knowledge Windows pretends > that the v1.1 definition never existed. I'm trying to find out where the flag is used and how to clear it. > The UEFI 2.7 specification for v1.2 labels states that setting the > LOCAL flag is optional when "nlabel", number of labels in the set, is > 1. Linux makes that mandatory as LOCAL is redundant when nlabel is 1. > > That said, the Robustness Principle makes a case that Linux should > tolerate the bit being set. However, it's just a non-trivial amount of > work to unwind the ingrained block-aperture assumptions of that bit. Can you please explain this a bit more? Sorry, I'm new to this area... Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Friday, February 1, 2019 3:47 PM > To: Dexuan Cui > > I believe it's the same reason. Without 11189c1089da the _LSR method > will fail, and otherwise it works and finds the label that it doesn't > like. Exactly. > I'm not seeing "invalid" data in your failure log. Could you double > check that it's just not the success of _LSR that causes the issue? acpi_label_read() never fails for me. By "invalid", I only mean the messages in the dmesg.bad.txt I previously attached (I'm just reading the specs to learn the details about NVDIMM namespace's labels, so my description might be inaccurate) : [4.832367] nvdimm nmem1: nsindex0 labelsize 1 invalid [4.832369] nvdimm nmem1: nsindex1 labelsize 1 invalid ... [5.259017] nd_pmem namespace0.0: 0x, too small must be at least 0x1000 > > > The regression you are seeing is the fact that the patch enables the > > > kernel > to > > > enable nvdimm-namespace-label reads. > > Yes. > > > > > Those reads find a namespace index block > > > and a label. Unfortunately the label has the LOCAL flag set and Linux > > > explicitly ignores pmem namespace labels with that bit set. The reason > > Can you please point out the function that ignores the flag? > > > > I checked where NSLABEL_FLAG_LOCAL is used, but it looks I can't find a > > related function. > > scan_labels() is where the namespace label is validated relative to > the region type: > > if (is_nd_blk(&nd_region->dev) > == !!(flags & NSLABEL_FLAG_LOCAL)) > /* pass, region matches label type */; > else > continue; > > It also has meaning for the namespace capacity allocation > implementation that needed that flag to distinguish aliased capacity > between Block Aperture Mode and PMEM Mode access. Thanks for the pointer! I'm looking at this function. > > > for that is due to the fact that the original definition of the LOCAL > > > bit from v1.1 of the namespace label implementation [1] explicitly > > > limited the LOCAL flag to "block aperture" regions. If you clear that > > > LOCAL flag I expect it will work. To my knowledge Windows pretends > > > that the v1.1 definition never existed. > > I'm trying to find out where the flag is used and how to clear it. > > Assuming Hyper-V implements _LSW, you can recreate / reinitialize the > label area: I think Hyper-V only implements _LSR: [4.720623] nfit ACPI0012:00: device:00: has _LSR [4.723683] nfit ACPI0012:00: device:01: has _LSR > > > The UEFI 2.7 specification for v1.2 labels states that setting the > > > LOCAL flag is optional when "nlabel", number of labels in the set, is > > > 1. Linux makes that mandatory as LOCAL is redundant when nlabel is 1. > > > > > > That said, the Robustness Principle makes a case that Linux should > > > tolerate the bit being set. However, it's just a non-trivial amount of > > > work to unwind the ingrained block-aperture assumptions of that bit. > > Can you please explain this a bit more? Sorry, I'm new to this area... > > The short story is that Linux enforces that LOCAL == Block Mode > Namespaces. See section 2.2 Namespace Label Layout in the original > spec [1]. The EFI 2.7 definition tried to allow for LOCAL to be set > when an interleave-set was comprised of a single NVDIMM, but then also > states its optional when Nlabel is 1. It has zero functional use for > interleave-set based namespaces even when the interleave-set-width is > 1. So Linux takes the option to never set it, and goes further to > reject it if it's set and the region-type does not match, because that > follows the v1.1 meaning of the flag. > > [1]: Thanks for the link! I'll read it. BTW, it looks Hyper-V only supports PMEM namespace, at least so far. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Linux-nvdimm On Behalf Of > Dexuan Cui > Sent: Friday, February 1, 2019 4:34 PM > > > > ... > > > > Those reads find a namespace index block > > > > and a label. Unfortunately the label has the LOCAL flag set and Linux > > > > explicitly ignores pmem namespace labels with that bit set. The reason > > > > for that is due to the fact that the original definition of the LOCAL > > > > bit from v1.1 of the namespace label implementation [1] explicitly > > > > limited the LOCAL flag to "block aperture" regions. If you clear that > > > > LOCAL flag I expect it will work. To my knowledge Windows pretends > > > > that the v1.1 definition never existed. On the libnvdimm-pending branch, I get this: root@decui-gen2-u1904:~/nvdimm# ndctl list root@decui-gen2-u1904:~/nvdimm# ndctl list --idle [ { "dev":"namespace1.0", "mode":"raw", "size":0, "uuid":"----", "state":"disabled" }, { "dev":"namespace0.0", "mode":"raw", "size":0, "uuid":"----", "state":"disabled" } ] With the patch that clears the LOCAL label (BTW, the initial value of flags is 0x3, meaning a read-only local label) : @@ -2496,6 +2500,7 @@ static int init_active_labels(struct nd_region *nd_region) if (!label_ent) break; label = nd_label_active(ndd, j); + label->flags &= ~NSLABEL_FLAG_LOCAL; label_ent->label = label; I get this: root@decui-gen2-u1904:~/nvdimm# ndctl list root@decui-gen2-u1904:~/nvdimm# ndctl list --idle [ { "dev":"namespace1.0", "mode":"raw", "size":0, "uuid":"c258aaab-f72b-e546-bfa5-be5e07761dbc", "state":"disabled", "name":"Microsoft Hyper-V NVDIMM 1 Label" }, { "dev":"namespace0.0", "mode":"raw", "size":0, "uuid":"9f0497a7-4453-7c40-ad35-21a791e00345", "state":"disabled", "name":"Microsoft Hyper-V NVDIMM 0 Label" } ] The "size" and "mode" still don't look right, but the improvement is that now I can see a good descriptive "name", which I suppose is retrieved from Hyper-V. With Ubuntu 19.04 (4.18.0-11-generic), I get this: (Note: the "mode" and "size" are correct. The "uuid" is different from the above "9f0497a7-4453-7c40-ad35-21a791e00345" -- this is weird.) root@decui-gen2-u1904:~# ndctl list [ { "dev":"namespace1.0", "mode":"raw", "size":137438953472, "blockdev":"pmem1" }, { "dev":"namespace0.0", "mode":"fsdax", "map":"dev", "size":33820770304, "uuid":"35018886-397e-4fe7-a348-0a4d16eec44d", "blockdev":"pmem0" } ] I'm trying to find out the correct solution. I apprecite your insights! Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Friday, February 1, 2019 5:29 PM > > ... > > The "size" and "mode" still don't look right, but the improvement is that > > now I can see a good descriptive "name", which I suppose is retrieved > > from Hyper-V. > > Mode is right, there is no way for Hyper-V to create Linux fsdax mode > namespaces it requires some setup using variables only Linux knows. > Can you send the output of: > > ndctl read-labels -j all The output is from a kernel built with the libnvdimm-pending branch plus the one-line patch (label->flags &= ~NSLABEL_FLAG_LOCAL) in init_active_labels(): root@decui-gen2-u1904:~# ndctl read-labels -j all [ { "dev":"nmem1", "index":[ { "signature":"NAMESPACE_INDEX", "major":1, "minor":2, "labelsize":256, "seq":1, "nslot":2 }, { "signature":"NAMESPACE_INDEX", "major":1, "minor":2, "labelsize":256, "seq":2, "nslot":2 } ], "label":[ { "uuid":"c258aaab-f72b-e546-bfa5-be5e07761dbc", "name":"Microsoft Hyper-V NVDIMM 1 Label", "slot":0, "position":0, "nlabel":1, "isetcookie":708891662257476870, "lbasize":0, "dpa":0, "rawsize":137438953472, "type_guid":"79d3f066-f3b4-7440-ac43-0d3318b78cdb", "abstraction_guid":"----" } ] }, { "dev":"nmem0", "index":[ { "signature":"NAMESPACE_INDEX", "major":1, "minor":2, "labelsize":256, "seq":1, "nslot":2 }, { "signature":"NAMESPACE_INDEX", "major":1, "minor":2, "labelsize":256, "seq":2, "nslot":2 } ], "label":[ { "uuid":"9f0497a7-4453-7c40-ad35-21a791e00345", "name":"Microsoft Hyper-V NVDIMM 0 Label", "slot":0, "position":0, "nlabel":1, "isetcookie":708891619307803909, "lbasize":0, "dpa":0, "rawsize":34359738368, "type_guid":"79d3f066-f3b4-7440-ac43-0d3318b78cdb", "abstraction_guid":"----" } ] } ] read 2 nmems > > With Ubuntu 19.04 (4.18.0-11-generic), I get this: > > (Note: the "mode" and "size" are correct. The "uuid" is different from > > the above "9f0497a7-4453-7c40-ad35-21a791e00345" -- this is weird.) > > > > root@decui-gen2-u1904:~# ndctl list > > [ > > { > > "dev":"namespace1.0", > > "mode":"raw", > > "size":137438953472, > > "blockdev":"pmem1" > > }, > > { > > "dev":"namespace0.0", > > "mode":"fsdax", > > "map":"dev", > > "size":33820770304, > > "uuid":"35018886-397e-4fe7-a348-0a4d16eec44d", > > "blockdev":"pmem0" > > } > > ] > > This is because the Ubuntu kernel has the bug that causes _LSR to fail > so Linux falls back to a namespace defined by the region boundary. On > that namespace there is an "fsdax" info block located at the region > base +4K. That info block is tagged with the uuid of > "35018886-397e-4fe7-a348-0a4d16eec44d". Thanks for the explanation! > > I'm trying to find out the correct solution. I apprecite your insights! > > It's a mess. First we need to figure out whether the label is actually > specifying a size of zero, or there is some other bug. I agree. > However, the next problem is going to be adding "fsdax" mode support > on top of the read-only defined namespaces. The ndctl reconfiguration > flow: > >ndctl create-namespace -e namespace0.0 -m fsdax -f > > ...will likely fail because deleting the previous namespace in the > labels is part of that flow. It's always that labels are writable. > > Honestly, the quickest path to something functional for Linux is to > simply delete the _LSR support and use raw mode defined namespaces. > Why have labels if they are read-only and the region is sufficient for > defining boundaries? Just now Hyper-V team confirmed _LSW is not supported. But with Ubuntu 19.04 kernel (4.18.0-11-generic), I'm able to run the commands without any issue: root@decui-gen2-u1904:~# ndctl list [ { "dev":"namespace1.0", "mode":"raw", "size":137438953472, "blockdev":"pmem1" }, { "dev":"namespace0.0", "mode":"fsdax", "map":"dev", "size":33820770304, "uuid":"35018886-397e-4fe7-a348-0a4d16eec44d", "blockdev":"pmem0" } ] root@decui-gen2-u1904:~# ndctl destroy-namespace "namespace0.0" Error: namespace0.0 is active, specify --force for re-configuration error destroying namespaces: Device or resource busy destroyed 0 namespaces root@decui-gen2-u1904:~# ndctl destroy-namespace "namespace0.0" --force destroyed 1 namespace root@decui-gen2-u1904:~# ndctl list [ { "dev":"namespace1.0", "mode":"raw", "size":137438953472, "blockdev":"pmem1" } ] root@decui-gen2-u1904:~# ndctl create-namespace -e namespace0.0 -m fsdax -f { "dev
RE: [PATCH v2] nfit: add Hyper-V NVDIMM DSM command set to white list
> From: Dan Williams > Sent: Friday, February 1, 2019 5:29 PM > ... > Honestly, the quickest path to something functional for Linux is to > simply delete the _LSR support and use raw mode defined namespaces. > Why have labels if they are read-only and the region is sufficient for > defining boundaries? Hyper-V Virtual NVDIMM can already work with Ubuntu 19.04 virtual machine running on Hyper-V, i.e. I can create a raw or fsdax namesapce, and create a xfs or ext4 file sysetm in /dev/pmem0p1, and mount the file system with and without "-o dax". The basic functionality is good. My recent work is mainly for ndctl support, i.e. get the health info by ndctl, and use ndctl to monitor the error events (if applicable). Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] PCI: hv: Add hv_pci_remove_slots() when we unload the driver
When we unload pci-hyperv, the host doesn't send us a PCI_EJECT message. In this case we also need to make sure the sysfs pci slot directory is removed, otherwise "cat /sys/bus/pci/slots/2/address" will trigger "BUG: unable to handle kernel paging request". And, if we unload/reload the driver several times, we'll have multiple pci slot directories in /sys/bus/pci/slots/ like this: root@localhost:~# ls -rtl /sys/bus/pci/slots/ total 0 drwxr-xr-x 2 root root 0 Feb 7 10:49 2 drwxr-xr-x 2 root root 0 Feb 7 10:49 2-1 drwxr-xr-x 2 root root 0 Feb 7 10:51 2-2 The patch adds the missing code, and in hv_eject_device_work() it also moves pci_destroy_slot() to an earlier place where we hold the pci lock. Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") Signed-off-by: Dexuan Cui Cc: sta...@vger.kernel.org Cc: Stephen Hemminger --- drivers/pci/controller/pci-hyperv.c | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 9ba4d12c179c..6b4773727525 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1491,6 +1491,21 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus) } } +/* + * Remove entries in sysfs pci slot directory. + */ +static void hv_pci_remove_slots(struct hv_pcibus_device *hbus) +{ + struct hv_pci_dev *hpdev; + + list_for_each_entry(hpdev, &hbus->children, list_entry) { + if (!hpdev->pci_slot) + continue; + pci_destroy_slot(hpdev->pci_slot); + hpdev->pci_slot = NULL; + } +} + /** * create_root_hv_pci_bus() - Expose a new root PCI bus * @hbus: Root PCI bus, as understood by this driver @@ -1887,6 +1902,10 @@ static void hv_eject_device_work(struct work_struct *work) pci_lock_rescan_remove(); pci_stop_and_remove_bus_device(pdev); pci_dev_put(pdev); + if (hpdev->pci_slot) { + pci_destroy_slot(hpdev->pci_slot); + hpdev->pci_slot = NULL; + } pci_unlock_rescan_remove(); } @@ -1894,9 +1913,6 @@ static void hv_eject_device_work(struct work_struct *work) list_del(&hpdev->list_entry); spin_unlock_irqrestore(&hpdev->hbus->device_list_lock, flags); - if (hpdev->pci_slot) - pci_destroy_slot(hpdev->pci_slot); - memset(&ctxt, 0, sizeof(ctxt)); ejct_pkt = (struct pci_eject_response *)&ctxt.pkt.message; ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE; @@ -2682,6 +2698,7 @@ static int hv_pci_remove(struct hv_device *hdev) pci_lock_rescan_remove(); pci_stop_root_bus(hbus->pci_bus); pci_remove_root_bus(hbus->pci_bus); + hv_pci_remove_slots(hbus); pci_unlock_rescan_remove(); hbus->state = hv_pcibus_removed; } -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] PCI: hv: Add hv_pci_remove_slots() when we unload the driver
> From: Lorenzo Pieralisi > Sent: Tuesday, February 12, 2019 4:13 AM > ... > This patch fixes three bugs: > > 1) set hpdev->pci_slot to NULL > 2) move code destroying the slot inside a locked region in >hv_eject_device_work() > 3) Add missing slots removal code in hv_pci_remove() > > We need three patches, not one. > > (1) and (2), I am not entirely sure we want them in stable kernels, > since they are potential bugs, waiting for your input. > > Lorenzo (1) is actually unnecessary, as I suppose hpdev should be freed at a later place in the same function hv_eject_device_work -> put_pcichild() -> kfree(hpdev). But today I think I found a refcount bug in the hot-remove case and the "kfree(hpdev)" is never called in the hot-remove case. I'll further dig into this and make some extra patches. About (2), it's a race condition that can happen when the device is being hot-removed and we're unloading the pci-hyperv driver at the same time. This is not a normal usage, so I agree it doesn't really need to go into the stables. (3) should go into the stables. I'll make 3 separate patches, and extra patches for the refcount issue, and possible other minor issues. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
When we hot-remove a device, usually the host sends us a PCI_EJECT message, and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when we do the quick hot-add/hot-remove test, the host may not send us the PCI_EJECT message, if the guest has not fully finished the initialization by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's potentially unsafe to only depend on the pci_destroy_slot() in hv_eject_device_work(), though create_root_hv_pci_bus() -> hv_pci_assign_slots() is not called in this case. Note: in this case, the host still sends the guest a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. And, in the quick hot-add/hot-remove test, we can have such a race: before pci_devices_present_work() -> new_pcichild_device() adds the new device into hbus->children, we may have already received the PCI_EJECT message, and hence the taklet handler hv_pci_onchannelcallback() may fail to find the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() -> hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message with bus_rel->device_count == 0 removes the device from hbus->children, and we end up being unable to remove the slot in hv_pci_remove() -> hv_pci_remove_slots(). The patch removes the slot in pci_devices_present_work() when the device is removed. This can address the above race. Note 1: pci_devices_present_work() and hv_eject_device_work() run in the singled-threaded hbus->wq, so there is not a double-remove issue for the slot. Note 2: we can't offload hv_pci_eject_device() from hv_pci_onchannelcallback() to the workqueue, because we need hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to poll the channel's ringbuffer to work around the "hangs in hv_compose_msi_msg()" issue: see commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()") Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") Signed-off-by: Dexuan Cui Cc: sta...@vger.kernel.org --- drivers/pci/controller/pci-hyperv.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index b489412e3502..82acd6155adf 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work) hpdev = list_first_entry(&removed, struct hv_pci_dev, list_entry); list_del(&hpdev->list_entry); + + if (hpdev->pci_slot) + pci_destroy_slot(hpdev->pci_slot); + put_pcichild(hpdev); } -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH 2/3] PCI: hv: Add hv_pci_remove_slots() when we unload the driver
When we unload pci-hyperv, the host doesn't send us a PCI_EJECT message. In this case we also need to make sure the sysfs pci slot directory is removed, otherwise "cat /sys/bus/pci/slots/2/address" will trigger "BUG: unable to handle kernel paging request" (I noticed the issue when systemd-dev crashed for me when I unloaded the driver). And, if we unload/reload the driver several times, we'll have multiple pci slot directories in /sys/bus/pci/slots/ like this: root@localhost:~# ls -rtl /sys/bus/pci/slots/ total 0 drwxr-xr-x 2 root root 0 Feb 7 10:49 2 drwxr-xr-x 2 root root 0 Feb 7 10:49 2-1 drwxr-xr-x 2 root root 0 Feb 7 10:51 2-2 The patch adds the missing code. Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") Signed-off-by: Dexuan Cui Acked-by: Stephen Hemminger Cc: sta...@vger.kernel.org --- drivers/pci/controller/pci-hyperv.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 30f16b882746..b489412e3502 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1486,6 +1486,21 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus) } } +/* + * Remove entries in sysfs pci slot directory. + */ +static void hv_pci_remove_slots(struct hv_pcibus_device *hbus) +{ + struct hv_pci_dev *hpdev; + + list_for_each_entry(hpdev, &hbus->children, list_entry) { + if (!hpdev->pci_slot) + continue; + pci_destroy_slot(hpdev->pci_slot); + hpdev->pci_slot = NULL; + } +} + /** * create_root_hv_pci_bus() - Expose a new root PCI bus * @hbus: Root PCI bus, as understood by this driver @@ -2680,6 +2695,7 @@ static int hv_pci_remove(struct hv_device *hdev) pci_lock_rescan_remove(); pci_stop_root_bus(hbus->pci_bus); pci_remove_root_bus(hbus->pci_bus); + hv_pci_remove_slots(hbus); pci_unlock_rescan_remove(); hbus->state = hv_pcibus_removed; } -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH 0/3] pci-hyperv: fix memory leak and add pci_destroy_slot()
Patch #1 fixes a memory leak caused by incorrectly-maintained hpdev->refs. Patch #2 and #3 make sure the "slot" is removed in all the scenarios. Without them, in the quick hot-add/hot-remove test, systemd-dev may easily crash when trying to access a dangling sys file in /sys/bus/pci/slots/: "BUG: unable to handle kernel paging request". BTW, Patch #2 was posted on Feb 7, 2019, and this is the v2: the change to hv_eject_device_work() in v1 is removed, as the change is only needed when we hot-remove the device and remove the pci-hyperv driver at the same time. It looks more work is required to make this scenaro work correctly, and since removing the driver is not really a "usual" usage, we can address this scenario in the future. Please review the patchset. Dexuan Cui (3): PCI: hv: Fix a memory leak in hv_eject_device_work() PCI: hv: Add hv_pci_remove_slots() when we unload the driver PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary drivers/pci/controller/pci-hyperv.c | 23 +++ 1 file changed, 23 insertions(+) -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
After a device is just created in new_pcichild_device(), hpdev->refs is set to 2 (i.e. the initial value of 1 plus the get_pcichild()). When we hot remove the device from the host, in Linux VM we first call hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3 (let's ignore the paired get/put_pcichild() in other places). But in hv_eject_device_work(), currently we only call put_pcichild() twice, meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch adds one put_pcichild() to fix the memory leak. BTW, the device can also be removed when we run "rmmod pci-hyperv". On this path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()), hpdev->refs is 2, and we do correctly call put_pcichild() twice in pci_devices_present_work(). Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs") Signed-off-by: Dexuan Cui Cc: --- drivers/pci/controller/pci-hyperv.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 95441a35eceb..30f16b882746 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1900,6 +1900,9 @@ static void hv_eject_device_work(struct work_struct *work) sizeof(*ejct_pkt), (unsigned long)&ctxt.pkt, VM_PKT_DATA_INBAND, 0); + /* For the get_pcichild() in hv_pci_eject_device() */ + put_pcichild(hpdev); + /* For the two refs got in new_pcichild_device() */ put_pcichild(hpdev); put_pcichild(hpdev); put_hvpcibus(hpdev->hbus); -- 2.19.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
> From: Michael Kelley > Sent: Wednesday, March 20, 2019 2:38 PM > > From: Dexuan Cui > > > > After a device is just created in new_pcichild_device(), hpdev->refs is set > > to 2 (i.e. the initial value of 1 plus the get_pcichild()). > > > > When we hot remove the device from the host, in Linux VM we first call > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3 > > (let's ignore the paired get/put_pcichild() in other places). But in > > hv_eject_device_work(), currently we only call put_pcichild() twice, > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch > > adds one put_pcichild() to fix the memory leak. > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On > this > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()), > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in > > pci_devices_present_work(). > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me. > There is the reference in the hbus->children list, and there is the reference > that > is returned to the caller. So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev device is about to be destroyed, its reference count can drop to less than 2, i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from hbus->children), and then drop to zero (meaning kfree(hpdev) is called). > But what is strange is that pci_devices_present_work() > overwrites the reference returned in local variable hpdev without doing a > put_pcichild(). I suppose you mean: /* First, mark all existing children as reported missing. */ spin_lock_irqsave(&hbus->device_list_lock, flags); list_for_each_entry(hpdev, &hbus->children, list_entry) { hpdev->reported_missing = true; } spin_unlock_irqrestore(&hbus->device_list_lock, flags) This is not strange to me, because, in pci_devices_present_work(), at first we don't know which devices are about to disappear, so we pre-mark all devices to be potentially missing like that; if a device is still on the bus, we'll mark its hpdev->reported_missing to false later; only after we know exactly which devices are missing, we should call put_pcichild() against them. All these seem natural to me. > It seems like the "normal" reference count should be 1 when the > child device is not being manipulated, not 2. What does "not being manipulated" mean? > The fix would be to add a call to > put_pcichild() when the return value from new_pcichild_device() is > overwritten. In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned from new_pcichild_device(): the "reported_missing" field of the new hpdev is implicitly initialized to false in new_pcichild_device(). > Then remove the call to put_pcichild() in pci_device_present_work() when > missing > children are moved to the local list. The children have been moved from one > list > to another, so there's no need to decrement the reference count. Then when > everything in the local list is deleted, the reference is correctly > decremented, > presumably freeing the memory. > > With this approach, the code in hv_eject_device_work() is correct. There's > one call to put_pcichild() to reflect removing the child device from the > hbus-> > children list, and one call to put_pcichild() to pair with the get_pcichild() > in > hv_pci_eject_device(). Please refer to my replies above. IMO we should fix hv_eject_device_work() rather than pci_devices_present_work(). Thanks -- Dexuan > Your patch works, but to me it leaves the ref count in an unnatural state > most of the time. > > Michael ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
> From: Michael Kelley > Sent: Wednesday, March 20, 2019 2:44 PM > To: Dexuan Cui ; lorenzo.pieral...@arm.com; > bhelg...@google.com; linux-...@vger.kernel.org; KY Srinivasan > > ... > > diff --git a/drivers/pci/controller/pci-hyperv.c > > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct > work_struct *work) > > hpdev = list_first_entry(&removed, struct hv_pci_dev, > > list_entry); > > list_del(&hpdev->list_entry); > > + > > + if (hpdev->pci_slot) > > + pci_destroy_slot(hpdev->pci_slot); > > The code is inconsistent in whether hpdev->pci_slot is set to NULL after > calling > pci_destory_slot(). Here, in pci_devices_present_work(), it's unnecessary to set it to NULL, Because: 1) the "hpdev" is removed from hbus->children and it can not be seen elsewhere; 2) the "hpdev" struct is freed in the below put_pcichild(): while (!list_empty(&removed)) { hpdev = list_first_entry(&removed, struct hv_pci_dev, list_entry); list_del(&hpdev->list_entry); if (hpdev->pci_slot) pci_destroy_slot(hpdev->pci_slot); put_pcichild(hpdev); } > Patch 2 in this series does set it to NULL, but this code does not. In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(), we must set hpdev->pci_slot to NULL, otherwise, later, due to hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present() with the zero "relations", we'll double-free the "hpdev" struct in pci_devices_present_work() -- see the above. > And the code in hv_eject_device_work() does not set it to NULL. It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(), Because in hv_eject_device_work(): 1) the "hpdev" is removed from hbus->children and it can not be seen elsewhere; 2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work(). > It looks like all the places that test the value of hpdev->pci_slot or call > pci_destroy_slot() are serialized, so it looks like it really doesn't matter. > But > when > the code is inconsistent about setting to NULL, it always makes me wonder if > there > is a reason. > > Michael Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
> From: linux-hyperv-ow...@vger.kernel.org > On Behalf Of Dexuan Cui > > ... > > Patch 2 in this series does set it to NULL, but this code does not. > In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(), > we must set hpdev->pci_slot to NULL, otherwise, later, due to > hv_pci_remove() -> hv_pci_bus_exit() -> > hv_pci_devices_present() with the zero "relations", we'll double-free the > "hpdev" struct in pci_devices_present_work() -- see the above. A minor correction: I meant we'll call pci_destroy_slot(hpdev->pci_slot) twice, not "double-free hpdev". Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
> From: Michael Kelley > Sent: Tuesday, March 26, 2019 10:47 AM > To: Lorenzo Pieralisi ; Dexuan Cui > > Cc: bhelg...@google.com; linux-...@vger.kernel.org; KY Srinivasan > ; Stephen Hemminger ; > Sasha Levin ; linux-hyp...@vger.kernel.org; > linux-ker...@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Haiyang > Zhang ; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; vkuznets ; > marcelo.ce...@canonical.com; ja...@mellanox.com; sta...@vger.kernel.org > Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work() > > From: Lorenzo Pieralisi Sent: Tuesday, March 26, > 2019 10:09 AM > > On Thu, Mar 21, 2019 at 12:12:03AM +, Dexuan Cui wrote: > > > > From: Michael Kelley > > > > Sent: Wednesday, March 20, 2019 2:38 PM > > > > > > > > From: Dexuan Cui > > > > > > > > > > After a device is just created in new_pcichild_device(), hpdev->refs > > > > > is > set > > > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()). > > > > > > > > > > When we hot remove the device from the host, in Linux VM we first call > > > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() > and > > > > > then schedules a work of hv_eject_device_work(), so hpdev->refs > becomes 3 > > > > > (let's ignore the paired get/put_pcichild() in other places). But in > > > > > hv_eject_device_work(), currently we only call put_pcichild() twice, > > > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This > > > > > patch > > > > > adds one put_pcichild() to fix the memory leak. > > > > > > > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". > On > > > > this > > > > > path (hv_pci_remove() -> hv_pci_bus_exit() -> > hv_pci_devices_present()), > > > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in > > > > > pci_devices_present_work(). > > > > > > > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me. > > > > There is the reference in the hbus->children list, and there is the > reference that > > > > is returned to the caller. > > > So IMO the "normal" reference count should be 2. :-) IMO only when a > hv_pci_dev > > > device is about to be destroyed, its reference count can drop to less > > > than 2, > > > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed > from > > > hbus->children), and then drop to zero (meaning kfree(hpdev) is called). > > > > > > > But what is strange is that pci_devices_present_work() > > > > overwrites the reference returned in local variable hpdev without doing > > > > a > > > > put_pcichild(). > > > I suppose you mean: > > > > > > /* First, mark all existing children as reported missing. */ > > > spin_lock_irqsave(&hbus->device_list_lock, flags); > > > list_for_each_entry(hpdev, &hbus->children, list_entry) { > > > hpdev->reported_missing = true; > > > } > > > spin_unlock_irqrestore(&hbus->device_list_lock, flags) > > > > > > This is not strange to me, because, in pci_devices_present_work(), at > > > first > we > > > don't know which devices are about to disappear, so we pre-mark all > devices to > > > be potentially missing like that; if a device is still on the bus, we'll > > > mark its > > > hpdev->reported_missing to false later; only after we know exactly which > > > devices are missing, we should call put_pcichild() against them. All these > > > seem natural to me. > > > > > > > It seems like the "normal" reference count should be 1 when the > > > > child device is not being manipulated, not 2. > > > What does "not being manipulated" mean? > > > > > > > The fix would be to add a call to > > > > put_pcichild() when the return value from new_pcichild_device() is > > > > overwritten. > > > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" > returned > > > from new_pcichild_device(): the "reported_missing" field of the new hpdev > > > is implicitly initialized to false in new_pcichild_device(). >
RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
> From: Lorenzo Pieralisi > Sent: Tuesday, March 26, 2019 12:55 PM > On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote: > > When we hot-remove a device, usually the host sends us a PCI_EJECT > message, > > and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But > when > > we do the quick hot-add/hot-remove test, the host may not send us the > > PCI_EJECT message, if the guest has not fully finished the initialization > > by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's > > potentially unsafe to only depend on the pci_destroy_slot() in > > hv_eject_device_work(), though create_root_hv_pci_bus() -> > > hv_pci_assign_slots() is not called in this case. Note: in this case, the > > host still sends the guest a PCI_BUS_RELATIONS message with > > bus_rel->device_count == 0. > > > > And, in the quick hot-add/hot-remove test, we can have such a race: before > > pci_devices_present_work() -> new_pcichild_device() adds the new device > > into hbus->children, we may have already received the PCI_EJECT message, > > and hence the taklet handler hv_pci_onchannelcallback() may fail to find > > the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so > > hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() -> > > hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message > > with bus_rel->device_count == 0 removes the device from hbus->children, > and > > we end up being unable to remove the slot in hv_pci_remove() -> > > hv_pci_remove_slots(). > > > > The patch removes the slot in pci_devices_present_work() when the device > > is removed. This can address the above race. Note 1: > > pci_devices_present_work() and hv_eject_device_work() run in the > > singled-threaded hbus->wq, so there is not a double-remove issue for the > > slot. Note 2: we can't offload hv_pci_eject_device() from > > hv_pci_onchannelcallback() to the workqueue, because we need > > hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to > > poll the channel's ringbuffer to work around the > > "hangs in hv_compose_msi_msg()" issue: see > > commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in > hv_compose_msi_msg()") > > This commit log is unreadable, sorry. Indentation, punctuation and > formatting are just a mess, try to read it, you will notice by > yourself. > > I basically reformatted it completely and pushed the series to > pci/controller-fixes but that's the last time I do it since I am not an > editor, next time I won't merge it. Hi Lorenzo, Thank you for helping improve my changelogs! I did learn a lot after carefully comparing the improved version with my original version. :-) I'll try my best to write a good changelog for my future patches. > More importantly, these patches are marked for stable, given the series > of fixes that triggered this series please ensure it was tested > thoroughly because it is honestly complicate to understand and I do not > want to backport further fixes to stable kernels on top of this. I did the hot-add/hot-remove test in a loop for several thousand times, and the patchset worked as expected and didn't show any issue. > Please have a look and report back. > > Thanks, > Lorenzo Thanks again! Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
When a slot is removed, the pci_dev must still exist. pci_remove_root_bus() removes and free all the pci_devs, so hv_pci_remove_slots() must be called before pci_remove_root_bus(), otherwise a general protection fault can happen, if the kernel is built with the memory debugging options. Fixes: 15becc2b56c6 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver") Signed-off-by: Dexuan Cui Cc: sta...@vger.kernel.org --- When pci-hyperv is unloaded, this panic can happen: general protection fault: CPU: 2 PID: 1091 Comm: rmmod Not tainted 5.2.0+ RIP: 0010:pci_slot_release+0x30/0xd0 Call Trace: kobject_release+0x65/0x190 pci_destroy_slot+0x25/0x60 hv_pci_remove+0xec/0x110 [pci_hyperv] vmbus_remove+0x20/0x30 [hv_vmbus] device_release_driver_internal+0xd5/0x1b0 driver_detach+0x44/0x7c bus_remove_driver+0x75/0xc7 vmbus_driver_unregister+0x50/0xbd [hv_vmbus] __x64_sys_delete_module+0x136/0x200 do_syscall_64+0x5e/0x220 drivers/pci/controller/pci-hyperv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 6b9cc6e60a..68c611d 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -2757,8 +2757,8 @@ static int hv_pci_remove(struct hv_device *hdev) /* Remove the bus from PCI's point of view. */ pci_lock_rescan_remove(); pci_stop_root_bus(hbus->pci_bus); - pci_remove_root_bus(hbus->pci_bus); hv_pci_remove_slots(hbus); + pci_remove_root_bus(hbus->pci_bus); pci_unlock_rescan_remove(); hbus->state = hv_pcibus_removed; } -- 1.8.3.1
RE: [PATCH] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
> From: Bjorn Helgaas > Sent: Friday, August 2, 2019 12:41 PM > The subject line only describes the mechanical code change, which is > obvious from the patch. It would be better if we could say something > about *why* we need this. Hi Bjorn, Sorry. I'll try to write a better changelog in v2. :-) > On Fri, Aug 02, 2019 at 01:32:28AM +, Dexuan Cui wrote: > > > > When a slot is removed, the pci_dev must still exist. > > > > pci_remove_root_bus() removes and free all the pci_devs, so > > hv_pci_remove_slots() must be called before pci_remove_root_bus(), > > otherwise a general protection fault can happen, if the kernel is built > > "general protection fault" is an x86 term that doesn't really say what > the issue is. I suspect this would be a "use-after-free" problem. Yes, it's use-after-free. I'll fix the the wording. > > --- a/drivers/pci/controller/pci-hyperv.c > > +++ b/drivers/pci/controller/pci-hyperv.c > > @@ -2757,8 +2757,8 @@ static int hv_pci_remove(struct hv_device *hdev) > > /* Remove the bus from PCI's point of view. */ > > pci_lock_rescan_remove(); > > pci_stop_root_bus(hbus->pci_bus); > > - pci_remove_root_bus(hbus->pci_bus); > > hv_pci_remove_slots(hbus); > > + pci_remove_root_bus(hbus->pci_bus); > > I'm curious about why we need hv_pci_remove_slots() at all. None of > the other callers of pci_stop_root_bus() and pci_remove_root_bus() do > anything similar to hv_pci_remove_slots(). > > Surely some of those callers also support slots, so there must be some > other path that calls pci_destroy_slot() in those cases. Can we use a > similar strategy here? Originally Stephen Heminger added the slot code for pci-hyperv.c: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") So he may know this better. My understanding is: we can not use the similar stragegy used in the 2 other users of pci_create_slot(): drivers/pci/hotplug/pci_hotplug_core.c calls pci_create_slot(). It looks drivers/pci/hotplug/ is quite different from pci-hyperv.c because pci-hyper-v uses a simple *private* hot-plug protocol, making it impossible to use the API pci_hp_register() and pci_hp_destroy() -> pci_destroy_slot(). drivers/acpi/pci_slot.c calls pci_create_slot(), and saves the created slots in the static "slot_list" list in the same file. Again, since pci-hyper-v uses a private PCI-device-discovery protocol (which is based on VMBus rather the emulated ACPI and PCI), acpi_pci_slot_enumerate() can not find the PCI devices that are discovered by pci-hyperv, so we can not use the standard register_slot() -> pci_create_slot() to create the slots and hence acpi_pci_slot_remove() -> pci_destroy_slot() can not work for pci-hyperv. I think I can use this as the v2 changelog: The slot must be removed before the pci_dev is removed, otherwise a panic can happen due to use-after-free. Thanks, Dexuan
[PATCH v2] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
The slot must be removed before the pci_dev is removed, otherwise a panic can happen due to use-after-free. Fixes: 15becc2b56c6 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver") Signed-off-by: Dexuan Cui Cc: sta...@vger.kernel.org --- Changes in v2: Improved the changelog accordign to the discussion with Bjorn Helgaas: https://lkml.org/lkml/2019/8/1/1173 https://lkml.org/lkml/2019/8/2/1559 drivers/pci/controller/pci-hyperv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 6b9cc6e60a..68c611d 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -2757,8 +2757,8 @@ static int hv_pci_remove(struct hv_device *hdev) /* Remove the bus from PCI's point of view. */ pci_lock_rescan_remove(); pci_stop_root_bus(hbus->pci_bus); - pci_remove_root_bus(hbus->pci_bus); hv_pci_remove_slots(hbus); + pci_remove_root_bus(hbus->pci_bus); pci_unlock_rescan_remove(); hbus->state = hv_pcibus_removed; } -- 1.8.3.1
RE: [PATCH v2] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
> From: linux-hyperv-ow...@vger.kernel.org > On Behalf Of Bjorn Helgaas > Sent: Tuesday, August 6, 2019 1:16 PM > To: Dexuan Cui > > Thanks for updating this. But you didn't update the subject line, > which is really still a little too low-level. Maybe Lorenzo will fix > this. Something like this, maybe? > > PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it This is better. Thanks! I hope Lorenzo can help to fix this so I could avoid a v3. :-) Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] PCI: PM: Also move to D0 before calling pci_legacy_resume_early()
In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN. In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0, but the current code misses the pci_legacy_resume_early() path, so the state remains in PCI_UNKNOWN in that path, and during hiberantion this causes an error for the Mellanox VF driver, which fails to enable MSI-X: pci_msi_supported() is false due to dev->current_state != PCI_D0: mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode mlx4_core a6d1:00:02.0: Sending reset mlx4_core a6d1:00:02.0: Sending vhcr0 mlx4_core a6d1:00:02.0: HCA minimum page size:512 mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95 PM: Device a6d1:00:02.0 failed to thaw: error -95 Signed-off-by: Dexuan Cui --- drivers/pci/pci-driver.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 36dbe960306b..27dfc68db9e7 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device *dev) return error; } - if (pci_has_legacy_pm_support(pci_dev)) - return pci_legacy_resume_early(dev); - /* * pci_restore_state() requires the device to be in D0 (because of MSI * restoration among other things), so force it into D0 in case the * driver's "freeze" callbacks put it into a low-power state directly. */ pci_set_power_state(pci_dev, PCI_D0); + + if (pci_has_legacy_pm_support(pci_dev)) + return pci_legacy_resume_early(dev); + pci_restore_state(pci_dev); if (drv && drv->pm && drv->pm->thaw_noirq) -- 2.19.1
RE: [PATCH] PCI: PM: Also move to D0 before calling pci_legacy_resume_early()
> From: Bjorn Helgaas > Sent: Thursday, August 8, 2019 12:19 PM > To: Dexuan Cui > > On Thu, Aug 08, 2019 at 06:46:51PM +0000, Dexuan Cui wrote: > > > > In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN. > > In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0, > > but the current code misses the pci_legacy_resume_early() path, so the > > state remains in PCI_UNKNOWN in that path, and during hiberantion this > > causes an error for the Mellanox VF driver, which fails to enable > > MSI-X: pci_msi_supported() is false due to dev->current_state != PCI_D0: > > s/hiberantion/hibernation/ Thanks for spoting this typo! :-) > Actually, it sounds more like "during *resume*, this causes an error", > so maybe you want s/hiberantion/resume/ instead? Yes, it's during "resume", and to be more accurate, it happens during the "resume" of the "thaw" phase: when we run "echo disk > /sys/power/state", first the kernel "freezes" all the devices and create a hibernation image, then the kernel "thaws" the devices including the disk/NIC, and writes the memory to the disk and powers down. This patch fixes the error message for the Mellanox VF driver in this phase. When the system starts again, a fresh kernel starts to run, and when the kernel detects that a hibernation image was saved, the kernel "quiesces" the devices, and "restores" the devices from the saved image. Here device_resume_noirq() -> ... -> pci_pm_restore_noirq() -> pci_pm_default_resume_early() -> pci_power_up() moves the device states back to PCI_D0. This path is not broken and doesn't need my patch. Thanks, -- Dexuan
[PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN. In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0, but the current code misses the pci_legacy_resume_early() path, so the state remains in PCI_UNKNOWN in that path. As a result, in the resume phase of hibernation, this causes an error for the Mellanox VF driver, which fails to enable MSI-X because pci_msi_supported() is false due to dev->current_state != PCI_D0: mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode mlx4_core a6d1:00:02.0: Sending reset mlx4_core a6d1:00:02.0: Sending vhcr0 mlx4_core a6d1:00:02.0: HCA minimum page size:512 mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95 PM: Device a6d1:00:02.0 failed to thaw: error -95 To be more accurate, the "resume" phase means the "thaw" callbacks which run before the system enters hibernation: when the user runs the command "echo disk > /sys/power/state" for hibernation, first the kernel "freezes" all the devices and creates a hibernation image, then the kernel "thaws" the devices including the disk/NIC, writes the memory to the disk, and powers down. This patch fixes the error message for the Mellanox VF driver in this phase. When the system starts again, a fresh kernel starts to run, and when the kernel detects that a hibernation image was saved, the kernel "quiesces" the devices, and then "restores" the devices from the saved image. In this path: device_resume_noirq() -> ... -> pci_pm_restore_noirq() -> pci_pm_default_resume_early() -> pci_power_up() moves the device states back to PCI_D0. This path is not broken and doesn't need my patch. Signed-off-by: Dexuan Cui --- changes in v2: Updated the changelog with more details. drivers/pci/pci-driver.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 36dbe960306b..27dfc68db9e7 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device *dev) return error; } - if (pci_has_legacy_pm_support(pci_dev)) - return pci_legacy_resume_early(dev); - /* * pci_restore_state() requires the device to be in D0 (because of MSI * restoration among other things), so force it into D0 in case the * driver's "freeze" callbacks put it into a low-power state directly. */ pci_set_power_state(pci_dev, PCI_D0); + + if (pci_has_legacy_pm_support(pci_dev)) + return pci_legacy_resume_early(dev); + pci_restore_state(pci_dev); if (drv && drv->pm && drv->pm->thaw_noirq) -- 2.19.1
RE: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
> From: Dexuan Cui > Sent: Tuesday, August 13, 2019 6:07 PM > To: lorenzo.pieral...@arm.com; bhelg...@google.com; > linux-...@vger.kernel.org > Cc: Michael Kelley ; linux-hyp...@vger.kernel.org; > linux-ker...@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Sasha > Levin ; Haiyang Zhang > ; KY Srinivasan ; > o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; vkuznets > ; marcelo.ce...@canonical.com; Stephen Hemminger > ; ja...@mellanox.com; Dexuan Cui > > Subject: [PATCH v2] PCI: PM: Move to D0 before calling > pci_legacy_resume_early() > > > In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN. > > In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0, > but the current code misses the pci_legacy_resume_early() path, so the > state remains in PCI_UNKNOWN in that path. As a result, in the resume > phase of hibernation, this causes an error for the Mellanox VF driver, > which fails to enable MSI-X because pci_msi_supported() is false due > to dev->current_state != PCI_D0: > > mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode > mlx4_core a6d1:00:02.0: Sending reset > mlx4_core a6d1:00:02.0: Sending vhcr0 > mlx4_core a6d1:00:02.0: HCA minimum page size:512 > mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode > mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, > aborting > PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95 > PM: Device a6d1:00:02.0 failed to thaw: error -95 > > To be more accurate, the "resume" phase means the "thaw" callbacks which > run before the system enters hibernation: when the user runs the command > "echo disk > /sys/power/state" for hibernation, first the kernel "freezes" > all the devices and creates a hibernation image, then the kernel "thaws" > the devices including the disk/NIC, writes the memory to the disk, and > powers down. This patch fixes the error message for the Mellanox VF driver > in this phase. > > When the system starts again, a fresh kernel starts to run, and when the > kernel detects that a hibernation image was saved, the kernel "quiesces" > the devices, and then "restores" the devices from the saved image. In this > path: > device_resume_noirq() -> ... -> > pci_pm_restore_noirq() -> > pci_pm_default_resume_early() -> > pci_power_up() moves the device states back to PCI_D0. This path is > not broken and doesn't need my patch. > > Signed-off-by: Dexuan Cui > --- > > changes in v2: > Updated the changelog with more details. > > drivers/pci/pci-driver.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index 36dbe960306b..27dfc68db9e7 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device *dev) > return error; > } > > - if (pci_has_legacy_pm_support(pci_dev)) > - return pci_legacy_resume_early(dev); > - > /* >* pci_restore_state() requires the device to be in D0 (because of MSI >* restoration among other things), so force it into D0 in case the >* driver's "freeze" callbacks put it into a low-power state directly. >*/ > pci_set_power_state(pci_dev, PCI_D0); > + > + if (pci_has_legacy_pm_support(pci_dev)) > + return pci_legacy_resume_early(dev); > + > pci_restore_state(pci_dev); > > if (drv && drv->pm && drv->pm->thaw_noirq) > -- Hi, Lorenzo, Bjorn, Can you please take a look at the v2 ? Thanks, -- Dexuan
RE: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
> From: devel On Behalf Of > Dexuan Cui > Sent: Monday, September 2, 2019 5:35 PM > To: lorenzo.pieral...@arm.com; bhelg...@google.com; > linux-...@vger.kernel.org > [..snipped...] > > --- > > > > changes in v2: > > Updated the changelog with more details. > > > > drivers/pci/pci-driver.c | 7 --- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > > index 36dbe960306b..27dfc68db9e7 100644 > > --- a/drivers/pci/pci-driver.c > > +++ b/drivers/pci/pci-driver.c > > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device > *dev) > > return error; > > } > > > > - if (pci_has_legacy_pm_support(pci_dev)) > > - return pci_legacy_resume_early(dev); > > - > > /* > > * pci_restore_state() requires the device to be in D0 (because of MSI > > * restoration among other things), so force it into D0 in case the > > * driver's "freeze" callbacks put it into a low-power state directly. > > */ > > pci_set_power_state(pci_dev, PCI_D0); > > + > > + if (pci_has_legacy_pm_support(pci_dev)) > > + return pci_legacy_resume_early(dev); > > + > > pci_restore_state(pci_dev); > > > > if (drv && drv->pm && drv->pm->thaw_noirq) > > -- > > Hi, Lorenzo, Bjorn, > > Can you please take a look at the v2 ? > > -- Dexuan Hi Lorenzo, Bjorn, and all, It looks this patch has not been acked by anyone. Should I resend it? There is no change since v2. Thanks, -- Dexuan
RE: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
> -Original Message- > From: Bjorn Helgaas > Sent: Monday, October 7, 2019 6:24 AM > To: Dexuan Cui > Cc: lorenzo.pieral...@arm.com; linux-...@vger.kernel.org; Michael Kelley > ; linux-hyp...@vger.kernel.org; > linux-ker...@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Sasha > Levin ; Haiyang Zhang > ; KY Srinivasan ; > o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; vkuznets > ; marcelo.ce...@canonical.com; Stephen Hemminger > ; ja...@mellanox.com > Subject: Re: [PATCH v2] PCI: PM: Move to D0 before calling > pci_legacy_resume_early() > > On Wed, Aug 14, 2019 at 01:06:55AM +, Dexuan Cui wrote: > > > > In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN. > > > > In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0, > > but the current code misses the pci_legacy_resume_early() path, so the > > state remains in PCI_UNKNOWN in that path. As a result, in the resume > > phase of hibernation, this causes an error for the Mellanox VF driver, > > which fails to enable MSI-X because pci_msi_supported() is false due > > to dev->current_state != PCI_D0: > > > > mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode > > mlx4_core a6d1:00:02.0: Sending reset > > mlx4_core a6d1:00:02.0: Sending vhcr0 > > mlx4_core a6d1:00:02.0: HCA minimum page size:512 > > mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode > > mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, > aborting > > PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95 > > PM: Device a6d1:00:02.0 failed to thaw: error -95 > > > > To be more accurate, the "resume" phase means the "thaw" callbacks which > > run before the system enters hibernation: when the user runs the command > > "echo disk > /sys/power/state" for hibernation, first the kernel "freezes" > > all the devices and creates a hibernation image, then the kernel "thaws" > > the devices including the disk/NIC, writes the memory to the disk, and > > powers down. This patch fixes the error message for the Mellanox VF driver > > in this phase. > > > > When the system starts again, a fresh kernel starts to run, and when the > > kernel detects that a hibernation image was saved, the kernel "quiesces" > > the devices, and then "restores" the devices from the saved image. In this > > path: > > device_resume_noirq() -> ... -> > > pci_pm_restore_noirq() -> > > pci_pm_default_resume_early() -> > > pci_power_up() moves the device states back to PCI_D0. This path is > > not broken and doesn't need my patch. > > > > Signed-off-by: Dexuan Cui > > This looks like a bugfix for 5839ee7389e8 ("PCI / PM: Force devices to > D0 in pci_pm_thaw_noirq()") so maybe it should be marked for stable as > 5839ee7389e8 was? > > Rafael, could you confirm? > > > --- > > > > changes in v2: > > Updated the changelog with more details. > > > > drivers/pci/pci-driver.c | 7 --- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > > index 36dbe960306b..27dfc68db9e7 100644 > > --- a/drivers/pci/pci-driver.c > > +++ b/drivers/pci/pci-driver.c > > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device > *dev) > > return error; > > } > > > > - if (pci_has_legacy_pm_support(pci_dev)) > > - return pci_legacy_resume_early(dev); > > - > > /* > > * pci_restore_state() requires the device to be in D0 (because of MSI > > * restoration among other things), so force it into D0 in case the > > * driver's "freeze" callbacks put it into a low-power state directly. > > */ > > pci_set_power_state(pci_dev, PCI_D0); > > + > > + if (pci_has_legacy_pm_support(pci_dev)) > > + return pci_legacy_resume_early(dev); > > + > > pci_restore_state(pci_dev); > > > > if (drv && drv->pm && drv->pm->thaw_noirq) > > -- > > 2.19.1 > > Added Rafael to "To". Thanks, -- Dexuan
RE: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
> From: Bjorn Helgaas > Sent: Tuesday, October 8, 2019 12:56 PM > ... > Wordsmithing nit: what the patch does is not "fix the error message"; > what it does is fix the *problem*, i.e., the fact that we can't > operate the device because we can't enable MSI-X. The message is only > a symptom. I totally agree. :-) > IIUC the relevant part of the system hibernation sequence is: > > pci_pm_freeze_noirq > pci_pm_thaw_noirq > pci_pm_thaw > > And the execution flow is: > > pci_pm_freeze_noirq > if (pci_has_legacy_pm_support(pci_dev)) # true for mlx4 > pci_legacy_suspend_late(dev, PMSG_FREEZE) > pci_pm_set_unknown_state > dev->current_state = PCI_UNKNOWN # <--- > pci_pm_thaw_noirq > if (pci_has_legacy_pm_support(pci_dev)) # true > pci_legacy_resume_early(dev) # noop; mlx4 doesn't > implement > pci_pm_thaw # returns -95 > EOPNOTSUPP > if (pci_has_legacy_pm_support(pci_dev)) # true > pci_legacy_resume > drv->resume > mlx4_resume # mlx4_driver.resume (legacy) > mlx4_load_one > mlx4_enable_msi_x > pci_enable_msix_range > __pci_enable_msix_range > __pci_enable_msix > if (!pci_msi_supported()) > if (dev->current_state != PCI_D0) # <--- > return 0 > return -EINVAL > err = -EOPNOTSUPP > "INTx is not supported ..." > > (These are just my notes; you don't need to put them all into the > commit message. I'm just sharing them in case I'm not understanding > correctly.) Yes, these notes are accurate. > > > > > When the system starts again, a fresh kernel starts to run, and when > > > > > the > > > > > kernel detects that a hibernation image was saved, the kernel > "quiesces" > > > > > the devices, and then "restores" the devices from the saved image. In > this > > > > > path: > > > > > device_resume_noirq() -> ... -> > > > > >pci_pm_restore_noirq() -> > > > > > pci_pm_default_resume_early() -> > > > > >pci_power_up() moves the device states back to PCI_D0. This > path is > > > > > not broken and doesn't need my patch. > > > > > > > The cc list suggests that this might be a fix for a user-reported > problem. Is there a launchpad or similar link you could include here? I guess I'm the first one to notice the issue and there is not any bug link AFAIK. The hibernation process usually saves the states into a local disk (before the system is powered off), and the Mellanox NIC is not needed during the process, so it's not a real issue that the NIC can not work between pci_pm_thaw() and power_down(). This may explain why nobody else noticed the issue. I happened to see the error message, and hence investigated the issue. > Should this be marked for stable? I think we should do it. > > > > > --- a/drivers/pci/pci-driver.c > > > > > +++ b/drivers/pci/pci-driver.c > > > > > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device > > > > *dev) > > > > > return error; > > > > > } > > > > > > > > > > - if (pci_has_legacy_pm_support(pci_dev)) > > > > > - return pci_legacy_resume_early(dev); > > > > > - > > > > > /* > > > > >* pci_restore_state() requires the device to be in D0 (because > of MSI > > > > >* restoration among other things), so force it into D0 in case > the > > > > >* driver's "freeze" callbacks put it into a low-power state > directly. > > > > >*/ > > > > > pci_set_power_state(pci_dev, PCI_D0); > > > > > + > > > > > + if (pci_has_legacy_pm_support(pci_dev)) > > > > > + return pci_legacy_resume_early(dev); > > > > > + > > > > > pci_restore_state(pci_dev); > > > > > > > > > > if (drv && drv->pm && drv->pm->thaw_noirq) > > > > > -- > > > > > 2.19.1 > > > > > > > The patch looks reasonable to me, but the comment above the > > pci_set_power_state() call needs to be updated too IMO. > > Hmm. > > 1) pci_restore_state() mainly writes config space, which doesn't > require the device to be in D0. The only thing I see that would > require D0 is the MSI-X MMIO space, so to be more specific, the > comment could say "restoring the MSI-X *MMIO* state requires the > device to be in D0". > > But I think you meant some other comment change. Did you mean > something along the lines of "a legacy drv->resume_early() callback > and pci_restore_state() both require the device to be in D0"? > > If something else, maybe you could propose some text? > > 2) I assume pci_pm_thaw_noirq() should leave the device in a > functionally equivalent state, whether it uses legacy PM or not. Do > we want something like the patch below instead? If we *do* want to > skip pci_restore_state() for legacy PM, maybe we should add a comment. > > 3) Documentation/power/pci.rst says: > >
RE: [PATCH v3 0/7] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
> From: Bjorn Helgaas > Sent: Monday, October 14, 2019 4:00 PM > ... > > Dexuan, the important thing here is the first patch, which is your [1], > which I modified by doing pci_restore_state() as well as setting to D0: > > pci_set_power_state(pci_dev, PCI_D0); > pci_restore_state(pci_dev); > > I'm proposing some more patches on top. None are relevant to the problem > you're solving; they're just minor doc and other updates in the same area. > > Rafael, if you have a chance to look at these, I'd appreciate it. I tried > to make the doc match the code, but I'm no PM expert. Thank you very much, Bjorn! The patchset looks good to me. Thanks, -- Dexuan
RE: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Saturday, November 8, 2014 1:09 AM > To: KY Srinivasan; Haiyang Zhang; Greg Kroah-Hartman > Cc: de...@linuxdriverproject.org; linux-ker...@vger.kernel.org; Dexuan Cui > Subject: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement > for the failure case > > This patch series addresses the following issues: > - Wrong error reporting for multiple filesystems case. > - Skip all readonly-mounted filesystems instead of skipping iso9660. > - Thaw all filesystems after an unsuccessful freeze attempt. > > Vitaly Kuznetsov (3): > Tools: hv: vssdaemon: consult with errno in case of failure only > Tools: hv: vssdaemon: skip all filesystems mounted readonly > Tools: hv: vssdaemon: thaw everything in case of freeze failure > > tools/hv/hv_vss_daemon.c | 14 -- > 1 file changed, 12 insertions(+), 2 deletions(-) Hi Vitaly, Thanks for your patchset! FYI: Greg checked in a patch of mine several hours ago -- my patch implemented "thaw all filesytems on a failure of freeze" too. :-) Please see my patch in Greg's char-misc-next tree: https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/commit/?h=char-misc-next&id=4f689190bb55d171d2f6614f8a6cbd4b868e48bd Can you please rebase your patch(es) on Greg's tree? Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2 1/2] Tools: hv: vssdaemon: report freeze errors
> -Original Message- > From: Vitaly Kuznetsov > Sent: Tuesday, November 11, 2014 0:37 AM > To: KY Srinivasan; Haiyang Zhang; Greg Kroah-Hartman > Cc: de...@linuxdriverproject.org; linux-ker...@vger.kernel.org; Dexuan Cui > Subject: [PATCH v2 1/2] Tools: hv: vssdaemon: report freeze errors > > When ioctl(fd, FIFREEZE, 0) results in an error we cannot report it > to syslog instantly since that can cause write to a frozen disk. > However, the name of the filesystem which caused the error and errno > are valuable and we would like to get a nice human-readable message > in the log. Save errno before calling vss_operate(VSS_OP_THAW) and > report the error right after. > > Unfortunately, FITHAW errors cannot be reported the same way as we > need to finish thawing all filesystems before calling syslog(). > > We should also avoid calling endmntent() for the second time in > case we encountered an error during freezing of '/' as it usually > results in SEGSEGV. > > Signed-off-by: Vitaly Kuznetsov > --- > tools/hv/hv_vss_daemon.c | 16 > 1 file changed, 12 insertions(+), 4 deletions(-) Hi Vitaly, Thanks for the patch -- it does improve the error handling! Acked-by: Dexuan Cui -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2 2/2] Tools: hv: vssdaemon: skip all filesystems mounted readonly
> -Original Message- > From: Vitaly Kuznetsov > Sent: Tuesday, November 11, 2014 0:37 AM > To: KY Srinivasan; Haiyang Zhang; Greg Kroah-Hartman > Cc: de...@linuxdriverproject.org; linux-ker...@vger.kernel.org; Dexuan Cui > Subject: [PATCH v2 2/2] Tools: hv: vssdaemon: skip all filesystems mounted > readonly > > Instead of making a list of exceptions for readonly filesystems > in addition to iso9660 we already have it is better to skip freeze > operation for all readonly-mounted filesystems. > > Signed-off-by: Vitaly Kuznetsov > --- > tools/hv/hv_vss_daemon.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c > index ee44f0d..5e63f70 100644 > --- a/tools/hv/hv_vss_daemon.c > +++ b/tools/hv/hv_vss_daemon.c > @@ -102,7 +102,7 @@ static int vss_operate(int operation) > while ((ent = getmntent(mounts))) { > if (strncmp(ent->mnt_fsname, match, strlen(match))) > continue; > - if (strcmp(ent->mnt_type, "iso9660") == 0) > + if (hasmntopt(ent, MNTOPT_RO) != NULL) > continue; > if (strcmp(ent->mnt_type, "vfat") == 0) > continue; > -- Acked-by: Dexuan Cui ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure
In the case the user-space daemon crashes, hangs or is killed, we need to down the semaphore, otherwise, after the daemon starts next time, the obsolete data in fcopy_transaction.message or fcopy_transaction.fcopy_msg will be used immediately. Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- drivers/hv/hv_fcopy.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 23b2ce2..177122a 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct *dummy) * process the pending transaction. */ fcopy_respond_to_host(HV_E_FAIL); + + /* In the case the user-space daemon crashes, hangs or is killed, we +* need to down the semaphore, otherwise, after the daemon starts next +* time, the obsolete data in fcopy_transaction.message or +* fcopy_transaction.fcopy_msg will be used immediately. +*/ + if (down_trylock(&fcopy_transaction.read_sema)) + pr_debug("FCP: failed to acquire the semaphore\n"); + } static int fcopy_handle_handshake(u32 version) -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
Under high memory pressure and very high KVP R/W test pressure, the netlink recvfrom() may transiently return ENOBUFS to the daemon -- we found this during a 2-week stress test. We'd better not terminate the daemon on this failure, because a typical KVP user can re-try the R/W and hopefully it will succeed next time. Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- tools/hv/hv_kvp_daemon.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 22b0764..9f4b303 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[]) addr_p, &addr_l); if (len < 0) { + int saved_errno = errno; syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s", addr.nl_pid, errno, strerror(errno)); + + if (saved_errno == ENOBUFS) { + syslog(LOG_ERR, "error = ENOBUFS: ignored"); + continue; + } + close(fd); return -1; } -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
> -Original Message- > From: Vitaly Kuznetsov > Sent: Wednesday, November 19, 2014 18:50 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; Haiyang Zhang > Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon > > Dexuan Cui writes: > > > Under high memory pressure and very high KVP R/W test pressure, the > netlink > > recvfrom() may transiently return ENOBUFS to the daemon -- we found > this > > during a 2-week stress test. > > > > We'd better not terminate the daemon on this failure, because a typical > KVP > > user can re-try the R/W and hopefully it will succeed next time. > > > > diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c > > index 22b0764..9f4b303 100644 > > --- a/tools/hv/hv_kvp_daemon.c > > +++ b/tools/hv/hv_kvp_daemon.c > > @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[]) > > addr_p, &addr_l); > > > > if (len < 0) { > > + int saved_errno = errno; > > syslog(LOG_ERR, "recvfrom failed; pid:%u > error:%d %s", > > addr.nl_pid, errno, strerror(errno)); > > + > > + if (saved_errno == ENOBUFS) { > > is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd suggest > we ignore these as well in such case. Ignoring ENOMEM here is doubtful, > I think. But possible. > > Vitaly I don't think EAGAIN is possible because "man recvfrom" says "If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and the external variable errno is set to EAGAIN or EWOULDBLOCK". The same man page mention ENOMEM for recvmsg(), but not recvfrom(). -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Wednesday, November 19, 2014 20:41 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; Haiyang Zhang > Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon > > Dexuan Cui writes: > > >> -Original Message- > >> From: Vitaly Kuznetsov > >> Sent: Wednesday, November 19, 2014 18:50 PM > >> To: Dexuan Cui > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > >> jasow...@redhat.com; Haiyang Zhang > >> Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon > >> > >> Dexuan Cui writes: > >> > >> > Under high memory pressure and very high KVP R/W test pressure, > the netlink > >> > recvfrom() may transiently return ENOBUFS to the daemon -- we found > this > >> > during a 2-week stress test. > >> > > >> > We'd better not terminate the daemon on this failure, because a > typical KVP > >> > user can re-try the R/W and hopefully it will succeed next time. > >> > > >> > diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c > >> > index 22b0764..9f4b303 100644 > >> > --- a/tools/hv/hv_kvp_daemon.c > >> > +++ b/tools/hv/hv_kvp_daemon.c > >> > @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[]) > >> > addr_p, &addr_l); > >> > > >> > if (len < 0) { > >> > +int saved_errno = errno; > >> > syslog(LOG_ERR, "recvfrom failed; pid:%u > error:%d %s", > >> > addr.nl_pid, errno, > >> > strerror(errno)); > >> > + > >> > +if (saved_errno == ENOBUFS) { > >> > >> is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd > suggest > >> we ignore these as well in such case. Ignoring ENOMEM here is doubtful, > >> I think. But possible. > >> > >> Vitaly > > > > I don't think EAGAIN is possible because "man recvfrom" says > >"If no messages are available at the socket, the receive calls wait for > > a > > message to arrive, unless the socket is nonblocking (see fcntl(2)), in > which > > case the value -1 is returned and the external variable errno is > > set to > > EAGAIN or EWOULDBLOCK". > > > > The same man page mention ENOMEM for recvmsg(), but not recvfrom(). > > Ah, sorry, I though your patch patches the other place: call to > netlink_send() which does sendmsg() (and my > EAGAIN/EWOULDBLOCK/ENOMEM > comment was about it). It could also make sense to patch them both as I > think it is possible to hit these as well. > > > -- Dexuan > -- > Vitaly OK, I can add this new check: (I'll send out the v2 tomorrow in case people have new comments) --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1770,8 +1770,15 @@ kvp_done: len = netlink_send(fd, incoming_cn_msg); if (len < 0) { + int saved_errno = errno; syslog(LOG_ERR, "net_link send failed; error: %d %s", errno, strerror(errno)); + + if (saved_errno == ENOMEM || saved_errno == EAGAIN) { + syslog(LOG_ERR, "send error: ignored"); + continue; + } + exit(EXIT_FAILURE); } } Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
> -Original Message- > From: Vitaly Kuznetsov > >> -- > >> Vitaly > > > > OK, I can add this new check: > > (I'll send out the v2 tomorrow in case people have new comments) > > > > Thanks! > > > --- a/tools/hv/hv_kvp_daemon.c > > +++ b/tools/hv/hv_kvp_daemon.c > > @@ -1770,8 +1770,15 @@ kvp_done: > > > > len = netlink_send(fd, incoming_cn_msg); > > if (len < 0) { > > + int saved_errno = errno; > > syslog(LOG_ERR, "net_link send failed; error: %d > > %s", errno, > > strerror(errno)); > > + > > + if (saved_errno == ENOMEM || saved_errno == > > EAGAIN) { > > Sorry for being pushy, but it seems ENOBUFS is also possible here (at > least man sendmsg mentions it). OK, I'll add this too. :-) BTW, I realized sendmsg() can't return EAGAIN here as that's for non-blocking socket. Here I simply ignore the error, hoping the other end will re-try. > > > + syslog(LOG_ERR, "send error: ignored"); > > + continue; > > + } > > + > > exit(EXIT_FAILURE); > > } > > } > > > > Thanks, > > -- Dexuan > > Vitaly -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2] tools: hv: ignore ENOBUFS and ENOMEM in the KVP daemon
Under high memory pressure and very high KVP R/W test pressure, the netlink recvfrom() may transiently return ENOBUFS to the daemon -- we found this during a 2-week stress test. We'd better not terminate the daemon on the failure, because a typical KVP user will re-try the R/W and hopefully it will succeed next time. We can also ignore the errors on sending. Cc: Vitaly Kuznetsov Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- v2: I also ignore the errors on sending, as Vitaly suggested. tools/hv/hv_kvp_daemon.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 22b0764..6a6432a 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[]) addr_p, &addr_l); if (len < 0) { + int saved_errno = errno; syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s", addr.nl_pid, errno, strerror(errno)); + + if (saved_errno == ENOBUFS) { + syslog(LOG_ERR, "receive error: ignored"); + continue; + } + close(fd); return -1; } @@ -1763,8 +1770,15 @@ kvp_done: len = netlink_send(fd, incoming_cn_msg); if (len < 0) { + int saved_errno = errno; syslog(LOG_ERR, "net_link send failed; error: %d %s", errno, strerror(errno)); + + if (saved_errno == ENOMEM || saved_errno == ENOBUFS) { + syslog(LOG_ERR, "send error: ignored"); + continue; + } + exit(EXIT_FAILURE); } } -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: KY Srinivasan > Sent: Thursday, November 20, 2014 6:59 AM > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index > > 23b2ce2..177122a 100644 > > --- a/drivers/hv/hv_fcopy.c > > +++ b/drivers/hv/hv_fcopy.c > > @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct > > *dummy) > > * process the pending transaction. > > */ > > fcopy_respond_to_host(HV_E_FAIL); > > + > > +/* In the case the user-space daemon crashes, hangs or is killed, we > > + * need to down the semaphore, otherwise, after the daemon starts > > next > > + * time, the obsolete data in fcopy_transaction.message or > > + * fcopy_transaction.fcopy_msg will be used immediately. > > + */ > > +if (down_trylock(&fcopy_transaction.read_sema)) > > +pr_debug("FCP: failed to acquire the semaphore\n"); > > + > > } > > When the daemon is killed, we currently reset the state in the release > function. Why can't we cleanup the semaphore state (initialize) here as well. > > K. Y Hi KY, 1) The down_trylock() here is necessary: the daemon can fail to respond in 5 seconds due to many reasons, e.g., the VM's CPU and I/O are too busy. In this case, the daemon may become running later(NOTE: in this example, the daemon is not killed), but from the host user's point of view, the PowerShell copy-vmfile command has failed, so here we have to 'down' the semaphore anyway, otherwise, the daemon can get obsolete data. 2) If we add a line sema_init(&fcopy_transaction.read_sema, 0); in fcopy_release(), it seems OK at a glance, but we have to handle the race condition: the above down_trylock() and the sema_init() can, in theory, run simultaneously on different virtual CPUs. It's tricky to address this. 3) So I think we can reuse the same semaphore without an actually unnecessary re-initialization. :-) Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: KY Srinivasan > Sent: Friday, November 21, 2014 1:58 AM > To: Dexuan Cui; gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > a...@canonical.com; jasow...@redhat.com > Cc: Haiyang Zhang > Subject: RE: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer > failure > > -Original Message- > > From: Dexuan Cui > > Sent: Wednesday, November 19, 2014 11:48 PM > > To: KY Srinivasan; gre...@linuxfoundation.org; linux- > > ker...@vger.kernel.org; driverdev-devel@linuxdriverproject.org; > > o...@aepfle.de; a...@canonical.com; jasow...@redhat.com > > Cc: Haiyang Zhang > > Subject: RE: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer > > failure > > > > > -Original Message- > > > From: KY Srinivasan > > > Sent: Thursday, November 20, 2014 6:59 AM > > > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index > > > > 23b2ce2..177122a 100644 > > > > --- a/drivers/hv/hv_fcopy.c > > > > +++ b/drivers/hv/hv_fcopy.c > > > > @@ -86,6 +86,15 @@ static void fcopy_work_func(struct > work_struct > > > > *dummy) > > > > * process the pending transaction. > > > > */ > > > > fcopy_respond_to_host(HV_E_FAIL); > > > > + > > > > +/* In the case the user-space daemon crashes, hangs or is killed, > > > > +we > > > > + * need to down the semaphore, otherwise, after the daemon starts > > > > next > > > > + * time, the obsolete data in fcopy_transaction.message or > > > > + * fcopy_transaction.fcopy_msg will be used immediately. > > > > + */ > > > > +if (down_trylock(&fcopy_transaction.read_sema)) > > > > +pr_debug("FCP: failed to acquire the semaphore\n"); > > > > + > > > > } > > > > > > When the daemon is killed, we currently reset the state in the release > > > function. Why can't we cleanup the semaphore state (initialize) here as > > well. > > > > > > K. Y > > > > Hi KY, > > 1) The down_trylock() here is necessary: the daemon can fail to respond > in 5 > > seconds due to many reasons, e.g., the VM's CPU and I/O are too busy. In > > this case, the daemon may become running later(NOTE: in this example, > the > > daemon is not killed), but from the host user's point of view, the > PowerShell > > copy-vmfile command has failed, so here we have to 'down' the > semaphore > > anyway, otherwise, the daemon can get obsolete data. > > > > 2) If we add a line > > sema_init(&fcopy_transaction.read_sema, 0); in fcopy_release(), it seems > > OK at a glance, but we have to handle the race > > condition: the above down_trylock() and the sema_init() can, in theory, > run > > simultaneously on different virtual CPUs. It's tricky to address this. > > > > 3) So I think we can reuse the same semaphore without an actually > > unnecessary re-initialization. :-) > > Agreed; you may want to get rid of the pr_debug() call though. > > Thanks, > > K. Y The pr_debug() is added intentionally according to suggestion of Redhat's Vitaly Kuznetsov in the bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5 The function is declared with__must_check in include/linux/semaphore.h: extern int __must_check down_trylock(struct semaphore *sem); Without checking the return value, we'll get these warning if the "Kernel hacking" options are enabled: drivers/hv/hv_fcopy.c: In function 'fcopy_work_func': drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of 'down_trylock', declared with attribute warn_unused_result [-Wunused-result] (void)down_trylock(&fcopy_transaction.read_sema); ^ In practice, the message I add should be very rare since it's very unlikely to fail to get the semaphore in this timeout case -- and in case this happens, it's actually OK, because the driver has told the host user the PowerShell command should fail. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH] hv: hv_balloon: avoid memory leak on alloc_error of 2MB memory block
If num_ballooned is not 0, we shouldn't neglect the already-allocated 2MB memory block(s). Cc: K. Y. Srinivasan Cc: Signed-off-by: Dexuan Cui --- drivers/hv/hv_balloon.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 5e90c5d..cba2d3b 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1091,6 +1091,8 @@ static void balloon_up(struct work_struct *dummy) bool done = false; int i; + /* The host does balloon_up in 2MB. */ + WARN_ON(num_pages % PAGES_IN_2M != 0); /* * We will attempt 2M allocations. However, if we fail to @@ -,7 +1113,7 @@ static void balloon_up(struct work_struct *dummy) bl_resp, alloc_unit, &alloc_error); - if ((alloc_error) && (alloc_unit != 1)) { + if (alloc_error && (alloc_unit != 1) && num_ballooned == 0) { alloc_unit = 1; continue; } -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of 2MB memory block
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, November 24, 2014 13:18 PM > To: Dexuan Cui; gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > a...@canonical.com; KY Srinivasan > Cc: Haiyang Zhang > Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of > 2MB memory block > > On 11/24/2014 01:56 PM, Dexuan Cui wrote: > > If num_ballooned is not 0, we shouldn't neglect the already-allocated > 2MB > > memory block(s). > > > > Cc: K. Y. Srinivasan > > Cc: > > Signed-off-by: Dexuan Cui > > --- > > drivers/hv/hv_balloon.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c > > index 5e90c5d..cba2d3b 100644 > > --- a/drivers/hv/hv_balloon.c > > +++ b/drivers/hv/hv_balloon.c > > @@ -1091,6 +1091,8 @@ static void balloon_up(struct work_struct > *dummy) > > bool done = false; > > int i; > > > > + /* The host does balloon_up in 2MB. */ > > + WARN_ON(num_pages % PAGES_IN_2M != 0); > > > > /* > > * We will attempt 2M allocations. However, if we fail to > > @@ -,7 +1113,7 @@ static void balloon_up(struct work_struct > *dummy) > > bl_resp, alloc_unit, > > &alloc_error); > > > > - if ((alloc_error) && (alloc_unit != 1)) { > > + if (alloc_error && (alloc_unit != 1) && num_ballooned == 0) > { > > alloc_unit = 1; > > continue; > > } > > Before the change, we may retry the 4K allocation when part or all 2M > allocations were failed. This makes sense when memory is fragmented. But Yes, but all the partially-allocated 2MB memory blocks are lost(mem leak). > after the change, if part of 2M allocation were failed, we won't retry > 4K allocation. Is this expected? Hi Jason, The patch doesn't break the "try 2MB first; then try 4K" logic: With the change, we'll retry the 2MB allocation in the next iteration of the same while (!done) loop -- we expect this retry will cause "alloc_error && (alloc_unit != 1) && num_ballooned == 0" to be true, so we'll later try 4K allocation, as we did before. > Btw, can host just require 1M? If yes, should alloc_balloon_pages() set Hi KY, Can you please clarify this? You know the host much more than me. :-) > alloc_error if num_pages < alloc_unit for caller to catch this and retry > 4K allocation? -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of 2MB memory block
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, November 24, 2014 15:28 PM > To: Dexuan Cui; gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > a...@canonical.com; KY Srinivasan > Cc: Haiyang Zhang > Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of > 2MB memory block > > On 11/24/2014 02:08 PM, Dexuan Cui wrote: > >> -Original Message- > >> > From: Jason Wang [mailto:jasow...@redhat.com] > >> > Sent: Monday, November 24, 2014 13:18 PM > >> > To: Dexuan Cui; gre...@linuxfoundation.org; linux- > ker...@vger.kernel.org; > >> > driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > >> > a...@canonical.com; KY Srinivasan > >> > Cc: Haiyang Zhang > >> > Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on > alloc_error of > >> > 2MB memory block > >> > > >> > On 11/24/2014 01:56 PM, Dexuan Cui wrote: > >>> > > If num_ballooned is not 0, we shouldn't neglect the already- > allocated > >> > 2MB > >>> > > memory block(s). > >>> > > > >>> > > Cc: K. Y. Srinivasan > >>> > > Cc: > >>> > > Signed-off-by: Dexuan Cui > >>> > > --- > >>> > > drivers/hv/hv_balloon.c | 4 +++- > >>> > > 1 file changed, 3 insertions(+), 1 deletion(-) > >>> > > > >>> > > diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c > >>> > > index 5e90c5d..cba2d3b 100644 > >>> > > --- a/drivers/hv/hv_balloon.c > >>> > > +++ b/drivers/hv/hv_balloon.c > >>> > > @@ -1091,6 +1091,8 @@ static void balloon_up(struct > work_struct > >> > *dummy) > >>> > > bool done = false; > >>> > > int i; > >>> > > > >>> > > + /* The host does balloon_up in 2MB. */ > >>> > > + WARN_ON(num_pages % PAGES_IN_2M != 0); > >>> > > > >>> > > /* > >>> > >* We will attempt 2M allocations. However, if we fail to > >>> > > @@ -,7 +1113,7 @@ static void balloon_up(struct > work_struct > >> > *dummy) > >>> > > bl_resp, alloc_unit, > >>> > >&alloc_error); > >>> > > > >>> > > - if ((alloc_error) && (alloc_unit != 1)) { > >>> > > + if (alloc_error && (alloc_unit != 1) && > num_ballooned == 0) > >> > { > >>> > > alloc_unit = 1; > >>> > > continue; > >>> > > } > >> > > >> > Before the change, we may retry the 4K allocation when part or all 2M > >> > allocations were failed. This makes sense when memory is fragmented. > But > > Yes, but all the partially-allocated 2MB memory blocks are lost(mem leak). > > > >> > after the change, if part of 2M allocation were failed, we won't retry > >> > 4K allocation. Is this expected? > > Hi Jason, > > The patch doesn't break the "try 2MB first; then try 4K" logic: > > > > With the change, we'll retry the 2MB allocation in the next iteration of the > > same while (!done) loop -- we expect this retry will cause > > "alloc_error && (alloc_unit != 1) && num_ballooned == 0" to be true, > > so we'll later try 4K allocation, as we did before. > > If I read the code correctly, if part of 2M allocation fails. > alloc_balloon_pages() will have a non zero return value with alloc_error > set. Then it will match the following check: > > if ((alloc_error) || (num_ballooned == num_pages)) > { > > which will set done to true. So there's no second iteration of while > (!done) loop? Oh... I see the issue in my patch. Thanks for pointing this out, Jason! > Probably you need something like: > > if ((alloc_unit != 1) && (num_ballooned == 0)) { > alloc_unit = 1; > continue; > } > > if ((alloc_unit == 1) || (num_ballooned == num_pages)) { > ... > } Your code is good, except for one thing: alloc_balloon_pages() can return due to: if (bl_resp->hdr.size + sizeof(union dm_
RE: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of 2MB memory block
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, November 24, 2014 16:48 PM > To: Dexuan Cui; gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > a...@canonical.com; KY Srinivasan > Cc: Haiyang Zhang > Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error of > 2MB memory block > > On 11/24/2014 03:54 PM, Dexuan Cui wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Monday, November 24, 2014 15:28 PM > >> To: Dexuan Cui; gre...@linuxfoundation.org; linux- > ker...@vger.kernel.org; > >> driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > >> a...@canonical.com; KY Srinivasan > >> Cc: Haiyang Zhang > >> Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on alloc_error > of > >> 2MB memory block > >> > >> On 11/24/2014 02:08 PM, Dexuan Cui wrote: > >>>> -Original Message- > >>>>> From: Jason Wang [mailto:jasow...@redhat.com] > >>>>> Sent: Monday, November 24, 2014 13:18 PM > >>>>> To: Dexuan Cui; gre...@linuxfoundation.org; linux- > >> ker...@vger.kernel.org; > >>>>> driverdev-devel@linuxdriverproject.org; o...@aepfle.de; > >>>>> a...@canonical.com; KY Srinivasan > >>>>> Cc: Haiyang Zhang > >>>>> Subject: Re: [PATCH] hv: hv_balloon: avoid memory leak on > >> alloc_error of > >>>>> 2MB memory block > >>>>> > >>>>> On 11/24/2014 01:56 PM, Dexuan Cui wrote: > >>>>>>> If num_ballooned is not 0, we shouldn't neglect the already- > >> allocated > >>>>> 2MB > >>>>>>> memory block(s). > >>>>>>> > >>>>>>> Cc: K. Y. Srinivasan > >>>>>>> Cc: > >>>>>>> Signed-off-by: Dexuan Cui > >>>>>>> --- > >>>>>>> drivers/hv/hv_balloon.c | 4 +++- > >>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) > >>>>>>> > >>>>>>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c > >>>>>>> index 5e90c5d..cba2d3b 100644 > >>>>>>> --- a/drivers/hv/hv_balloon.c > >>>>>>> +++ b/drivers/hv/hv_balloon.c > >>>>>>> @@ -1091,6 +1091,8 @@ static void balloon_up(struct > >> work_struct > >>>>> *dummy) > >>>>>>> bool done = false; > >>>>>>> int i; > >>>>>>> > >>>>>>> + /* The host does balloon_up in 2MB. */ > >>>>>>> + WARN_ON(num_pages % PAGES_IN_2M != 0); > >>>>>>> > >>>>>>> /* > >>>>>>>* We will attempt 2M allocations. However, if we fail to > >>>>>>> @@ -,7 +1113,7 @@ static void balloon_up(struct > >> work_struct > >>>>> *dummy) > >>>>>>> bl_resp, alloc_unit, > >>>>>>>&alloc_error); > >>>>>>> > >>>>>>> - if ((alloc_error) && (alloc_unit != 1)) { > >>>>>>> + if (alloc_error && (alloc_unit != 1) && > >> num_ballooned == 0) > >>>>> { > >>>>>>> alloc_unit = 1; > >>>>>>> continue; > >>>>>>> } > >>>>> Before the change, we may retry the 4K allocation when part or all > 2M > >>>>> allocations were failed. This makes sense when memory is > fragmented. > >> But > >>> Yes, but all the partially-allocated 2MB memory blocks are lost(mem > leak). > >>> > >>>>> after the change, if part of 2M allocation were failed, we won't retry > >>>>> 4K allocation. Is this expected? > >>> Hi Jason, > >>> The patch doesn't break the "try 2MB first; then try 4K" logic: > >>> > >>> With the change, we'll retry the 2MB allocation in the next iteration of > the > >>> same while (!done) loop -- we expect this retry will cause
[PATCH v2] hv: hv_balloon: avoid memory leak on alloc_error of 2MB memory block
If num_ballooned is not 0, we shouldn't neglect the already-partially-allocated 2MB memory block(s). Cc: Jason Wang Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- v2: I fixed the logic error in v1, pointed by Jason Wang: In v1: in the case of partially-allocated 2MB, alloc_error is true, so we'll run "done = true" and hence we won't proceed with the next iteration of trying 4K allocation. I also changed the WARN_ON to WARN_ON_ONCE in case the host behavior changes in the future. drivers/hv/hv_balloon.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 5e90c5d..b958ded 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1087,10 +1087,12 @@ static void balloon_up(struct work_struct *dummy) struct dm_balloon_response *bl_resp; int alloc_unit; int ret; - bool alloc_error = false; + bool alloc_error; bool done = false; int i; + /* The host balloons pages in 2M granularity. */ + WARN_ON_ONCE(num_pages % PAGES_IN_2M != 0); /* * We will attempt 2M allocations. However, if we fail to @@ -1107,16 +1109,18 @@ static void balloon_up(struct work_struct *dummy) num_pages -= num_ballooned; + alloc_error = false; num_ballooned = alloc_balloon_pages(&dm_device, num_pages, bl_resp, alloc_unit, &alloc_error); - if ((alloc_error) && (alloc_unit != 1)) { + if (alloc_unit != 1 && num_ballooned == 0) { alloc_unit = 1; continue; } - if ((alloc_error) || (num_ballooned == num_pages)) { + if ((alloc_unit == 1 && alloc_error) || + (num_ballooned == num_pages)) { bl_resp->more_pages = 0; done = true; dm_device.state = DM_INITIALIZED; -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
> -Original Message- > From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On > Behalf Of Greg Kroah-Hartman > Sent: Thursday, November 27, 2014 11:03 AM > To: Vitaly Kuznetsov > Cc: de...@linuxdriverproject.org; Haiyang Zhang; linux- > ker...@vger.kernel.org > Subject: Re: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer > hypervisors > > On Wed, Nov 26, 2014 at 02:52:22PM +0100, Vitaly Kuznetsov wrote: > > When an SMP Hyper-V guest is running on top of 2012R2 Server and > secondary > > cpus are sent offline (with echo 0 > > /sys/devices/system/cpu/cpu$cpu/online) > > the system freeze is observed. This happens due to the fact that on newer > > hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are > distributed > > across all cpus (see init_vp_index() function in > drivers/hv/channel_mgmt.c) > > and on cpu offlining nobody reassigns them to CPU0. Prevent cpu > offlining > > when vmbus is loaded until the issue is fixed host-side. > > > > This patch also disables hibernation but it is OK as it is also broken (MCE > > error is hit on resume). Suspend still works. > > > > Tested with WS2008R2 and WS2012R2. > > > > Signed-off-by: Vitaly Kuznetsov > > --- > > drivers/hv/vmbus_drv.c | 19 +++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > > index 4d6b269..9a82249 100644 > > --- a/drivers/hv/vmbus_drv.c > > +++ b/drivers/hv/vmbus_drv.c > > @@ -32,6 +32,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -671,6 +672,13 @@ static void vmbus_isr(void) > > tasklet_schedule(&msg_dpc); > > } > > > > +#ifdef CONFIG_HOTPLUG_CPU > > +static int hyperv_cpu_disable(void) > > +{ > > + return -1; > > +} > > +#endif > > + > > /* > > * vmbus_bus_init -Main vmbus driver initialization routine. > > * > > @@ -711,6 +719,12 @@ static int vmbus_bus_init(int irq) > > if (ret) > > goto err_alloc; > > > > +#ifdef CONFIG_HOTPLUG_CPU > > + if ((vmbus_proto_version != VERSION_WS2008) && > > + (vmbus_proto_version != VERSION_WIN7)) > > + smp_ops.cpu_disable = hyperv_cpu_disable; > > +#endif > > + > > vmbus_request_offers(); > > > > return 0; > > @@ -964,6 +978,11 @@ static void __exit vmbus_exit(void) > > bus_unregister(&hv_bus); > > hv_cleanup(); > > acpi_bus_unregister_driver(&vmbus_acpi_driver); > > +#ifdef CONFIG_HOTPLUG_CPU > > + if ((vmbus_proto_version != VERSION_WS2008) && > > + (vmbus_proto_version != VERSION_WIN7)) > > + smp_ops.cpu_disable = native_cpu_disable; > > +#endif > > } > > #ifdef in a .c file is not a good idea to do if at all possible, please > only put this in one place, using a function call to "hide" the mess. > > greg k-h Hi Vitaly, The idea of the patch is good to me. I agree with Greg. BTW, maybe hv_cpu_hotplug_quirk() is a better name? Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Greg KH [mailto:gre...@linuxfoundation.org] > Sent: Thursday, November 27, 2014 7:54 AM > To: Dexuan Cui > Cc: linux-ker...@vger.kernel.org; driverdev-devel@linuxdriverproject.org; > o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan; > Haiyang Zhang; vkuzn...@redhat.com > Subject: Re: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer > failure > > On Tue, Nov 11, 2014 at 09:03:26PM -0800, Dexuan Cui wrote: > > In the case the user-space daemon crashes, hangs or is killed, we > > need to down the semaphore, otherwise, after the daemon starts next > > time, the obsolete data in fcopy_transaction.message or > > fcopy_transaction.fcopy_msg will be used immediately. > > > > Cc: K. Y. Srinivasan > > Signed-off-by: Dexuan Cui > > --- > > drivers/hv/hv_fcopy.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > > index 23b2ce2..177122a 100644 > > --- a/drivers/hv/hv_fcopy.c > > +++ b/drivers/hv/hv_fcopy.c > > @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct > *dummy) > > * process the pending transaction. > > */ > > fcopy_respond_to_host(HV_E_FAIL); > > + > > + /* In the case the user-space daemon crashes, hangs or is killed, we > > +* need to down the semaphore, otherwise, after the daemon starts > next > > +* time, the obsolete data in fcopy_transaction.message or > > +* fcopy_transaction.fcopy_msg will be used immediately. > > +*/ > > + if (down_trylock(&fcopy_transaction.read_sema)) > > + pr_debug("FCP: failed to acquire the semaphore\n"); > > Why is "FCP:" needed? pr_debug() should never need any type of prefix. > > Please fix. > > thanks, > > greg k-h Ok, I'll send a v2 soon. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2] hv: hv_fcopy: drop the obsolete message on transfer failure
In the case the user-space daemon crashes, hangs or is killed, we need to down the semaphore, otherwise, after the daemon starts next time, the obsolete data in fcopy_transaction.message or fcopy_transaction.fcopy_msg will be used immediately. Reviewed-by: Vitaly Kuznetsov Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- v2: I removed the "FCP" prefix as Greg asked. I also updated the output message a little: "FCP: failed to acquire the semaphore" --> "can not acquire the semaphore: it is benign" drivers/hv/hv_fcopy.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 23b2ce2..c518ad9 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct *dummy) * process the pending transaction. */ fcopy_respond_to_host(HV_E_FAIL); + + /* In the case the user-space daemon crashes, hangs or is killed, we +* need to down the semaphore, otherwise, after the daemon starts next +* time, the obsolete data in fcopy_transaction.message or +* fcopy_transaction.fcopy_msg will be used immediately. +*/ + if (down_trylock(&fcopy_transaction.read_sema)) + pr_debug("can not acquire the semaphore: it is benign\n"); + } static int fcopy_handle_handshake(u32 version) -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Thursday, November 27, 2014 15:15 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: Re: [PATCH v2] hv: hv_fcopy: drop the obsolete message on transfer > failure > - Original Message - > > In the case the user-space daemon crashes, hangs or is killed, we > > need to down the semaphore, otherwise, after the daemon starts next > > time, the obsolete data in fcopy_transaction.message or > > fcopy_transaction.fcopy_msg will be used immediately. > > > > Reviewed-by: Vitaly Kuznetsov > > Cc: K. Y. Srinivasan > > Signed-off-by: Dexuan Cui > > --- > > > > v2: I removed the "FCP" prefix as Greg asked. > > > > I also updated the output message a little: > > "FCP: failed to acquire the semaphore" --> > > "can not acquire the semaphore: it is benign" > > > > drivers/hv/hv_fcopy.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > > index 23b2ce2..c518ad9 100644 > > --- a/drivers/hv/hv_fcopy.c > > +++ b/drivers/hv/hv_fcopy.c > > @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct > *dummy) > > * process the pending transaction. > > */ > > fcopy_respond_to_host(HV_E_FAIL); > > + > > + /* In the case the user-space daemon crashes, hangs or is killed, we > > +* need to down the semaphore, otherwise, after the daemon starts > next > > +* time, the obsolete data in fcopy_transaction.message or > > +* fcopy_transaction.fcopy_msg will be used immediately. > > +*/ > > Looks still racy, what happens if the daemon start before down_trylock() > but after fcopy_respont_to_host() here? Jason, Thanks for pointing this out! IMO we can resolve this by adding down_trylock() in fcopy_release(). What's your opinion? > > > + if (down_trylock(&fcopy_transaction.read_sema)) > > + pr_debug("can not acquire the semaphore: it is benign\n"); > > typo > > + > > } Sorry -- what typo do you mean? Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Thursday, November 27, 2014 17:01 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: RE: [PATCH v2] hv: hv_fcopy: drop the obsolete message on transfer > failure > On Thu, Nov 27, 2014 at 4:50 PM, Dexuan Cui wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Thursday, November 27, 2014 15:15 PM > >> To: Dexuan Cui > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> Subject: Re: [PATCH v2] hv: hv_fcopy: drop the obsolete message on > >> transfer > >> failure > >> - Original Message - > >> > In the case the user-space daemon crashes, hangs or is killed, we > >> > need to down the semaphore, otherwise, after the daemon starts > >> next > >> > time, the obsolete data in fcopy_transaction.message or > >> > fcopy_transaction.fcopy_msg will be used immediately. > >> > > >> > Reviewed-by: Vitaly Kuznetsov > >> > Cc: K. Y. Srinivasan > >> > Signed-off-by: Dexuan Cui > >> > --- > >> > > >> > v2: I removed the "FCP" prefix as Greg asked. > >> > > >> > I also updated the output message a little: > >> > "FCP: failed to acquire the semaphore" --> > >> > "can not acquire the semaphore: it is benign" > >> > > >> > drivers/hv/hv_fcopy.c | 9 + > >> > 1 file changed, 9 insertions(+) > >> > > >> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > >> > index 23b2ce2..c518ad9 100644 > >> > --- a/drivers/hv/hv_fcopy.c > >> > +++ b/drivers/hv/hv_fcopy.c > >> > @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct > >> *dummy) > >> > * process the pending transaction. > >> > */ > >> > fcopy_respond_to_host(HV_E_FAIL); > >> > + > >> > + /* In the case the user-space daemon crashes, hangs or is > >> killed, we > >> > +* need to down the semaphore, otherwise, after the daemon > >> starts > >> next > >> > +* time, the obsolete data in fcopy_transaction.message or > >> > +* fcopy_transaction.fcopy_msg will be used immediately. > >> > +*/ > >> > >> Looks still racy, what happens if the daemon start before > >> down_trylock() > >> but after fcopy_respont_to_host() here? > > Jason, > > Thanks for pointing this out! > > IMO we can resolve this by adding down_trylock() in fcopy_release(). > > What's your opinion? > > > Looks better and need to cancel the timeout also here? OK, let me post a v3. > > > > > >> > >> > + if (down_trylock(&fcopy_transaction.read_sema)) > >> > + pr_debug("can not acquire the semaphore: it is > >> benign\n"); > >> > >> typo > >> > + > >> > } > > Sorry -- what typo do you mean? > > s/benign/begin/ ? I meant the issue(can't get the semaphore) is benign. I think we can just remove the message, as KY suggested. Instead, I'll add a comment for it. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
In the case the user-space daemon crashes, hangs or is killed, we need to down the semaphore, otherwise, after the daemon starts next time, the obsolete data in fcopy_transaction.message or fcopy_transaction.fcopy_msg will be used immediately. Cc: Jason Wang Cc: Vitaly Kuznetsov Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- v2: I removed the "FCP" prefix as Greg asked. I also updated the output message a little: "FCP: failed to acquire the semaphore" --> "can not acquire the semaphore: it is benign" v3: I added the code in fcopy_release() as Jason Wang suggested. I removed the pr_debug (it isn't so meaningful)and added a comment instead. drivers/hv/hv_fcopy.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 23b2ce2..faa6ba6 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy) * process the pending transaction. */ fcopy_respond_to_host(HV_E_FAIL); + + /* In the case the user-space daemon crashes, hangs or is killed, we +* need to down the semaphore, otherwise, after the daemon starts next +* time, the obsolete data in fcopy_transaction.message or +* fcopy_transaction.fcopy_msg will be used immediately. +* +* NOTE: fcopy_read() happens to get the semaphore (very rare)? We're +* still OK, because we've reported the failure to the host. +*/ + if (down_trylock(&fcopy_transaction.read_sema)) + ; + } static int fcopy_handle_handshake(u32 version) @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct file *f) */ in_hand_shake = true; opened = false; + + if (cancel_delayed_work_sync(&fcopy_work)) { + /* We haven't up()-ed the semaphore(very rare)? */ + if (down_trylock(&fcopy_transaction.read_sema)) + ; + fcopy_respond_to_host(HV_E_FAIL); + } return 0; } -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Friday, November 28, 2014 14:47 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui wrote: > > In the case the user-space daemon crashes, hangs or is killed, we > > need to down the semaphore, otherwise, after the daemon starts next > > time, the obsolete data in fcopy_transaction.message or > > fcopy_transaction.fcopy_msg will be used immediately. > > > > Cc: Jason Wang > > Cc: Vitaly Kuznetsov > > Cc: K. Y. Srinivasan > > Signed-off-by: Dexuan Cui > > --- > > > > v2: I removed the "FCP" prefix as Greg asked. > > > > I also updated the output message a little: > > "FCP: failed to acquire the semaphore" --> > > "can not acquire the semaphore: it is benign" > > > > v3: I added the code in fcopy_release() as Jason Wang suggested. > > I removed the pr_debug (it isn't so meaningful)and added a > > comment instead. > > > > drivers/hv/hv_fcopy.c | 19 +++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > > index 23b2ce2..faa6ba6 100644 > > --- a/drivers/hv/hv_fcopy.c > > +++ b/drivers/hv/hv_fcopy.c > > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct > > *dummy) > > * process the pending transaction. > > */ > > fcopy_respond_to_host(HV_E_FAIL); > > + > > + /* In the case the user-space daemon crashes, hangs or is killed, we > > +* need to down the semaphore, otherwise, after the daemon starts > > next > > +* time, the obsolete data in fcopy_transaction.message or > > +* fcopy_transaction.fcopy_msg will be used immediately. > > +* > > +* NOTE: fcopy_read() happens to get the semaphore (very rare)? > > We're > > +* still OK, because we've reported the failure to the host. > > +*/ > > + if (down_trylock(&fcopy_transaction.read_sema)) > > + ; > > Sorry, I'm not quite understand how if () ; can help here. > > Btw, a question not relate to this patch. > > What happens if a daemon is resume from SIGSTOP and expires the check > here? Hi Jason, My idea is: here we need down_trylock(), but in case we can't get the semaphore, it's OK anyway: Scenario 1): 1.1: when the daemon is blocked on the pread(), the daemon receives SIGSTOP; 1.2: the host user runs the PowerShell Copy-VMFile command; 1.3.1: the driver reports the failure to the host user in 5s and 1.3.2: the driver down()-es the semaphore; 1.4: the daemon receives SIGCONT and it will be still blocked on the pread(). Without the down_trylock(), in 1.4, the daemon can receive an obsolete message. NOTE: in this scenario, the daemon is not killed. Scenario 2): In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 and do down() in fcopy_read(), it will receive the message but: the driver has reported the failure to the host user and the driver's 1.3.2 can't get the semaphore -- IMO this is acceptably OK, though in the VM, an incomplete file will be left there. BTW, I think in the daemon's hv_start_fcopy() we should add a close(target_fd) before open()-ing a new one. > > > > + > > } > > > > static int fcopy_handle_handshake(u32 version) > > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, > > struct file *f) > > */ > > in_hand_shake = true; > > opened = false; > > + > > + if (cancel_delayed_work_sync(&fcopy_work)) { > > + /* We haven't up()-ed the semaphore(very rare)? */ > > + if (down_trylock(&fcopy_transaction.read_sema)) > > + ; > > And this. Scenario 3): When the daemon exits(e.g., SIGKILL received), if there is a fcopy_work pending (scheduled but not start to run yet), we should cancel the work (as you suggested) and down() the semaphore, otherwise, the obsolete message will be received by the next instance of the daemon. Scenario 4): in the driver's hv_fcopy_onchannelcallback(): schedule_delayed_work(&fcopy_work, 5*HZ); > if fcopy_release() is running on another vcpu, just before the next line? fcopy_send_data(); In this case, fcopy_release() can cancel fcopy_w
RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Friday, November 28, 2014 18:13 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Friday, November 28, 2014 14:47 PM > >> To: Dexuan Cui > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > >> transfer > >> failure > >> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui > >> wrote: > >> > In the case the user-space daemon crashes, hangs or is killed, we > >> > need to down the semaphore, otherwise, after the daemon starts > >> next > >> > time, the obsolete data in fcopy_transaction.message or > >> > fcopy_transaction.fcopy_msg will be used immediately. > >> > > >> > Cc: Jason Wang > >> > Cc: Vitaly Kuznetsov > >> > Cc: K. Y. Srinivasan > >> > Signed-off-by: Dexuan Cui > >> > --- > >> > > >> > v2: I removed the "FCP" prefix as Greg asked. > >> > > >> > I also updated the output message a little: > >> > "FCP: failed to acquire the semaphore" --> > >> > "can not acquire the semaphore: it is benign" > >> > > >> > v3: I added the code in fcopy_release() as Jason Wang suggested. > >> > I removed the pr_debug (it isn't so meaningful)and added a > >> > comment instead. > >> > > >> > drivers/hv/hv_fcopy.c | 19 +++ > >> > 1 file changed, 19 insertions(+) > >> > > >> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > >> > index 23b2ce2..faa6ba6 100644 > >> > --- a/drivers/hv/hv_fcopy.c > >> > +++ b/drivers/hv/hv_fcopy.c > >> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct > >> > *dummy) > >> > * process the pending transaction. > >> > */ > >> > fcopy_respond_to_host(HV_E_FAIL); > >> > + > >> > + /* In the case the user-space daemon crashes, hangs or is > >> killed, we > >> > +* need to down the semaphore, otherwise, after the daemon > >> starts > >> > next > >> > +* time, the obsolete data in fcopy_transaction.message or > >> > +* fcopy_transaction.fcopy_msg will be used immediately. > >> > +* > >> > +* NOTE: fcopy_read() happens to get the semaphore (very rare)? > >> > We're > >> > +* still OK, because we've reported the failure to the host. > >> > +*/ > >> > + if (down_trylock(&fcopy_transaction.read_sema)) > >> > + ; > >> > >> Sorry, I'm not quite understand how if () ; can help here. > >> > >> Btw, a question not relate to this patch. > >> > >> What happens if a daemon is resume from SIGSTOP and expires the > >> check > >> here? > > Hi Jason, > > My idea is: here we need down_trylock(), but in case we can't get the > > semaphore, it's OK anyway: > > > > Scenario 1): > > 1.1: when the daemon is blocked on the pread(), the daemon receives > > SIGSTOP; > > 1.2: the host user runs the PowerShell Copy-VMFile command; > > 1.3.1: the driver reports the failure to the host user in 5s and > > 1.3.2: the driver down()-es the semaphore; > > 1.4: the daemon receives SIGCONT and it will be still blocked on the > > pread(). > > Without the down_trylock(), in 1.4, the daemon can receive an > > obsolete message. > > NOTE: in this scenario, the daemon is not killed. > > > > Scenario 2): > > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 > > and > > do down() in fcopy_read(), it will receive the message but:
RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, December 1, 2014 16:23 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Friday, November 28, 2014 18:13 PM > >> To: Dexuan Cui > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > >> transfer > >> failure > >> On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui > >> wrote: > >> >> -Original Message- > >> >> From: Jason Wang [mailto:jasow...@redhat.com] > >> >> Sent: Friday, November 28, 2014 14:47 PM > >> >> To: Dexuan Cui > >> >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> >> driverdev- > >> >> de...@linuxdriverproject.org; o...@aepfle.de; > >> a...@canonical.com; KY > >> >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> >> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message > >> on > >> >> transfer > >> >> failure > >> >> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui > >> > >> >> wrote: > >> >> > In the case the user-space daemon crashes, hangs or is > >> killed, we > >> >> > need to down the semaphore, otherwise, after the daemon starts > >> >> next > >> >> > time, the obsolete data in fcopy_transaction.message or > >> >> > fcopy_transaction.fcopy_msg will be used immediately. > >> >> > > >> >> > Cc: Jason Wang > >> >> > Cc: Vitaly Kuznetsov > >> >> > Cc: K. Y. Srinivasan > >> >> > Signed-off-by: Dexuan Cui > >> >> > --- > >> >> > > >> >> > v2: I removed the "FCP" prefix as Greg asked. > >> >> > > >> >> > I also updated the output message a little: > >> >> > "FCP: failed to acquire the semaphore" --> > >> >> > "can not acquire the semaphore: it is benign" > >> >> > > >> >> > v3: I added the code in fcopy_release() as Jason Wang > >> suggested. > >> >> > I removed the pr_debug (it isn't so meaningful)and added a > >> >> > comment instead. > >> >> > > >> >> > drivers/hv/hv_fcopy.c | 19 +++ > >> >> > 1 file changed, 19 insertions(+) > >> >> > > >> >> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > >> >> > index 23b2ce2..faa6ba6 100644 > >> >> > --- a/drivers/hv/hv_fcopy.c > >> >> > +++ b/drivers/hv/hv_fcopy.c > >> >> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct > >> work_struct > >> >> > *dummy) > >> >> > * process the pending transaction. > >> >> > */ > >> >> > fcopy_respond_to_host(HV_E_FAIL); > >> >> > + > >> >> > + /* In the case the user-space daemon crashes, hangs or is > >> >> killed, we > >> >> > +* need to down the semaphore, otherwise, after the daemon > >> >> starts > >> >> > next > >> >> > +* time, the obsolete data in fcopy_transaction.message or > >> >> > +* fcopy_transaction.fcopy_msg will be used immediately. > >> >> > +* > >> >> > +* NOTE: fcopy_read() happens to get the semaphore (very > >> rare)? > >> >> > We're > >> >> > +* still OK, because we've reported the failure to the host. > >> >> > +*/ > >> >> > + if (down_t
RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, December 1, 2014 18:18 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > > > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Monday, December 1, 2014 16:23 PM > >> To: Dexuan Cui > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > >> transfer > >> failure > >> On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui > >> wrote: > >> >> -Original Message- > >> >> From: Jason Wang [mailto:jasow...@redhat.com] > >> >> Sent: Friday, November 28, 2014 18:13 PM > >> >> To: Dexuan Cui > >> >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > >> >> driverdev- > >> >> de...@linuxdriverproject.org; o...@aepfle.de; > >> a...@canonical.com; KY > >> >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> >> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message > >> on > >> >> transfer > >> >> failure > >> >> On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui > >> > >> >> wrote: > >> >> >> -Original Message- > >> >> >> From: Jason Wang [mailto:jasow...@redhat.com] > >> >> >> Sent: Friday, November 28, 2014 14:47 PM > >> >> >> To: Dexuan Cui > >> >> >> Cc: gre...@linuxfoundation.org; > >> linux-ker...@vger.kernel.org; > >> >> >> driverdev- > >> >> >> de...@linuxdriverproject.org; o...@aepfle.de; > >> >> a...@canonical.com; KY > >> >> >> Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > >> >> >> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete > >> message > >> >> on > >> >> >> transfer > >> >> >> failure > >> >> >> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui > >> >> > >> >> >> wrote: > >> >> >> > In the case the user-space daemon crashes, hangs or is > >> >> killed, we > >> >> >> > need to down the semaphore, otherwise, after the daemon > >> starts > >> >> >> next > >> >> >> > time, the obsolete data in fcopy_transaction.message or > >> >> >> > fcopy_transaction.fcopy_msg will be used immediately. > >> >> >> > > >> >> >> > Cc: Jason Wang > >> >> >> > Cc: Vitaly Kuznetsov > >> >> >> > Cc: K. Y. Srinivasan > >> >> >> > Signed-off-by: Dexuan Cui > >> >> >> > --- > >> >> >> > > >> >> >> > v2: I removed the "FCP" prefix as Greg asked. > >> >> >> > > >> >> >> > I also updated the output message a little: > >> >> >> > "FCP: failed to acquire the semaphore" --> > >> >> >> > "can not acquire the semaphore: it is benign" > >> >> >> > > >> >> >> > v3: I added the code in fcopy_release() as Jason Wang > >> >> suggested. > >> >> >> > I removed the pr_debug (it isn't so meaningful)and > >> added a > >> >> >> > comment instead. > >> >> >> > > >> >> >> > drivers/hv/hv_fcopy.c | 19 +++ > >> >> >> > 1 file changed, 19 insertions(+) > >> >> >> > > >> >> >> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > >> >> >> >
RE: [PATCH v2] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Monday, December 1, 2014 18:53 PM > To: KY Srinivasan; Haiyang Zhang > Cc: de...@linuxdriverproject.org; Greg Kroah-Hartman; linux- > ker...@vger.kernel.org; Dexuan Cui > Subject: [PATCH v2] Drivers: hv: vmbus: prevent cpu offlining on newer > hypervisors > > When an SMP Hyper-V guest is running on top of 2012R2 Server and > secondary > cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online) > the system freeze is observed. This happens due to the fact that on newer > hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed > across all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c) > and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining > when vmbus is loaded until the issue is fixed host-side. > > This patch also disables hibernation but it is OK as it is also broken (MCE > error is hit on resume). Suspend still works. > > Tested with WS2008R2 and WS2012R2. > > Signed-off-by: Vitaly Kuznetsov > > --- > Changes since v1: > - introduce hv_cpu_hotplug_quirk() function to not spread #ifdefs [Greg KH] > - add pr_notice() message "hv_vmbus: CPU offlining is not supported by > hypervisor" > --- > drivers/hv/vmbus_drv.c | 33 + > 1 file changed, 33 insertions(+) > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index 4d6b269..2e6b38e 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -32,6 +32,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -671,6 +672,36 @@ static void vmbus_isr(void) > tasklet_schedule(&msg_dpc); > } > > +#ifdef CONFIG_HOTPLUG_CPU > +static int hyperv_cpu_disable(void) > +{ > + return -1; > +} > + > +static void hv_cpu_hotplug_quirk(bool vmbus_loaded) > +{ > + /* > + * Offlining a CPU when running on newer hypervisors (WS2012R2, > Win8, > + * ...) is not supported at this moment as channel interrupts are > + * distributed across all of them. > + */ > + > + if ((vmbus_proto_version == VERSION_WS2008) || > + (vmbus_proto_version == VERSION_WIN7)) > + return; > + > + if (vmbus_loaded) { > + smp_ops.cpu_disable = hyperv_cpu_disable; > + pr_notice("CPU offlining is not supported by hypervisor"); > + } else > + smp_ops.cpu_disable = native_cpu_disable; > +} > +#else > +static void hv_cpu_hotplug_quirk(bool vmbus_loaded) > +{ > +} > +#endif > + > /* > * vmbus_bus_init -Main vmbus driver initialization routine. > * > @@ -711,6 +742,7 @@ static int vmbus_bus_init(int irq) > if (ret) > goto err_alloc; > > + hv_cpu_hotplug_quirk(true); > vmbus_request_offers(); > > return 0; > @@ -964,6 +996,7 @@ static void __exit vmbus_exit(void) > bus_unregister(&hv_bus); > hv_cleanup(); > acpi_bus_unregister_driver(&vmbus_acpi_driver); > + hv_cpu_hotplug_quirk(false); > } Acked-by: Dexuan Cui ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
> -Original Message- > From: KY Srinivasan > Sent: Monday, December 1, 2014 23:55 PM > To: Dexuan Cui; Jason Wang > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > vkuzn...@redhat.com; Haiyang Zhang > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > > > > > -----Original Message- > > From: Dexuan Cui > > Sent: Monday, December 1, 2014 3:01 AM > > To: Jason Wang > > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY > > Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > > failure > > > > > -Original Message- > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > Sent: Monday, December 1, 2014 18:18 PM > > > To: Dexuan Cui > > > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > > > driverdev- de...@linuxdriverproject.org; o...@aepfle.de; > > > a...@canonical.com; KY Srinivasan; vkuzn...@redhat.com; Haiyang > Zhang > > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > > > transfer failure > > > > > > > > > > > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui > > wrote: > > > >> -Original Message- > > > >> From: Jason Wang [mailto:jasow...@redhat.com] > > > >> Sent: Monday, December 1, 2014 16:23 PM > > > >> To: Dexuan Cui > > > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > > > >> driverdev- > > > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > > > >> KY Srinivasan; vkuzn...@redhat.com; Haiyang Zhang > > > >> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > > > >> transfer failure On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui > > > >> > > > >> wrote: > > > >> >> -Original Message- > > > >> >> From: Jason Wang [mailto:jasow...@redhat.com] >> Sent: > > > >> Friday, November 28, 2014 18:13 PM >> To: Dexuan Cui >> Cc: > > > >> gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; >> > > > >> driverdev- >> de...@linuxdriverproject.org; o...@aepfle.de; > > > >> a...@canonical.com; KY >> Srinivasan; vkuzn...@redhat.com; > Haiyang > > > >> Zhang >> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete > > > >> message on >> transfer >> failure >> On Fri, Nov 28, 2014 at > > > >> 4:36 PM, Dexuan Cui >> wrote: > > > >> >> >> -Original Message- >> >> From: Jason Wang > > > >> [mailto:jasow...@redhat.com] >> >> Sent: Friday, November 28, > > > >> 2014 14:47 PM >> >> To: Dexuan Cui >> >> Cc: > > > >> gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; >> >> > > > >> driverdev- >> >> de...@linuxdriverproject.org; o...@aepfle.de; > > > >> >> a...@canonical.com; KY >> >> Srinivasan; vkuzn...@redhat.com; > > > >> Haiyang Zhang >> >> Subject: Re: [PATCH v3] hv: hv_fcopy: drop > > > >> the obsolete message >> on >> >> transfer >> >> failure >> > > > >> >> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui >> > > > >> >> >> wrote: > > > >> >> >> > In the case the user-space daemon crashes, hangs or is > > > >> >> killed, we >> >> > need to down the semaphore, otherwise, > > > >> after the daemon starts >> >> next >> >> > time, the obsolete > > > >> data in fcopy_transaction.message or >> >> > > > > >> fcopy_transaction.fcopy_msg will be used immediately. > > > >> >> >> > > > > >> >> >> > Cc: Jason Wang >> >> > Cc: > > > >> Vitaly Kuznetsov >> >> > Cc: K. Y. > > > >> Srinivasan >> >> > Signed-off-by: Dexuan Cui > > > >> >> >> > --- >
[PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
Currently IPv6-only-injection doesn't work because the daemon doesn't parse any IPv6 information at all once it finds the dhcp_enabled flag is true. But according to the Hyper-v host team, the flag is only for IPv4. In the case the host only injects 1 IPv6 address, the dhcp flag is true, but we shouldn't ignore the IPv6 address and we should pass BOOTPROTO=none to the distro-specific script hv_set_ipconfig. Tested in Ubuntu 14.10 and RHEL7. Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- tools/hv/hv_kvp_daemon.c | 47 +++ 1 file changed, 31 insertions(+), 16 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 6a6432a..6ef6c04 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1145,6 +1145,9 @@ static int kvp_write_file(FILE *f, char *s1, char *s2, char *s3) } +/* How many ipv6 addresses the host is trying to inject? */ +static int num_ipv6_injected; + static int process_ip_string(FILE *f, char *ip_string, int type) { int error = 0; @@ -1190,6 +1193,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type) switch (type) { case IPADDR: snprintf(str, sizeof(str), "%s", "IPV6ADDR"); + num_ipv6_injected++; break; case NETMASK: snprintf(str, sizeof(str), "%s", "IPV6NETMASK"); @@ -1308,27 +1312,12 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; - if (new_val->dhcp_enabled) { - error = kvp_write_file(file, "BOOTPROTO", "", "dhcp"); - if (error) - goto setval_error; - - /* -* We are done!. -*/ - goto setval_done; - - } else { - error = kvp_write_file(file, "BOOTPROTO", "", "none"); - if (error) - goto setval_error; - } - /* * Write the configuration for ipaddress, netmask, gateway and * name servers. */ + num_ipv6_injected = 0; error = process_ip_string(file, (char *)new_val->ip_addr, IPADDR); if (error) goto setval_error; @@ -1345,6 +1334,32 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; + /* +* Here "dhcp_enabled" is only for IPv4 according to Hyper-V host team. +* +* In the case the host only injects 1 IPv6 address: +* new_val->dhcp_enabled is true, but we can't pass BOOTPROTO=dhcp to +* the script hv_set_ifconfig, because in some distros (like RHEL7) +* BOOTPROTO=dhcp has a special meaning in the config file (e.g., +* /etc/sysconfig/network-scripts/ifcfg-eth0): the network init program +* ignores any static IP addr information once there is +* BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. +* +* In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect +* Ubuntu because it's ignored by the Ubuntu version of +* hv_set_ifconfig and it doesn't seem to have special meaning in +* Ubuntu. +*/ + if (new_val->dhcp_enabled && num_ipv6_injected == 0) { + error = kvp_write_file(file, "BOOTPROTO", "", "dhcp"); + if (error) + goto setval_error; + } else { + error = kvp_write_file(file, "BOOTPROTO", "", "none"); + if (error) + goto setval_error; + } + setval_done: fclose(file); -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Tuesday, December 9, 2014 23:48 PM > To: KY Srinivasan > Cc: Haiyang Zhang; de...@linuxdriverproject.org; linux- > ker...@vger.kernel.org; Dexuan Cui > Subject: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup > > When someone does 'make' in tools/hv/ issues appear: > - hv_fcopy_daemon is not being built; > - lots of compiler warnings. > > This is just a cleanup. Compile-tested by myself on top of linux-next/master. > > Piggyback this series and send "[PATCH 5/5] Tools: hv: do not add redundant > '/' > in hv_start_fcopy()" > > Vitaly Kuznetsov (5): > Tools: hv: add mising fcopyd to the Makefile > Tools: hv: remove unused bytes_written from kvp_update_file() > Tools: hv: address compiler warnings for hv_kvp_daemon.c > Tools: hv: address compiler warnings for hv_fcopy_daemon.c > Tools: hv: do not add redundant '/' in hv_start_fcopy() > > tools/hv/Makefile | 4 ++-- > tools/hv/hv_fcopy_daemon.c | 10 ++ > tools/hv/hv_kvp_daemon.c | 29 +---- > 3 files changed, 17 insertions(+), 26 deletions(-) > > -- > 1.9.3 Hi Vitaly, Thanks for the patchset! Acked-by: Dexuan Cui PS, I added Greg into the TO list. The hv code in drivers/hv/ and tools/hv/ usually has to go into Greg's tree first. -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Tuesday, December 9, 2014 21:06 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work > .. > > +* Here "dhcp_enabled" is only for IPv4 according to Hyper-V host > team. > > +* > > +* In the case the host only injects 1 IPv6 address: > > +* new_val->dhcp_enabled is true, but we can't pass > BOOTPROTO=dhcp to > > +* the script hv_set_ifconfig, because in some distros (like RHEL7) > > +* BOOTPROTO=dhcp has a special meaning in the config file (e.g., > > +* /etc/sysconfig/network-scripts/ifcfg-eth0): the network init > program > > +* ignores any static IP addr information once there is > > +* BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. > > +* > > +* In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect > > +* Ubuntu because it's ignored by the Ubuntu version of > > +* hv_set_ifconfig and it doesn't seem to have special meaning in > > +* Ubuntu. > > +*/ > > I just checked and adding "IPV6ADDR=something" when "BOOTPROTO=dhcp" > works for me with both RHEL7 and Fedora21. It doesn't work in my side. :-( Running 'ifup eth0' shows some errors(I use "set -x") ... + /sbin/dhclient -H localhost -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. done. I'm trying to find out the cause. > Other than that I think bringing distribution specifics into kernel.git > is not a good idea. /etc/sysconfig/network-scripts/ifcfg-* format is > distro-specific and not all Linux distros support it. Moreover, I agree. > different distros can treat setting differently. I think it was wrong to > stick to this format in kvp daemon from very beginning. We can also think the current format used in kvp daemon is already distro-agnostic -- it just happens to look like the style of network config file used in RHEL :-) > > As a solution I would suggest doing the following: kvp daemon writes all > received request details in distro-agnostic format in some temporary > place and then calls distro-specific script to set things up. Actually, > we already have such script: tools/hv/hv_set_ifconfig.sh Yeah, this is exactly what we already have today. > As for this bug I propose the following: remove skipping all > IPADDR/MASK/... settings in case of "BOOTPROTO=dhcp" and let > distro-specific script deal with the rest. > -- > Vitaly OK, so the patch would be 1-line only: diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 22b0764..53fdaad 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1314,10 +1314,8 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) goto setval_error; /* -* We are done!. +* We are not done... TODO: add comment here. */ - goto setval_done; - } else { error = kvp_write_file(file, "BOOTPROTO", "", "none"); if (error) I'll send out a v2 after I resolve the "grep ... Permission dinied" issue. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
Thanks, -- Dexuan > -Original Message- > From: Dexuan Cui > Sent: Wednesday, December 10, 2014 15:34 PM > To: 'Vitaly Kuznetsov' > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > Subject: RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work > > > -Original Message- > > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > > Sent: Tuesday, December 9, 2014 21:06 PM > > To: Dexuan Cui > > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > > jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > > Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work > > .. > > > + * Here "dhcp_enabled" is only for IPv4 according to Hyper-V host > > team. > > > + * > > > + * In the case the host only injects 1 IPv6 address: > > > + * new_val->dhcp_enabled is true, but we can't pass > > BOOTPROTO=dhcp to > > > + * the script hv_set_ifconfig, because in some distros (like RHEL7) > > > + * BOOTPROTO=dhcp has a special meaning in the config file (e.g., > > > + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init > > program > > > + * ignores any static IP addr information once there is > > > + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. > > > + * > > > + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect > > > + * Ubuntu because it's ignored by the Ubuntu version of > > > + * hv_set_ifconfig and it doesn't seem to have special meaning in > > > + * Ubuntu. > > > + */ > > > > I just checked and adding "IPV6ADDR=something" when > "BOOTPROTO=dhcp" > > works for me with both RHEL7 and Fedora21. > It doesn't work in my side. :-( > Running 'ifup eth0' shows some errors(I use "set -x") > ... > + /sbin/dhclient -H localhost -1 -q -lf > /var/lib/dhclient/dhclient--eth0.lease -pf > /var/run/dhclient-eth0.pid eth0 > grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. BTW, I run with root, and 'chown 777 /etc/sysconfig/network-scripts/ifcfg-eth0" doesn't help. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
> -Original Message- > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > Sent: Wednesday, December 10, 2014 17:08 PM > To: Dexuan Cui > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work > > Dexuan Cui writes: > > > Thanks, > > -- Dexuan > > > >> -Original Message- > >> From: Dexuan Cui > >> Sent: Wednesday, December 10, 2014 15:34 PM > >> To: 'Vitaly Kuznetsov' > >> Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > >> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > >> jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > >> Subject: RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection > work > >> > >> > -Original Message- > >> > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] > >> > Sent: Tuesday, December 9, 2014 21:06 PM > >> > To: Dexuan Cui > >> > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; > driverdev- > >> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > >> > jasow...@redhat.com; KY Srinivasan; Haiyang Zhang > >> > Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection > work > >> > .. > >> > > + * Here "dhcp_enabled" is only for IPv4 according to Hyper-V > >> > > host > >> > team. > >> > > + * > >> > > + * In the case the host only injects 1 IPv6 address: > >> > > + * new_val->dhcp_enabled is true, but we can't pass > >> > BOOTPROTO=dhcp to > >> > > + * the script hv_set_ifconfig, because in some distros (like > >> > > RHEL7) > >> > > + * BOOTPROTO=dhcp has a special meaning in the config file > >> > > (e.g., > >> > > + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init > >> > program > >> > > + * ignores any static IP addr information once there is > >> > > + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. > >> > > + * > >> > > + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't > >> > > affect > >> > > + * Ubuntu because it's ignored by the Ubuntu version of > >> > > + * hv_set_ifconfig and it doesn't seem to have special meaning > >> > > in > >> > > + * Ubuntu. > >> > > + */ > >> > > >> > I just checked and adding "IPV6ADDR=something" when > >> "BOOTPROTO=dhcp" > >> > works for me with both RHEL7 and Fedora21. > >> It doesn't work in my side. :-( > >> Running 'ifup eth0' shows some errors(I use "set -x") > >> ... > >> + /sbin/dhclient -H localhost -1 -q -lf > >> /var/lib/dhclient/dhclient--eth0.lease > -pf > >> /var/run/dhclient-eth0.pid eth0 > >> grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. > > BTW, I run with root, and > > 'chown 777 /etc/sysconfig/network-scripts/ifcfg-eth0" doesn't help. > > > > s,chown,chmod, :-) But it won't help in case of SELinux mislabeling. > > > Thanks, > > -- Dexuan > > -- > Vitaly Hi Vitally, I think you're correct: BOOTPROTO=dhcp + "no IPADDR" + IPV6ADDR in RHEL7's /etc/sysconfig/network-scripts/ifcfg-eth0 works for me too. My previous "grep: ... ifcfg-eth0: Permission denied" issue can't repro now. Maybe it's because I messed up the config file somehow(?). I'll send out a v2 patch according to your suggestion. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work
In the case the host only injects an IPv6 address, the dhcp_enabled flag is true (it's only for IPv4 according to Hyper-V host team), but we still need to proceed to parse the IPv6 information. Cc: Vitaly Kuznetsov Cc: K. Y. Srinivasan Signed-off-by: Dexuan Cui --- v2: removed the distro-specific logic as Vitaly suggested. tools/hv/hv_kvp_daemon.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 6a6432a..4b3ee35 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1308,16 +1308,17 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; + /* +* The dhcp_enabled flag is only for IPv4. In the case the host only +* injects an IPv6 address, the flag is true, but we still need to +* proceed to parse and pass the IPv6 information to the +* disto-specific script hv_set_ifconfig. +*/ if (new_val->dhcp_enabled) { error = kvp_write_file(file, "BOOTPROTO", "", "dhcp"); if (error) goto setval_error; - /* -* We are done!. -*/ - goto setval_done; - } else { error = kvp_write_file(file, "BOOTPROTO", "", "none"); if (error) @@ -1345,7 +1346,6 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; -setval_done: fclose(file); /* -- 1.9.1 ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
RE: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work
> -Original Message- > From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On > Behalf Of Dexuan Cui > Sent: Wednesday, December 10, 2014 19:33 PM > To: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; vkuzn...@redhat.com; o...@aepfle.de; > a...@canonical.com; jasow...@redhat.com; KY Srinivasan > Cc: Haiyang Zhang > Subject: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work > > In the case the host only injects an IPv6 address, the dhcp_enabled flag is > true (it's only for IPv4 according to Hyper-V host team), but we still need to > proceed to parse the IPv6 information. > > Cc: Vitaly Kuznetsov > Cc: K. Y. Srinivasan > Signed-off-by: Dexuan Cui > --- > > v2: removed the distro-specific logic as Vitaly suggested. > > tools/hv/hv_kvp_daemon.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c > index 6a6432a..4b3ee35 100644 > --- a/tools/hv/hv_kvp_daemon.c > +++ b/tools/hv/hv_kvp_daemon.c > @@ -1308,16 +1308,17 @@ static int kvp_set_ip_info(char *if_name, struct > hv_kvp_ipaddr_value *new_val) > if (error) > goto setval_error; > > + /* > + * The dhcp_enabled flag is only for IPv4. In the case the host only > + * injects an IPv6 address, the flag is true, but we still need to > + * proceed to parse and pass the IPv6 information to the > + * disto-specific script hv_set_ifconfig. > + */ > if (new_val->dhcp_enabled) { > error = kvp_write_file(file, "BOOTPROTO", "", "dhcp"); > if (error) > goto setval_error; > > - /* > - * We are done!. > - */ > - goto setval_done; > - > } else { > error = kvp_write_file(file, "BOOTPROTO", "", "none"); > if (error) > @@ -1345,7 +1346,6 @@ static int kvp_set_ip_info(char *if_name, struct > hv_kvp_ipaddr_value *new_val) > if (error) > goto setval_error; > > -setval_done: > fclose(file); > > /* > -- > 1.9.1 Hi Vitaly, Can you please ACK the v2 patch? Or, please let me know if you have new comments. Thanks, -- Dexuan ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel