RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic

2018-07-11 Thread Sunil Muthuswamy
Thanks, Michael. In which branch should I fix these now that the changes have
been merged with the char-misc-next branch?

Comments inline.

> -----Original Message-----
> From: Michael Kelley (EOSG)
> Sent: Tuesday, July 10, 2018 6:05 PM
> To: KY Srinivasan ; gre...@linuxfoundation.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; Stephen Hemminger
> ; vkuzn...@redhat.com
> Cc: Sunil Muthuswamy 
> Subject: RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump
> over Hyper-V during panic
> 
> From k...@linuxonhyperv.com   Sent: Saturday,
> July 7, 2018 7:57 PM
> >
> > From: Sunil Muthuswamy 
> >
> > In the VM mode on Hyper-V, currently, when the kernel panics, an error
> > code and a few register values are populated in an MSR and the Hypervisor
> > is notified. The amount of information collected on the host this way has
> > been found to be limited and not very actionable. To gather more
> > actionable data, such as a stack trace, the proposal is to write one page
> > worth of kmsg data to an allocated page and notify the Hypervisor of the
> > page address through the MSR.
> >
> > - Sysctl option to control the behavior, with ON by default.
> >
> > Cc: K. Y. Srinivasan 
> > Cc: Stephen Hemminger 
> > Signed-off-by: Sunil Muthuswamy 
> > Signed-off-by: K. Y. Srinivasan 
> > ---
> 
> > +   /*
> > +* Write dump contents to the page. No need to synchronize; panic
> > +* should be single-threaded.
> > +*/
> > +   if (!kmsg_dump_get_buffer(dumper, true, hv_panic_page,
> > + PAGE_SIZE, &bytes_written)) {
> > +   pr_err("Hyper-V: Unable to get kmsg data for panic\n");
> > +   return;
> 
> From what I can see, the return value from kmsg_dump_get_buffer()
> is not an indication of success or failure -- it's an indication of whether
> there is more data available.   There's no reason to output an error
> message.
> 
That seems correct. Will address this.
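For reference, a minimal sketch of the adjusted call: since the return value
only says whether more records remain, the sketch keys off bytes_written
instead (hyperv_report_panic_msg() and the physical-address argument stand in
for the rest of the patch, which is not shown here):

	size_t bytes_written;

	/* No need to synchronize; panic should be single-threaded. */
	kmsg_dump_get_buffer(dumper, true, hv_panic_page, PAGE_SIZE,
			     &bytes_written);
	if (bytes_written)
		hyperv_report_panic_msg(panic_pa, bytes_written);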
> > @@ -1065,6 +1136,32 @@ static int vmbus_bus_init(void)
> >  * Only register if the crash MSRs are available
> >  */
> > if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
> > +   u64 hyperv_crash_ctl;
> > +   /*
> > +* Sysctl registration is not fatal, since by default
> > +* reporting is enabled.
> > +*/
> > +   hv_ctl_table_hdr = register_sysctl_table(hv_root_table);
> > +   if (!hv_ctl_table_hdr)
> > +   pr_err("Hyper-V: sysctl table register error");
> > +
> > +   /*
> > +* Register for panic kmsg callback only if the right
> > +* capability is supported by the hypervisor.
> > +*/
> > +   rdmsrl(HV_X64_MSR_CRASH_CTL, hyperv_crash_ctl);
> > +   if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG) {
> 
> vmbus_drv.c is architecture independent code, and should not be referencing
> x86/x64 MSRs.   Reading the MSR (and maybe the test as well?) should go
> in a separate function in an x86-specific source file.
> 
I will move the code.
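As a sketch of what that could look like (the helper name is hypothetical;
the MSR and the flag are from the patch):

/* In an x86-specific source file, e.g. arch/x86/hyperv/hv_init.c */
bool hv_is_crash_notify_msg_supported(void)
{
	u64 hyperv_crash_ctl;

	rdmsrl(HV_X64_MSR_CRASH_CTL, hyperv_crash_ctl);
	return hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG;
}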
> And just to confirm, is this the right way to test for the feature?  Usually,
> feature determination is based on one of the feature registers.  The
> NOTIFY_MSG flag seems to have a dual meaning -- on read it indicates
> the feature is present.  On write in hyperv_report_panic_msg(), it evidently
> means that the guest is sending a full page of data to Hyper-V.
> 
As per my conversation with John, this seems to be correct and is something
he also suggested. The host sets these bits depending on whether or not it
supports these features.

> > @@ -1081,6 +1178,11 @@ static int vmbus_bus_init(void)
> > bus_unregister(&hv_bus);
> > +   free_page((unsigned long)hv_panic_page);
> > +   if (!hv_ctl_table_hdr) {
> 
> The above test is backwards.  Remove the bang.
Good call, will do.
> 
> > @@ -1785,10 +1887,18 @@ static void __exit vmbus_exit(void)
> > +   free_page((unsigned long)hv_panic_page);
> > +   if (!hv_ctl_table_hdr) {
> 
> Same here.  Test is backwards.
> 
Will fix.
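For reference, the corrected form in both cleanup paths would be (the
unregister call pairs with the register_sysctl_table() earlier in the patch):

	free_page((unsigned long)hv_panic_page);
	if (hv_ctl_table_hdr) {
		unregister_sysctl_table(hv_ctl_table_hdr);
		hv_ctl_table_hdr = NULL;
	}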
> Michael



[PATCH] x86/Hyper-V: Support for free page reporting

2020-05-19 Thread Sunil Muthuswamy
Linux now has support for free page reporting (36e66c554b5c) in
virtualized environments. On Hyper-V, when virtually backed VMs are
configured, Hyper-V advertises the cold memory discard capability when
supported. This patch adds support to hook into the free page reporting
infrastructure and leverage the Hyper-V cold memory discard hint
hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy 
---
First patch mail bounced back. Sending it again with the email
addresses fixed.
---
 arch/x86/hyperv/hv_init.c | 24 
 arch/x86/kernel/cpu/mshyperv.c|  6 +-
 drivers/hv/hv_balloon.c   | 93 +++
 include/asm-generic/hyperv-tlfs.h | 29 ++
 include/asm-generic/mshyperv.h|  2 +
 5 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 624f5d9b0f79..925e2f7eb82c 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -506,3 +506,27 @@ bool hv_is_hibernation_supported(void)
return acpi_sleep_state_supported(ACPI_STATE_S4);
 }
 EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
+
+u64 hv_query_ext_cap(void)
+{
+   u64 *cap;
+   unsigned long flags;
+   u64 ext_cap = 0;
+
+   /*
+* Querying extended capabilities is an extended hypercall. Check if the
+* partition supports extended hypercalls first.
+*/
+   if (!(ms_hyperv.b_features & HV_ENABLE_EXTENDED_HYPERCALLS))
+   return 0;
+
+   local_irq_save(flags);
+   cap = *(u64 **)this_cpu_ptr(hyperv_pcpu_input_arg);
+   if (hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap) ==
+   HV_STATUS_SUCCESS)
+   ext_cap = *cap;
+
+   local_irq_restore(flags);
+   return ext_cap;
+}
+EXPORT_SYMBOL_GPL(hv_query_ext_cap);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index ebf34c7bc8bc..2de3f692c8bf 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -224,11 +224,13 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+   ms_hyperv.b_features = cpuid_ebx(HYPERV_CPUID_FEATURES);
ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
-   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
-   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
+   pr_info("Hyper-V: features 0x%x, additional features: 0x%x, hints 0x%x, 
misc 0x%x\n",
+   ms_hyperv.features, ms_hyperv.b_features, ms_hyperv.hints,
+   ms_hyperv.misc_features);
 
ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
ms_hyperv.max_lp_index = cpuid_ebx(HYPERV_CPUID_IMPLEMENT_LIMITS);
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 32e3bc0aa665..77be31094556 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -563,6 +564,10 @@ struct hv_dynmem_device {
 * The negotiated version agreed by host.
 */
__u32 version;
+
+#ifdef CONFIG_PAGE_REPORTING
+   struct page_reporting_dev_info pr_dev_info;
+#endif
 };
 
 static struct hv_dynmem_device dm_device;
@@ -1565,6 +1570,83 @@ static void balloon_onchannelcallback(void *context)
 
 }
 
+#ifdef CONFIG_PAGE_REPORTING
+static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
+   struct scatterlist *sgl, unsigned int nents)
+{
+   unsigned long flags;
+   struct hv_memory_hint *hint;
+   int i;
+   u64 status;
+   struct scatterlist *sg;
+
+   WARN_ON(nents > HV_MAX_GPA_PAGE_RANGES);
+   local_irq_save(flags);
+   hint = *(struct hv_memory_hint **)this_cpu_ptr(hyperv_pcpu_input_arg);
+   if (!hint) {
+   local_irq_restore(flags);
+   return -ENOSPC;
+   }
+
+   hint->type = HV_EXT_MEMORY_HEAT_HINT_TYPE_COLD_DISCARD;
+   hint->reserved = 0;
+   for (i = 0, sg = sgl; sg; sg = sg_next(sg), i++) {
+   int order;
+   union hv_gpa_page_range *range;
+
+   order = get_order(sg->length);
+   range = &hint->ranges[i];
+   range->address_space = 0;
+   range->page.largepage = 1;
+   range->page.additional_pages = (1ull << (order - 9)) - 1;
+   range->base_large_pfn = page_to_pfn(sg_page(sg)) >> 9;
+   }
+
+   status = hv_do_rep_hypercall(HV_EXT_CALL_MEMORY_HEAT_HINT, nents, 0,
+hint, NULL);
+   local_irq_restore(flags);
+   status &= HV_HYPERCALL_RESULT_MASK;
+   if (sta

RE: [EXTERNAL] Re: [PATCH] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-07 Thread Sunil Muthuswamy
> What is this SoB chain supposed to say?

Quoting from the link you shared:
"The Signed-off-by: tag indicates that the signer was involved in the 
development of
the patch, or that he/she was in the patch's delivery path."

My intent to include Boqun in the Signed-off by tag was to indicate that he was 
involved
in the development of the patch here.

- Sunil


RE: [EXTERNAL] Re: [PATCH] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-07 Thread Sunil Muthuswamy
> There seems to be a long tradition of dreaming up random formats for
> the subject lines of Hyper-V-related patches.  Look at all the
> different ways these are spelled, hyphenated, and capitalized:
> 
I am reading this as a suggestion to unify the format of the subject of
Hyper-V patches. Wei and other maintainers of the Hyper-V branch: do you
have any suggestions? If we have already defined a format and it is me
who is not following it, please forward the document my way.

> On Thu, Jan 07, 2021 at 05:05:36AM +, Sunil Muthuswamy wrote:
> > Currently, operations related to irq/msi in Hyper-V vPCI are
> 
> In comments in the patch, you use "IRQ" and "MSI".  I don't know
> whether "vPCI" means something or is a typo.  I suppose it probably
> means "virtual PCI" as below.
> 

Yes, vPCI here means virtual PCI.


RE: [EXTERNAL] Re: [PATCH] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-07 Thread Sunil Muthuswamy
> Do you see "Co-developed-by" in the title of that section? This is how
> you express co-authorship.
Yes, I do now.

> 
> As it is now:
> 
> Signed-off-by: Sunil Muthuswamy 
> Signed-off-by: Boqun Feng (Microsoft) 
> 
> it says that you authored the patch and Boqun handled it further, i.e.,
> he's in the patch's delivery path. But he isn't - you're sending it.

Thanks, I will send out a new patch with this updated.


[PATCH v2 1/2] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-07 Thread Sunil Muthuswamy
Currently, the operations related to IRQ/MSI in Hyper-V vPCI are
implemented in x86-specific code. In order to support virtual PCI on
Hyper-V for other architectures, introduce generic interfaces to replace
the x86-specific ones. There are no functional changes in this patch.

Co-developed-by: Boqun Feng (Microsoft) 
Signed-off-by: Boqun Feng (Microsoft) 
Signed-off-by: Sunil Muthuswamy 
---
In V2:
- Addressed feedback on SoB tag.
- Added a second patch to move the MSI entry definition.
---
 arch/x86/include/asm/mshyperv.h | 24 +
 drivers/pci/controller/pci-hyperv.c | 33 +
 2 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..05b32ef57e34 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -245,6 +245,30 @@ bool hv_vcpu_is_preempted(int vcpu);
 static inline void hv_apic_init(void) {}
 #endif
 
+#define hv_msi_handler handle_edge_irq
+#define hv_msi_handler_name "edge"
+#define hv_msi_prepare pci_msi_prepare
+
+/* Returns the Hyper-V PCI parent MSI vector domain. */
+static inline struct irq_domain *hv_msi_parent_vector_domain(void)
+{
+   return x86_vector_domain;
+}
+
+/* Returns the interrupt vector mapped to the given IRQ. */
+static inline unsigned int hv_msi_get_int_vector(struct irq_data *data)
+{
+   struct irq_cfg *cfg = irqd_cfg(data);
+
+   return cfg->vector;
+}
+
+/* Get the IRQ delivery mode. */
+static inline u8 hv_msi_irq_delivery_mode(void)
+{
+   return APIC_DELIVERY_MODE_FIXED;
+}
+
 static inline void hv_set_msi_entry_from_desc(union hv_msi_entry *msi_entry,
  struct msi_desc *msi_desc)
 {
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 6db8d96a78eb..9ca740d275d7 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -43,12 +43,11 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -1194,7 +1193,6 @@ static void hv_irq_mask(struct irq_data *data)
 static void hv_irq_unmask(struct irq_data *data)
 {
struct msi_desc *msi_desc = irq_data_get_msi_desc(data);
-   struct irq_cfg *cfg = irqd_cfg(data);
struct hv_retarget_device_interrupt *params;
struct hv_pcibus_device *hbus;
struct cpumask *dest;
@@ -1223,7 +1221,7 @@ static void hv_irq_unmask(struct irq_data *data)
   (hbus->hdev->dev_instance.b[7] << 8) |
   (hbus->hdev->dev_instance.b[6] & 0xf8) |
   PCI_FUNC(pdev->devfn);
-   params->int_target.vector = cfg->vector;
+   params->int_target.vector = hv_msi_get_int_vector(data);
 
/*
 * Honoring apic->delivery_mode set to APIC_DELIVERY_MODE_FIXED by
@@ -1324,7 +1322,7 @@ static u32 hv_compose_msi_req_v1(
int_pkt->wslot.slot = slot;
int_pkt->int_desc.vector = vector;
int_pkt->int_desc.vector_count = 1;
-   int_pkt->int_desc.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   int_pkt->int_desc.delivery_mode = hv_msi_irq_delivery_mode();
 
/*
 * Create MSI w/ dummy vCPU set, overwritten by subsequent retarget in
@@ -1345,7 +1343,7 @@ static u32 hv_compose_msi_req_v2(
int_pkt->wslot.slot = slot;
int_pkt->int_desc.vector = vector;
int_pkt->int_desc.vector_count = 1;
-   int_pkt->int_desc.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   int_pkt->int_desc.delivery_mode = hv_msi_irq_delivery_mode();
 
/*
 * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
@@ -1372,7 +1370,6 @@ static u32 hv_compose_msi_req_v2(
  */
 static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
-   struct irq_cfg *cfg = irqd_cfg(data);
struct hv_pcibus_device *hbus;
struct vmbus_channel *channel;
struct hv_pci_dev *hpdev;
@@ -1422,7 +1419,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
dest,
hpdev->desc.win_slot.slot,
-   cfg->vector);
+   hv_msi_get_int_vector(data));
break;
 
case PCI_PROTOCOL_VERSION_1_2:
@@ -1430,7 +1427,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
dest,
hpdev->desc.win_slot.slot,
-   cfg->vector);
+   hv

[PATCH v2 2/2] Hyper-V: pci: x64: Moving the MSI entry definition to arch specific location

2021-01-07 Thread Sunil Muthuswamy
The Hyper-V MSI entry is architecture specific. Currently, it is
defined in an arch neutral location. This patch moves it to an
arch specific location.

Signed-off-by: Sunil Muthuswamy 
---
 arch/x86/include/asm/hyperv-tlfs.h | 7 +++
 include/asm-generic/hyperv-tlfs.h  | 8 
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 6bf42aed387e..a15c17c7f019 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -523,6 +523,13 @@ struct hv_partition_assist_pg {
u32 tlb_lock_count;
 };
 
+union hv_msi_entry {
+   u64 as_uint64;
+   struct {
+   u32 address;
+   u32 data;
+   } __packed;
+};
 
 #include 
 
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index e73a11850055..6265f4494970 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -408,14 +408,6 @@ struct hv_tlb_flush_ex {
 } __packed;
 
 /* HvRetargetDeviceInterrupt hypercall */
-union hv_msi_entry {
-   u64 as_uint64;
-   struct {
-   u32 address;
-   u32 data;
-   } __packed;
-};
-
 struct hv_interrupt_entry {
u32 source; /* 1 for MSI(-X) */
u32 reserved1;
-- 
2.17.1


RE: [EXTERNAL] Re: [PATCH v2 1/2] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-08 Thread Sunil Muthuswamy
> At the very least you could pick one of the subject line prefixes that
> has been used before for either mshyperv.h or pci-hyperv.c instead of
> making up something completely new and different.
> 
Will keep that in mind going forward.


[PATCH v4] x86/Hyper-V: Support for free page reporting

2021-03-18 Thread Sunil Muthuswamy
Linux now has support for free page reporting (36e66c554b5c) in
virtualized environments. On Hyper-V, when virtually backed VMs are
configured, Hyper-V advertises the cold memory discard capability when
supported. This patch adds support to hook into the free page reporting
infrastructure and leverage the Hyper-V cold memory discard hint
hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy 
Tested-by: Matheus Castello 
---
In V2:
- Addressed feedback comments
- Added page reporting config option tied to hyper-v balloon config

In V3:
- Addressed feedback from Vitaly

In V4:
- Queried and cached the Hyper-V extended capability for the lifetime
  of the VM
- Addressed feedback from Michael Kelley.
---
 arch/x86/hyperv/hv_init.c | 45 +++-
 arch/x86/kernel/cpu/mshyperv.c|  9 ++--
 drivers/hv/Kconfig|  1 +
 drivers/hv/hv_balloon.c   | 89 +++
 include/asm-generic/hyperv-tlfs.h | 35 +++-
 include/asm-generic/mshyperv.h|  3 +-
 6 files changed, 174 insertions(+), 8 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 9d100257b3af..3937c56145b5 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -498,6 +498,8 @@ void __init hyperv_init(void)
x86_init.irqs.create_pci_msi_domain = hv_create_pci_msi_domain;
 #endif
 
+   /* Query the VM's extended capabilities once, so that they can be cached. */
+   hv_query_ext_cap(0);
return;
 
 remove_cpuhp_state:
@@ -601,7 +603,7 @@ EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
 
 enum hv_isolation_type hv_get_isolation_type(void)
 {
-   if (!(ms_hyperv.features_b & HV_ISOLATION))
+   if (!(ms_hyperv.priv_high & HV_ISOLATION))
return HV_ISOLATION_TYPE_NONE;
return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
 }
@@ -612,3 +614,44 @@ bool hv_is_isolation_supported(void)
return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
 }
 EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
+
+/* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
+bool hv_query_ext_cap(u64 cap_query)
+{
+   /*
+* The address of the 'hv_extended_cap' variable will be used as an
+* output parameter to the hypercall below and so it should be
+* compatible with 'virt_to_phys', which means its address should be
+* directly mapped. Use 'static' to keep it compatible; stack variables
+* can be virtually mapped, making them incompatible with
+* 'virt_to_phys'.
+* Hypercall input/output addresses should also be 8-byte aligned.
+*/
+   static u64 hv_extended_cap __aligned(8);
+   static bool hv_extended_cap_queried;
+   u64 status;
+
+   /*
+* Querying extended capabilities is an extended hypercall. Check if the
+* partition supports extended hypercalls first.
+*/
+   if (!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
+   return false;
+
+   /* Extended capabilities do not change at runtime. */
+   if (hv_extended_cap_queried)
+   return hv_extended_cap & cap_query;
+
+   status = hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL,
+&hv_extended_cap);
+   hv_extended_cap_queried = true;
+   status &= HV_HYPERCALL_RESULT_MASK;
+   if (status != HV_STATUS_SUCCESS) {
+   pr_err("Hyper-V: Extended query capabilities hypercall failed 
0x%llx\n",
+  status);
+   return false;
+   }
+
+   return hv_extended_cap & cap_query;
+}
+EXPORT_SYMBOL_GPL(hv_query_ext_cap);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index cebed535ec56..3546d3e21787 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -265,12 +265,13 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
-   ms_hyperv.features_b = cpuid_ebx(HYPERV_CPUID_FEATURES);
+   ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
-   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
-   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
+   pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 
0x%x\n",
+   ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
+   ms_hyperv.misc_features);
 
ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
ms_hyperv.max_lp_index = cpuid_

RE: [PATCH v4] x86/Hyper-V: Support for free page reporting

2021-03-19 Thread Sunil Muthuswamy
> What's the strategy for this flag in the unlikely event that the hypercall fails?
> It doesn't seem right to have hv_query_ext_cap() fail, but leave the
> static flag set to true.  Just move that line down to after the status check
> has succeeded?

That call should not fail under any normal circumstances. The current idea
was to avoid repeating the same call on persistent failure. But since we
don't expect the query capability to be called in any kind of hot path, I am
OK with moving this down.

> 
> Other than the above about the flag when the hypercall fails,
> everything else looks good.
Thanks for the review.
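For context, a caller-side sketch of how the cached query gets consumed by
the balloon driver (the capability constant is from this patch series):

	/* Probes once; later calls return the cached capability bits. */
	if (!hv_query_ext_cap(HV_EXT_CAPABILITY_MEMORY_COLD_DISCARD_HINT)) {
		pr_debug("Cold memory discard hint not supported by Hyper-V\n");
		return;
	}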


[PATCH v5] x86/Hyper-V: Support for free page reporting

2021-03-23 Thread Sunil Muthuswamy
Linux now has support for free page reporting (36e66c554b5c) in
virtualized environments. On Hyper-V, when virtually backed VMs are
configured, Hyper-V advertises the cold memory discard capability when
supported. This patch adds support to hook into the free page reporting
infrastructure and leverage the Hyper-V cold memory discard hint
hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy 
Tested-by: Matheus Castello 
---
In V2:
- Addressed feedback comments
- Added page reporting config option tied to hyper-v balloon config

In V3:
- Addressed feedback from Vitaly

In V4:
- Queried and cached the Hyper-V extended capability for the lifetime
  of the VM
- Addressed feedback from Michael Kelley.

In v5:
- Added a comment clarifying handling of failed query extended
  capability hypercall to address Michael's feedback.
---
 arch/x86/hyperv/hv_init.c | 51 +-
 arch/x86/kernel/cpu/mshyperv.c|  9 ++--
 drivers/hv/Kconfig|  1 +
 drivers/hv/hv_balloon.c   | 89 +++
 include/asm-generic/hyperv-tlfs.h | 35 +++-
 include/asm-generic/mshyperv.h|  3 +-
 6 files changed, 180 insertions(+), 8 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 9d100257b3af..7c9da3f65afa 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -498,6 +498,8 @@ void __init hyperv_init(void)
x86_init.irqs.create_pci_msi_domain = hv_create_pci_msi_domain;
 #endif
 
+   /* Query the VM's extended capabilities once, so that they can be cached. */
+   hv_query_ext_cap(0);
return;
 
 remove_cpuhp_state:
@@ -601,7 +603,7 @@ EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
 
 enum hv_isolation_type hv_get_isolation_type(void)
 {
-   if (!(ms_hyperv.features_b & HV_ISOLATION))
+   if (!(ms_hyperv.priv_high & HV_ISOLATION))
return HV_ISOLATION_TYPE_NONE;
return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
 }
@@ -612,3 +614,50 @@ bool hv_is_isolation_supported(void)
return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
 }
 EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
+
+/* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
+bool hv_query_ext_cap(u64 cap_query)
+{
+   /*
+* The address of the 'hv_extended_cap' variable will be used as an
+* output parameter to the hypercall below and so it should be
+* compatible with 'virt_to_phys', which means its address should be
+* directly mapped. Use 'static' to keep it compatible; stack variables
+* can be virtually mapped, making them incompatible with
+* 'virt_to_phys'.
+* Hypercall input/output addresses should also be 8-byte aligned.
+*/
+   static u64 hv_extended_cap __aligned(8);
+   static bool hv_extended_cap_queried;
+   u64 status;
+
+   /*
+* Querying extended capabilities is an extended hypercall. Check if the
+* partition supports extended hypercalls first.
+*/
+   if (!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
+   return false;
+
+   /* Extended capabilities do not change at runtime. */
+   if (hv_extended_cap_queried)
+   return hv_extended_cap & cap_query;
+
+   status = hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL,
+&hv_extended_cap);
+
+   /*
+* The query extended capabilities hypercall should not fail under
+* any normal circumstances. Avoid repeatedly making the hypercall on
+* error.
+*/
+   hv_extended_cap_queried = true;
+   status &= HV_HYPERCALL_RESULT_MASK;
+   if (status != HV_STATUS_SUCCESS) {
+   pr_err("Hyper-V: Extended query capabilities hypercall failed 
0x%llx\n",
+  status);
+   return false;
+   }
+
+   return hv_extended_cap & cap_query;
+}
+EXPORT_SYMBOL_GPL(hv_query_ext_cap);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index cebed535ec56..3546d3e21787 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -265,12 +265,13 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
-   ms_hyperv.features_b = cpuid_ebx(HYPERV_CPUID_FEATURES);
+   ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
-   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
-   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
+   pr_info("Hype

RE: [PATCH v8 1/6] arm64: hyperv: Add Hyper-V hypercall and register access utilities

2021-03-03 Thread Sunil Muthuswamy
> +}
> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
> +
> +
nit: Extra line, here and a few other places.

> +u64 hv_get_vpreg(u32 msr)
> +{
> + struct hv_get_vp_registers_input*input;
> + struct hv_get_vp_registers_output   *output;
> + u64 result;
It seems like the code is using both spaces and tabs to align variable
names. Can we stick to one or the other, preferably spaces?

> +
> + /*
> +  * Allocate a power of 2 size so alignment to that size is
> +  * guaranteed, since the hypercall input and output areas
> +  * must not cross a page boundary.
> +  */
> + input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
> + sizeof(input->element[0])), GFP_ATOMIC);
> + output = kmalloc(roundup_pow_of_two(sizeof(*output)), GFP_ATOMIC);
> +
Check for null from these malloc routines? Here and in other places.

> + __hv_get_vpreg_128(msr, input, output);
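A sketch of the allocations with the suggested NULL checks (the -ENOMEM is
illustrative; the actual error handling depends on what this path can
return):

	input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
			sizeof(input->element[0])), GFP_ATOMIC);
	output = kmalloc(roundup_pow_of_two(sizeof(*output)), GFP_ATOMIC);
	if (!input || !output) {
		kfree(input);	/* kfree(NULL) is a no-op */
		kfree(output);
		return -ENOMEM;
	}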

> + * Linux-specific definitions for managing interactions with Microsoft's
> + * Hyper-V hypervisor. The definitions in this file are specific to
> + * the ARM64 architecture.  See include/asm-generic/mshyperv.h for
nit: Two spaces before 'See', here and in a couple of other places.



RE: [PATCH v8 1/6] arm64: hyperv: Add Hyper-V hypercall and register access utilities

2021-03-03 Thread Sunil Muthuswamy
> > +
> > +   /*
> > +* Allocate a power of 2 size so alignment to that size is
> > +* guaranteed, since the hypercall input and output areas
> > +* must not cross a page boundary.
> > +*/
> > +   input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
> > +   sizeof(input->element[0])), GFP_ATOMIC);
> > +   output = kmalloc(roundup_pow_of_two(sizeof(*output)), GFP_ATOMIC);
> > +
> Check for null from these malloc routines? Here and in other places.

By the way, is there a plan to set up a per-cpu input/output page for
hypercall input/output on ARM (like we do for x64)? I didn't check the
specific size requirement for this particular call, but that generally
would remove the need for these allocations.


RE: [PATCH v8 6/6] Drivers: hv: Enable Hyper-V code to be built on ARM64

2021-03-03 Thread Sunil Muthuswamy
> Update drivers/hv/Kconfig so CONFIG_HYPERV can be selected on
> ARM64, causing the Hyper-V specific code to be built.
> 
> Signed-off-by: Michael Kelley 
> ---
>  drivers/hv/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 79e5356..d492682 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -4,7 +4,8 @@ menu "Microsoft Hyper-V guest support"
> 
>  config HYPERV
>   tristate "Microsoft Hyper-V client drivers"
> - depends on X86 && ACPI && X86_LOCAL_APIC && HYPERVISOR_GUEST
> + depends on ACPI && ((X86 && X86_LOCAL_APIC && HYPERVISOR_GUEST) \
> + || (ARM64 && !CPU_BIG_ENDIAN))
>   select PARAVIRT
>   select X86_HV_CALLBACK_VECTOR
>   help
> --
> 1.8.3.1

Reviewed-by: Sunil Muthuswamy 



RE: [PATCH v8 4/6] arm64: hyperv: Initialize hypervisor on boot

2021-03-03 Thread Sunil Muthuswamy
> +static u64 hypercall_output __initdata;
> +
> +static int __init hyperv_init(void)
> +{
> + struct hv_get_vpindex_from_apicid_input *input;
> + u64 status;
> + int i;
nit: both tabs & spaces are being used to indent variable names. Can we
stick to one?

> +
> + /*
> +  * Hypercall inputs must not cross a page boundary, so allocate
> +  * power of 2 size, which will be aligned to that size.
> +  */
> + input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
> + sizeof(input->element[0])), GFP_KERNEL);
> + if (!input)
> + return -ENOMEM;
Similar to the comment on the other patch, can this be done through a
per-cpu hypercall input page?

> + if ((status & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS)
> + hv_vp_index[i] = hypercall_output;
> + else {
CNF suggests using braces even for single-line 'if' statements if the
'else' is a multi-line statement (see the braced form after the quoted
code).

> + pr_warn("Hyper-V: No VP index for CPU %d MPIDR %llx 
> status %llx\n",
> + i, cpu_logical_map(i), status);
> + hv_vp_index[i] = VP_INVAL;
> + }
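i.e., with braces on both arms, something like:

	if ((status & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS) {
		hv_vp_index[i] = hypercall_output;
	} else {
		pr_warn("Hyper-V: No VP index for CPU %d MPIDR %llx status %llx\n",
			i, cpu_logical_map(i), status);
		hv_vp_index[i] = VP_INVAL;
	}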

> +bool hv_is_hyperv_initialized(void)
> +{
> + return hyperv_initialized;
> +}
> +EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);

This to me seems more like an indication of whether the hypervisor is
Hyper-V or something else, rather than whether the necessary Hyper-V pieces
have been initialized. If that is the case, I suggest renaming. Maybe just
call it 'hv_is_hyperv'?


RE: [PATCH v8 2/6] arm64: hyperv: Add Hyper-V clocksource/clockevent support

2021-03-03 Thread Sunil Muthuswamy
> +/* Define the interrupt ID used by STIMER0 Direct Mode interrupts. This
> + * value can't come from ACPI tables because it is needed before the
> + * Linux ACPI subsystem is initialized.
> + */
> +#define HYPERV_STIMER0_VECTOR   31
nit: On x64, this is defined in hex.

> +
> + return 0;
> +}
> +TIMER_ACPI_DECLARE(hyperv, ACPI_SIG_GTDT, hyperv_timer_init);
> --
> 1.8.3.1

Reviewed-by: Sunil Muthuswamy 



RE: [PATCH v8 3/6] arm64: hyperv: Add kexec and panic handlers

2021-03-03 Thread Sunil Muthuswamy
> +EXPORT_SYMBOL_GPL(hv_setup_crash_handler);
> +
> +void hv_remove_crash_handler(void)
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_remove_crash_handler);
> --
> 1.8.3.1

Reviewed-by: Sunil Muthuswamy 


RE: [PATCH v9 5/7] arm64: hyperv: Initialize hypervisor on boot

2021-03-15 Thread Sunil Muthuswamy
>  static int num_standard_resources;
> @@ -340,6 +341,9 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
>   if (acpi_disabled)
>   unflatten_device_tree();
> 
> + /* Do after acpi_boot_table_init() so local FADT is available */
> + hyperv_early_init();
> +
>   bootmem_init();
> 
>   kasan_init();
> --
> 1.8.3.1

Reviewed-by: Sunil Muthuswamy 


RE: [PATCH v9 2/7] arm64: hyperv: Add Hyper-V hypercall and register access utilities

2021-03-15 Thread Sunil Muthuswamy
> +static inline u64 hv_get_register(unsigned int reg)
> +{
> + return hv_get_vpreg(reg);
> +}
> +
> +/* SMCCC hypercall parameters */
> +#define HV_SMCCC_FUNC_NUMBER 1
> +#define HV_FUNC_ID   ARM_SMCCC_CALL_VAL( \
> + ARM_SMCCC_STD_CALL, \
> + ARM_SMCCC_SMC_64,   \
> + ARM_SMCCC_OWNER_VENDOR_HYP, \
> + HV_SMCCC_FUNC_NUMBER)
> +
> +#include 
> +
> +#endif
> --
> 1.8.3.1

Reviewed-by: Sunil Muthuswamy 


RE: [PATCH v3] x86/Hyper-V: Support for free page reporting

2021-03-17 Thread Sunil Muthuswamy
> > +   if (!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
> > +   return 0;
> 
> Return 'false' since the function is declared as bool?
Will fix this in the next iteration.

> > +   if (hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap) ==
> > +   HV_STATUS_SUCCESS)
> 
> Need to mask before checking for HV_STATUS_SUCCESS.  With regard to the
> reserved fields in the returned 64 bit status, the TLFS says "Callers
> should ignore the value in these bits".  There's no promise that they
> are zero.
Coming in next version.

> 
> > +   ext_cap = *cap;
> > +
> > +   local_irq_restore(flags);
> > +   return ext_cap & cap_query;
> > +}
> 
> As I noted in a review comment back in May, the output arg here is
> only 64 bits in size and could just live on the stack with assurance that
> it won't cross a page boundary.  So the code could be:
> 
> bool hv_query_ext_cap(u64 cap_query)
> {
>   u64 cap;
>   u64 status;
> 
>   if(!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
>   return false;
> 
>   status = hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, &cap);
>   if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
>   cap = 0;
> 
>   return cap & cap_query;
> }
> 
> But if you think there's value in using the designated page for hypercall
> args, I'm OK with just fixing the testing of the status.

Hypercall input/output addresses should be 'virt_to_phys' compatible, as
'hv_do_hypercall' will call that on the address to get the physical address
to pass on to the hypervisor. Stack variables can be virtually mapped and
are not compatible with 'virt_to_phys', but we should be able to use a
'static' variable for this. Will address this in the next version.
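Concretely, something along these lines (matching the v4 patch elsewhere in
this series):

	/*
	 * A static variable is directly mapped, so its address is
	 * 'virt_to_phys' compatible; align to 8 bytes per the hypercall rules.
	 */
	static u64 hv_extended_cap __aligned(8);

	status = hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL,
				 &hv_extended_cap);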

> 
> > -   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
> > -   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
> > +   pr_info("Hyper-V: privilege flags low:0x%x, high:0x%x, hints:0x%x, 
> > misc:0x%x\n",
> 
> Nit.  Could we just use a space instead of a colon before each of the
> printed hex values?
Sure, coming in next version.

> > @@ -23,6 +23,7 @@ config HYPERV_UTILS
> >  config HYPERV_BALLOON
> > tristate "Microsoft Hyper-V Balloon driver"
> > depends on HYPERV
> > +   select PAGE_REPORTING
> 
> With this selection made, are the #ifdef CONFIG_PAGE_REPORTING occurrences
> below really needed?  I looked at the virtio balloon driver, which also
> does "select PAGE_REPORTING", and it does not have any #ifdef's.

Good point. I don't think we need the extra '#ifdef's for page reporting now
that it is implicit with the Hyper-V Balloon. Coming in next version.

> >  static struct hv_dynmem_device dm_device;
> > @@ -1568,6 +1573,84 @@ static void balloon_onchannelcallback(void *context)
> >
> >  }
> >
> > +#ifdef CONFIG_PAGE_REPORTING
> > +/* Hyper-V only supports reporting 2MB pages or higher */
> 
> I'm guessing the above is the same on ARM64 where the guest is using 16K
> or 64K page size, because Hyper-V always uses 4K pages and expects all guest
> communication to be in units of 4K pages.

Yes.
 
> > +   range->page.additional_pages =
> > +   (sg->length / HV_MIN_PAGE_REPORTING_LEN) - 1;
> 
> Perhaps verify that sg->length is at least 2 Meg? (similar to verifying
> that nents isn't too big).  If it isn't at least 2 Meg, then
> additional_pages will get set to -1, and I suspect weird things will
> happen.
I will add an assert.
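Something like the following, inside the per-range loop:

	/* Hyper-V can't take hints smaller than the 2MB minimum. */
	WARN_ON_ONCE(sg->length < HV_MIN_PAGE_REPORTING_LEN);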

> 
> I was also thinking about whether sg->length could be big enough to
> overflow the additional_pages field.  sg->length is an unsigned int, so
> I don't think so.
Yes, the additional_pages field is designed to accommodate 32 bits.

> 
> > +   range->base_large_pfn =
> > +   page_to_pfn(sg_page(sg)) >> HV_MIN_PAGE_REPORTING_ORDER;
> 
> page_to_pfn() will do the wrong thing on ARM64 with 16K or 64K pages.
> Use page_to_hvpfn() instead.
Good point.

> > +static void enable_page_reporting(void)
> > +{
> > +   int ret;
> > +
> > +   BUILD_BUG_ON(pageblock_order < HV_MIN_PAGE_REPORTING_ORDER);
> 
> The BUILD_BUG_ON won't work in the case where pageblock_order is
> actually a variable rather than a constant, though that's currently only
> ia64 and powerpc, which we don't directly care about.  Nonetheless, this
> would break if pageblock_order were to become a variable.
> 
I have moved this to a conditional statement. The compiler can optimize the code
away when it is a constant.
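i.e., roughly:

	/*
	 * pageblock_order can be a runtime variable on some architectures;
	 * when it is a compile-time constant the compiler drops this branch.
	 */
	if (pageblock_order < HV_MIN_PAGE_REPORTING_ORDER)
		return;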

> > +   if (ret < 0) {
> > +   dm_device.pr_dev_info.report = NULL;
> > +   pr_err("Failed to enable cold memory discard: %d\n", ret);
> > +   } else {
> > +   pr_info("Cold memory discard hint enabled\n");
> > +   }
> 
> Should the above two messages be prefixed with "Hyper-V: "?
Not needed, as you also replied later.

> Nit:  Typically the error path undoes things in the reverse order. So
> the disable_page_reporting() would occur before the 

RE: [EXTERNAL] Re: [RFC PATCH 12/12] HV/Storvsc: Add bounce buffer support for Storvsc

2021-03-01 Thread Sunil Muthuswamy
> Hi Christoph:
>   Thanks a lot for your review. There are some reasons.
>   1) Vmbus drivers don't use DMA API now.
What is blocking us from making the Hyper-V drivers use the DMA APIs? They
will generally be a no-op when there is no bounce buffer support needed.

>   2) The Hyper-V Vmbus channel ring buffer already plays the bounce buffer
> role for most vmbus drivers. Just two kinds of packets from
> netvsc/storvsc are uncovered.
How does this make a difference here?

>   3) In an AMD SEV-SNP based Hyper-V guest, the physical address used to
> access shared memory should be the bounce buffer memory physical address
> plus a shared memory boundary (e.g., 48 bits) reported by the Hyper-V
> CPUID. It's called virtual top of memory (vTom) in the AMD spec and works
> as a watermark. So it needs to ioremap/memremap the associated physical
> address above the shared memory boundary before accessing it.
> swiotlb_bounce() uses the low-end physical address to access the bounce
> buffer and this doesn't work in this scenario. If something is wrong,
> please help correct me.
> 
There are alternative implementations of swiotlb on top of the core swiotlb
APIs. One option is to have Hyper-V specific swiotlb wrapper DMA APIs with
the custom logic above.

> Thanks.
> 
> 
> On 3/1/2021 2:54 PM, Christoph Hellwig wrote:
> > This should be handled by the DMA mapping layer, just like for native
> > SEV support.
I agree with Christoph's comment that, in principle, this should be handled
using the DMA APIs.
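A generic sketch of the pattern being argued for (hypothetical device and
buffer; the DMA layer transparently bounces when needed, e.g. for SEV):

	dma_addr_t dma_addr;

	dma_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, dma_addr))
		return -ENOMEM;

	/* ... hand dma_addr to the host and wait for completion ... */

	dma_unmap_single(dev, dma_addr, len, DMA_TO_DEVICE);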


[PATCH v2] x86/Hyper-V: Support for free page reporting

2021-01-05 Thread Sunil Muthuswamy
Linux now has support for free page reporting (36e66c554b5c) in
virtualized environments. On Hyper-V, when virtually backed VMs are
configured, Hyper-V advertises the cold memory discard capability when
supported. This patch adds support to hook into the free page reporting
infrastructure and leverage the Hyper-V cold memory discard hint
hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy 
Tested-by: Matheus Castello 
---
In V2:
- Addressed feedback comments
- Added page reporting config option tied to hyper-v balloon config
---
 arch/x86/hyperv/hv_init.c | 31 +++
 arch/x86/kernel/cpu/mshyperv.c|  6 +-
 drivers/hv/Kconfig|  1 +
 drivers/hv/hv_balloon.c   | 93 +++
 include/asm-generic/hyperv-tlfs.h | 32 ++-
 include/asm-generic/mshyperv.h|  2 +
 6 files changed, 162 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..1ed7ba07e009 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -528,3 +528,34 @@ bool hv_is_hibernation_supported(void)
return acpi_sleep_state_supported(ACPI_STATE_S4);
 }
 EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
+
+// Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx
+bool hv_query_ext_cap(u64 cap_query)
+{
+   u64 *cap;
+   unsigned long flags;
+   u64 ext_cap = 0;
+
+   /*
+* Querying extended capabilities is an extended hypercall. Check if the
+* partition supports extended hypercalls first.
+*/
+   if (!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
+   return 0;
+
+   /*
+* Repurpose the input page arg to accept output from Hyper-V for
+* now because this is the only call that needs output from the
+* hypervisor. It should be fixed properly by introducing an
+* output arg once we have more places that require output.
+*/
+   local_irq_save(flags);
+   cap = *(u64 **)this_cpu_ptr(hyperv_pcpu_input_arg);
+   if (hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap) ==
+   HV_STATUS_SUCCESS)
+   ext_cap = *cap;
+
+   local_irq_restore(flags);
+   return ext_cap & cap_query;
+}
+EXPORT_SYMBOL_GPL(hv_query_ext_cap);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4550cb..7a7735918350 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -225,11 +225,13 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+   ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
-   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
-   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
+   pr_info("Hyper-V: features 0x%x, privilege flags high: 0x%x, hints 
0x%x, misc 0x%x\n",
+   ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
+   ms_hyperv.misc_features);
 
ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
ms_hyperv.max_lp_index = cpuid_ebx(HYPERV_CPUID_IMPLEMENT_LIMITS);
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 79e5356a737a..66c794d92391 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -23,6 +23,7 @@ config HYPERV_UTILS
 config HYPERV_BALLOON
tristate "Microsoft Hyper-V Balloon driver"
depends on HYPERV
+   select PAGE_REPORTING
help
  Select this option to enable Hyper-V Balloon driver.
 
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 8c471823a5af..bb0624bf615b 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -563,6 +564,10 @@ struct hv_dynmem_device {
 * The negotiated version agreed by host.
 */
__u32 version;
+
+#ifdef CONFIG_PAGE_REPORTING
+   struct page_reporting_dev_info pr_dev_info;
+#endif
 };
 
 static struct hv_dynmem_device dm_device;
@@ -1568,6 +1573,84 @@ static void balloon_onchannelcallback(void *context)
 
 }
 
+#ifdef CONFIG_PAGE_REPORTING
+/* Hyper-V only supports reporting 2MB pages or higher */
+#define HV_MIN_PAGE_REPORTING_ORDER	9
+#define HV_MIN_PAGE_REPORTING_LEN	(HV_HYP_PAGE_SIZE << HV_MIN_PAGE_REPORTING_ORDER)
+static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
+   struct scatterlist *sgl, unsigned int nents)
+{
+   unsigned long flags;
+   struct hv_memory_hint *hint;
+   int i;
+   u64 status;
+   str

RE: [EXTERNAL] Re: [PATCH v2] x86/Hyper-V: Support for free page reporting

2021-01-06 Thread Sunil Muthuswamy
> > +// Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx
> 
> Please don't use '//' comments in Linux (here and below)
Will fix in v3.

> > +   /*
> > +* Repurpose the input page arg to accept output from Hyper-V for
> > +* now because this is the only call that needs output from the
> > +* hypervisor. It should be fixed properly by introducing an
> > +* output arg once we have more places that require output.
> > +*/
> 
> I remember there was a patch from Wei (related to running Linux as root
> partition) doing the job, we can probably merge it early to avoid this
> re-purposing.
I would prefer getting this in right now; we can update this once Wei's
patches get merged.

> > -   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
> > -   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
> > +   pr_info("Hyper-V: features 0x%x, privilege flags high: 0x%x, hints 
> > 0x%x, misc 0x%x\n",
> > +   ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
> > +   ms_hyperv.misc_features);
> 
> Even if we would like to avoid the churn and keep 'ms_hyperv.features',
> we could've reported this as
> 
> "privilege flags low:%0x%x high:0x%x" to avoid the misleading 'features'.
> 
Sure, will change it in v3.

> > +   WARN_ON(nents > HV_MEMORY_HINT_MAX_GPA_PAGE_RANGES);
> 
> WARN_ON_ONCE() maybe?
Sure, coming in v3.

> > +   for (i = 0, sg = sgl; sg; sg = sg_next(sg), i++) {
> 
> This looks like an open-coded for_each_sg() version.
Thanks, will change in v3.
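i.e.:

	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		/* fill hint->ranges[i] from sg, as before */
	}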

> > +   BUILD_BUG_ON(pageblock_order < HV_MIN_PAGE_REPORTING_ORDER);
> > +   if (!hv_query_ext_cap(HV_EXT_CAPABILITY_MEMORY_COLD_DISCARD_HINT)) {
> > +   pr_info("Cold memory discard hint not supported by Hyper-V\n");
> > +   return;
> > +   }
> 
> Which Hyper-V versions/Azure instance types don't support it? In case
> there's something fairly modern on the list, I'd suggest to drop this
> pr_info (or convert it to pr_debug) to not mislead users into thinking
> there's something wrong. In case they don't see 'Cold memory discard
> hint enabled' they know it is unsupported.
> 
This is not available in hypervisor spec 5.0 and was added in 6.0. I don't
see any problems with making it 'pr_debug'.

> > +#define HV_EXT_MEMORY_HEAT_HINT_TYPE_COLD_DISCARD  2
> > +struct hv_memory_hint {
> > +   u64 type:2;
> > +   u64 reserved:62;
> > +   union hv_gpa_page_range ranges[];
> > +};
> 
> Other similar structures use '__packed' but I remember there was some
> disagreement if this is needed or not when everything is properly padded
> (or doesn't require padding like in this case).
> 
There are other places where we don't use '__packed' where everything is
properly aligned. But I will add it, as I don't see a problem with it.
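So the struct will become:

struct hv_memory_hint {
	u64 type:2;
	u64 reserved:62;
	union hv_gpa_page_range ranges[];
} __packed;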

- Sunil


[PATCH v3] x86/Hyper-V: Support for free page reporting

2021-01-06 Thread Sunil Muthuswamy
Linux now has support for free page reporting (36e66c554b5c) in
virtualized environments. On Hyper-V, when virtually backed VMs are
configured, Hyper-V advertises the cold memory discard capability when
supported. This patch adds support to hook into the free page reporting
infrastructure and leverage the Hyper-V cold memory discard hint
hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy 
Tested-by: Matheus Castello 
---
In V2:
- Addressed feedback comments
- Added page reporting config option tied to hyper-v balloon config

In V3:
- Addressed feedback from Vitaly
---
 arch/x86/hyperv/hv_init.c | 31 +++
 arch/x86/kernel/cpu/mshyperv.c|  6 +-
 drivers/hv/Kconfig|  1 +
 drivers/hv/hv_balloon.c   | 93 +++
 include/asm-generic/hyperv-tlfs.h | 32 ++-
 include/asm-generic/mshyperv.h|  2 +
 6 files changed, 162 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..5b610e47d091 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -528,3 +528,34 @@ bool hv_is_hibernation_supported(void)
return acpi_sleep_state_supported(ACPI_STATE_S4);
 }
 EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
+
+/* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
+bool hv_query_ext_cap(u64 cap_query)
+{
+   u64 *cap;
+   unsigned long flags;
+   u64 ext_cap = 0;
+
+   /*
+* Querying extended capabilities is an extended hypercall. Check if the
+* partition supports extended hypercalls first.
+*/
+   if (!(ms_hyperv.priv_high & HV_ENABLE_EXTENDED_HYPERCALLS))
+   return 0;
+
+   /*
+* Repurpose the input page arg to accept output from Hyper-V for
+* now because this is the only call that needs output from the
+* hypervisor. It should be fixed properly by introducing an
+* output arg once we have more places that require output.
+*/
+   local_irq_save(flags);
+   cap = *(u64 **)this_cpu_ptr(hyperv_pcpu_input_arg);
+   if (hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap) ==
+   HV_STATUS_SUCCESS)
+   ext_cap = *cap;
+
+   local_irq_restore(flags);
+   return ext_cap & cap_query;
+}
+EXPORT_SYMBOL_GPL(hv_query_ext_cap);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4550cb..f4c0d69c61ae 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -225,11 +225,13 @@ static void __init ms_hyperv_init_platform(void)
 * Extract the features and hints
 */
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+   ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
-   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
-   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
+   pr_info("Hyper-V: privilege flags low:0x%x, high:0x%x, hints:0x%x, 
misc:0x%x\n",
+   ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
+   ms_hyperv.misc_features);
 
ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
ms_hyperv.max_lp_index = cpuid_ebx(HYPERV_CPUID_IMPLEMENT_LIMITS);
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 79e5356a737a..66c794d92391 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -23,6 +23,7 @@ config HYPERV_UTILS
 config HYPERV_BALLOON
tristate "Microsoft Hyper-V Balloon driver"
depends on HYPERV
+   select PAGE_REPORTING
help
  Select this option to enable Hyper-V Balloon driver.
 
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 8c471823a5af..c0ff0a48f540 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -563,6 +564,10 @@ struct hv_dynmem_device {
 * The negotiated version agreed by host.
 */
__u32 version;
+
+#ifdef CONFIG_PAGE_REPORTING
+   struct page_reporting_dev_info pr_dev_info;
+#endif
 };
 
 static struct hv_dynmem_device dm_device;
@@ -1568,6 +1573,84 @@ static void balloon_onchannelcallback(void *context)
 
 }
 
+#ifdef CONFIG_PAGE_REPORTING
+/* Hyper-V only supports reporting 2MB pages or higher */
+#define HV_MIN_PAGE_REPORTING_ORDER	9
+#define HV_MIN_PAGE_REPORTING_LEN	(HV_HYP_PAGE_SIZE << HV_MIN_PAGE_REPORTING_ORDER)
+static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
+   struct scatterlist *sgl, unsigned int nents)
+{
+   unsigned long flags;
+   struct hv_memory_hint *hint;
+   int i;
+ 

[PATCH] Hyper-V: pci: x64: Generalize irq/msi set-up and handling

2021-01-06 Thread Sunil Muthuswamy
Currently, the operations related to IRQ/MSI in Hyper-V vPCI are
implemented in x86-specific code. In order to support virtual PCI on
Hyper-V for other architectures, introduce generic interfaces to replace
the x86-specific ones. There are no functional changes in this patch.

Signed-off-by: Sunil Muthuswamy 
Signed-off-by: Boqun Feng (Microsoft) 
---
 arch/x86/include/asm/mshyperv.h | 24 +
 drivers/pci/controller/pci-hyperv.c | 33 +
 2 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..05b32ef57e34 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -245,6 +245,30 @@ bool hv_vcpu_is_preempted(int vcpu);
 static inline void hv_apic_init(void) {}
 #endif
 
+#define hv_msi_handler handle_edge_irq
+#define hv_msi_handler_name "edge"
+#define hv_msi_prepare pci_msi_prepare
+
+/* Returns the Hyper-V PCI parent MSI vector domain. */
+static inline struct irq_domain *hv_msi_parent_vector_domain(void)
+{
+   return x86_vector_domain;
+}
+
+/* Returns the interrupt vector mapped to the given IRQ. */
+static inline unsigned int hv_msi_get_int_vector(struct irq_data *data)
+{
+   struct irq_cfg *cfg = irqd_cfg(data);
+
+   return cfg->vector;
+}
+
+/* Get the IRQ delivery mode. */
+static inline u8 hv_msi_irq_delivery_mode(void)
+{
+   return APIC_DELIVERY_MODE_FIXED;
+}
+
 static inline void hv_set_msi_entry_from_desc(union hv_msi_entry *msi_entry,
  struct msi_desc *msi_desc)
 {
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 6db8d96a78eb..9ca740d275d7 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -43,12 +43,11 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -1194,7 +1193,6 @@ static void hv_irq_mask(struct irq_data *data)
 static void hv_irq_unmask(struct irq_data *data)
 {
struct msi_desc *msi_desc = irq_data_get_msi_desc(data);
-   struct irq_cfg *cfg = irqd_cfg(data);
struct hv_retarget_device_interrupt *params;
struct hv_pcibus_device *hbus;
struct cpumask *dest;
@@ -1223,7 +1221,7 @@ static void hv_irq_unmask(struct irq_data *data)
   (hbus->hdev->dev_instance.b[7] << 8) |
   (hbus->hdev->dev_instance.b[6] & 0xf8) |
   PCI_FUNC(pdev->devfn);
-   params->int_target.vector = cfg->vector;
+   params->int_target.vector = hv_msi_get_int_vector(data);
 
/*
 * Honoring apic->delivery_mode set to APIC_DELIVERY_MODE_FIXED by
@@ -1324,7 +1322,7 @@ static u32 hv_compose_msi_req_v1(
int_pkt->wslot.slot = slot;
int_pkt->int_desc.vector = vector;
int_pkt->int_desc.vector_count = 1;
-   int_pkt->int_desc.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   int_pkt->int_desc.delivery_mode = hv_msi_irq_delivery_mode();
 
/*
 * Create MSI w/ dummy vCPU set, overwritten by subsequent retarget in
@@ -1345,7 +1343,7 @@ static u32 hv_compose_msi_req_v2(
int_pkt->wslot.slot = slot;
int_pkt->int_desc.vector = vector;
int_pkt->int_desc.vector_count = 1;
-   int_pkt->int_desc.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   int_pkt->int_desc.delivery_mode = hv_msi_irq_delivery_mode();
 
/*
 * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
@@ -1372,7 +1370,6 @@ static u32 hv_compose_msi_req_v2(
  */
 static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
-   struct irq_cfg *cfg = irqd_cfg(data);
struct hv_pcibus_device *hbus;
struct vmbus_channel *channel;
struct hv_pci_dev *hpdev;
@@ -1422,7 +1419,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
dest,
hpdev->desc.win_slot.slot,
-   cfg->vector);
+   hv_msi_get_int_vector(data));
break;
 
case PCI_PROTOCOL_VERSION_1_2:
@@ -1430,7 +1427,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
dest,
hpdev->desc.win_slot.slot,
-   cfg->vector);
+   hv_msi_get_int_vector(data));
break;
 
default:
@@ -1541,12 +1538,13 @@ static struct irq_chip hv_msi_irq_chip = {
   

RE: [PATCH net] hv_sock: Fix hang when a connection is closed

2019-07-29 Thread Sunil Muthuswamy



> -----Original Message-----
> From: Dexuan Cui 
> Sent: Sunday, July 28, 2019 11:32 AM
> To: Sunil Muthuswamy ; David Miller 
> ; net...@vger.kernel.org
> Cc: KY Srinivasan ; Haiyang Zhang 
> ; Stephen Hemminger
> ; sas...@kernel.org; Michael Kelley 
> ; linux-hyp...@vger.kernel.org; linux-
> ker...@vger.kernel.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com; vkuznets ;
> marcelo.ce...@canonical.com
> Subject: [PATCH net] hv_sock: Fix hang when a connection is closed
> 
> 
> hvs_do_close_lock_held() may decrease the reference count to 0 and free the
> sk struct completely, and then the following release_sock(sk) may hang.
> 
> Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
> Signed-off-by: Dexuan Cui 
> Cc: sta...@vger.kernel.org
> 
> ---
> With the proper kernel debugging options enabled, first a warning can
> appear:
> 
> kworker/1:0/4467 is freeing memory ..., with a lock still held there!
> stack backtrace:
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> Call Trace:
>  dump_stack+0x67/0x90
>  debug_check_no_locks_freed.cold.52+0x78/0x7d
>  slab_free_freelist_hook+0x85/0x140
>  kmem_cache_free+0xa5/0x380
>  __sk_destruct+0x150/0x260
>  hvs_close_connection+0x24/0x30 [hv_sock]
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
> and then the following release_sock(sk) can hang:
> 
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
> ...
> irq event stamp: 62890
> CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: GW 5.2.0+ #39
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
> ...
> Call Trace:
>  do_raw_spin_lock+0xab/0xb0
>  release_sock+0x19/0xb0
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
>  net/vmw_vsock/hyperv_transport.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index f2084e3f7aa4..efbda8ef1eff 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -309,9 +309,16 @@ static void hvs_close_connection(struct vmbus_channel *chan)
>  {
>   struct sock *sk = get_per_channel_state(chan);
> 
> + /* Grab an extra reference since hvs_do_close_lock_held() may decrease
> +  * the reference count to 0 by calling sock_put(sk).
> +  */
> + sock_hold(sk);
> +

To me, it seems like when 'hvs_close_connection' is called, there should always
be an outstanding reference to the socket. The reference that is dropped by
'hvs_do_close_lock_held' is a legitimate reference that was taken by
'hvs_close_lock_held'. In other words, I think the right solution is to always
maintain a reference to the socket until this routine is called, and to drop
that reference here. That can be done by taking the reference to the socket
prior to 'vmbus_set_chn_rescind_callback(chan, hvs_close_connection)' and
dropping it at the end of 'hvs_close_connection'.
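
A rough sketch of that suggestion (this is essentially what the v2 patch later
in this thread ends up doing):

	/* In hvs_open_connection(): take a reference that the rescind
	 * callback will own.
	 */
	sock_hold(conn_from_host ? new : sk);
	vmbus_set_chn_rescind_callback(chan, hvs_close_connection);

	/* In hvs_close_connection(): drop that reference as the last step. */
	lock_sock(sk);
	hvs_do_close_lock_held(vsock_sk(sk), true);
	release_sock(sk);
	sock_put(sk);	/* reference taken in hvs_open_connection() */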

>   lock_sock(sk);
>   hvs_do_close_lock_held(vsock_sk(sk), true);
>   release_sock(sk);
> +
> + sock_put(sk);
>  }
> 
>  static void hvs_open_connection(struct vmbus_channel *chan)
> --
> 2.19.1

Thanks for taking a look at this. We should queue this fix and the other
hvsocket fixes for the stable branch.


[PATCH net-next] hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers

2019-05-22 Thread Sunil Muthuswamy
Currently, the hv_sock buffer size is static and can't scale to the
bandwidth requirements of the application. This change allows the
applications to influence the socket buffer sizes using the SO_SNDBUF and
the SO_RCVBUF socket options.

A few interesting points to note:
1. Since VMBus does not allow resizing the ring buffers, the socket buffer size
option must be set prior to establishing the connection for it to take effect
(see the sketch below).
2. Setting the socket option comes with the cost of that much memory being
reserved/allocated by the kernel for the lifetime of the connection.
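
For illustration, a minimal userspace sketch of point 1: the buffer size is
requested with setsockopt() before connect(). The CID and port values are
placeholders and error handling is abbreviated.

	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <linux/vm_sockets.h>

	static int hvs_connect_with_sndbuf(unsigned int cid, unsigned int port)
	{
		struct sockaddr_vm addr;
		int sndbuf = 64 * 1024;	/* ask for a 64KB send ring */
		int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;

		/* Must happen before connect(): the ring cannot be resized
		 * once the connection is established (point 1 above).
		 */
		setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

		memset(&addr, 0, sizeof(addr));
		addr.svm_family = AF_VSOCK;
		addr.svm_cid = cid;
		addr.svm_port = port;

		if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}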

Perf data:
Total Data Transfer: 1GB
Single threaded reader/writer
Results below are summarized over 10 iterations.

Linux hvsocket writer + Windows hvsocket reader:
|-----------------|-----------------|-----------------|-----------------|---------------------|
| Packet size ->  |      128B       |       1KB       |       4KB       |        64KB         |
|-----------------|-----------------|-----------------|-----------------|---------------------|
| SO_SNDBUF size  |                Throughput in MB/s (min/max/avg/median):                  |
|       v         |                                                                           |
|-----------------|-----------------|-----------------|-----------------|---------------------|
|  Default        | 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   410/491/462/470   |
|  16KB           | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   592/824/692/676   |
|  32KB           | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 1593/2124/2000/2085 |
|  64KB           | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 1784/1943/1862/1843 |
|-----------------|-----------------|-----------------|-----------------|---------------------|

Windows hvsocket writer + Linux hvsocket reader:
|-----------------|-------------|-----------------|---------------------|---------------------|
| Packet size ->  |    128B     |       1KB       |         4KB         |        64KB         |
|-----------------|-------------|-----------------|---------------------|---------------------|
| SO_RCVBUF size  |            Throughput in MB/s (min/max/avg/median):                      |
|       v         |                                                                           |
|-----------------|-------------|-----------------|---------------------|---------------------|
|  Default        | 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   659/701/676/678   |
|  16KB           | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   602/624/615/615   |
|  32KB           | 62/83/73/73 | 471/529/496/494 |  830/1046/935/939   | 944/1180/1070/1100  |
|  64KB           | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 1605/1819/1670/1660 |
|-----------------|-------------|-----------------|---------------------|---------------------|

Signed-off-by: Sunil Muthuswamy 
---
- The table above exceeds the 75-char limit for the patch. If that's a
problem, I can try to squeeze it in somehow within the limit.

- The patch has been previously submitted to net and reviewed. The
feedback was to submit it to net-next.

 net/vmw_vsock/hyperv_transport.c | 50 
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 982a8dc..8d3a7b0 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -23,14 +23,14 @@
 #include 
 #include 
 
-/* The host side's design of the feature requires 6 exact 4KB pages for
- * recv/send rings respectively -- this is suboptimal considering memory
- * consumption, however unluckily we have to live with it, before the
- * host comes up with a better design in the future.
+/* Older (VMBUS version 'VERSION_WIN10' or before) Windows hosts have some
+ * stricter requirements on the hv_sock ring buffer size of six 4K pages. Newer
+ * hosts don't have this limitation; but, keep the defaults the same for compat.
  */
 #define PAGE_SIZE_4K   4096
 #define RINGBUFFER_HVS_RCV_SIZE (PAGE_SIZE_4K * 6)
 #define RINGBUFFER_HVS_SND_SIZE (PAGE_SIZE_4K * 6)
+#define RINGBUFFER_HVS_MAX_SIZE (PAGE_SIZE_4K * 64)
 
 /* The MTU is 16KB per the host side's design */
 #define HVS_MTU_SIZE   (1024 * 16)
@@ -344,9 +344,12 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 
struct sockaddr_vm addr;
struct sock *sk, *new = NULL;
-   struct vsock_sock *vnew;
-   struct hvsock *hvs, *hvs_new;
+   struct vsock_sock *vnew = NULL;
+   struct hvsock *hvs = NULL;
+   struct hvsock *hvs_new = NULL;
+   int rcvbuf;
int ret;
+   int sndbuf;
 
if_type = &chan->offermsg.offer.if_type;
if_instance = &chan->offermsg.offer.if_instance;
@@ -388,9 +391,34 @@ sta

[PATCH net-next] hv_sock: perf: loop in send() to maximize bandwidth

2019-05-22 Thread Sunil Muthuswamy
Currently, the hv_sock send() iterates once over the buffer, puts data into
the VMBUS channel and returns. It doesn't take advantage of the case when there
is a simultaneous reader draining data from the channel. In such a case,
the send() can maximize the bandwidth (and consequently minimize the cpu
cycles) by iterating until the channel is found to be full.

Perf data:
Total Data Transfer: 10GB/iteration
Single threaded reader/writer, Linux hvsocket writer with Windows hvsocket
reader
Packet size: 64KB
CPU sys time was captured using the 'time' command for the writer to send
10GB of data.
'Send Buffer Loop' is with the patch applied.
The values below are over 10 iterations.

|--------|------------|----------|------------|----------|
|        |         Current       |    Send Buffer Loop   |
|        |------------|----------|------------|----------|
|        | Throughput | CPU sys  | Throughput | CPU sys  |
|        |   (MB/s)   | time (s) |   (MB/s)   | time (s) |
|--------|------------|----------|------------|----------|
| Min    |    407     |  7.048   |    401     |  5.958   |
| Max    |    455     |  7.563   |    542     |  6.993   |
| Avg    |    440     |  7.411   |    451     |  6.639   |
| Median |    446     |  7.417   |    447     |  6.761   |
|--------|------------|----------|------------|----------|

Observation:
1. The avg throughput doesn't really change much with this change for this
scenario. This is most probably because the bottleneck on throughput is
somewhere else.
2. The average system (or kernel) cpu time goes down by 10%+ with this
change, for the same amount of data transfer.

Signed-off-by: Sunil Muthuswamy 
---
- The patch has been previously submitted to net and reviewed. The
feedback was to submit it to net-next.

 net/vmw_vsock/hyperv_transport.c | 45 +++-
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 982a8dc..7c13032 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -55,8 +55,9 @@ struct hvs_recv_buf {
 };
 
 /* We can send up to HVS_MTU_SIZE bytes of payload to the host, but let's use
- * a small size, i.e. HVS_SEND_BUF_SIZE, to minimize the dynamically-allocated
- * buffer, because tests show there is no significant performance difference.
+ * a smaller size, i.e. HVS_SEND_BUF_SIZE, to maximize concurrency between the
+ * guest and the host processing as one VMBUS packet is the smallest processing
+ * unit.
  *
  * Note: the buffer can be eliminated in the future when we add new VMBus
  * ringbuffer APIs that allow us to directly copy data from userspace buffer
@@ -644,7 +645,9 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
struct hvsock *hvs = vsk->trans;
struct vmbus_channel *chan = hvs->chan;
struct hvs_send_buf *send_buf;
-   ssize_t to_write, max_writable, ret;
+   ssize_t to_write, max_writable;
+   ssize_t ret = 0;
+   ssize_t bytes_written = 0;
 
BUILD_BUG_ON(sizeof(*send_buf) != PAGE_SIZE_4K);
 
@@ -652,20 +655,34 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
if (!send_buf)
return -ENOMEM;
 
-   max_writable = hvs_channel_writable_bytes(chan);
-   to_write = min_t(ssize_t, len, max_writable);
-   to_write = min_t(ssize_t, to_write, HVS_SEND_BUF_SIZE);
-
-   ret = memcpy_from_msg(send_buf->data, msg, to_write);
-   if (ret < 0)
-   goto out;
+   /* Reader(s) could be draining data from the channel as we write.
+* Maximize bandwidth, by iterating until the channel is found to be
+* full.
+*/
+   while (len) {
+   max_writable = hvs_channel_writable_bytes(chan);
+   if (!max_writable)
+   break;
+   to_write = min_t(ssize_t, len, max_writable);
+   to_write = min_t(ssize_t, to_write, HVS_SEND_BUF_SIZE);
+   /* memcpy_from_msg is safe for loop as it advances the offsets
+* within the message iterator.
+*/
+   ret = memcpy_from_msg(send_buf->data, msg, to_write);
+   if (ret < 0)
+   goto out;
 
-   ret = hvs_send_data(hvs->chan, send_buf, to_write);
-   if (ret < 0)
-   goto out;
+   ret = hvs_send_data(hvs->chan, send_buf, to_write);
+   if (ret < 0)
+   goto out;
 
-   ret = to_write;
+   bytes_written += to_write;
+   len -= to_write;
+   }
 out:
+   /* If any data has been sent, return that */
+   if (

RE: [PATCH] hv_sock: Fix data loss upon socket close

2019-05-14 Thread Sunil Muthuswamy



> -Original Message-
> From: linux-hyperv-ow...@vger.kernel.org  
> On Behalf Of David Miller
> Sent: Thursday, May 9, 2019 1:58 PM
> To: Sunil Muthuswamy 
> Cc: KY Srinivasan ; Haiyang Zhang 
> ; Stephen Hemminger
> ; sas...@kernel.org; Dexuan Cui 
> ; Michael Kelley ;
> net...@vger.kernel.org; linux-hyp...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] hv_sock: Fix data loss upon socket close
> 
> From: Sunil Muthuswamy 
> Date: Wed, 8 May 2019 23:10:35 +
> 
> > +static inline void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
> 
> Please do not use the inline keyword in foo.c files, let the compiler decide.
> 
Thanks, will fix in the next version.
> Also, longer term thing, I notice that vsock_remove_socket() is very
> inefficient locking-wise.  It takes the table lock to do the placement
> test, and takes it again to do the removal.  Might even be racy.
Agreed. The check & remove should be done as an atomic operation.
This can be taken up as a separate patch.
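
A sketch of what that could look like; the lock and helper names are
assumptions based on net/vmw_vsock/af_vsock.c, not an actual patch:

	/* Hypothetical: do the placement test and the removal under a
	 * single acquisition of vsock_table_lock instead of two.
	 */
	static void vsock_remove_sock_atomic(struct vsock_sock *vsk)
	{
		spin_lock_bh(&vsock_table_lock);
		if (__vsock_in_bound_table(vsk))
			__vsock_remove_bound(vsk);
		if (__vsock_in_connected_table(vsk))
			__vsock_remove_connected(vsk);
		spin_unlock_bh(&vsock_table_lock);
	}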


RE: [PATCH] hv_sock: Fix data loss upon socket close

2019-05-14 Thread Sunil Muthuswamy



> -Original Message-
> From: Dexuan Cui 
> Sent: Friday, May 10, 2019 8:57 PM
> To: Sunil Muthuswamy ; KY Srinivasan 
> ; Haiyang Zhang ;
> Stephen Hemminger ; Sasha Levin ; 
> David S. Miller ;
> Michael Kelley 
> Cc: net...@vger.kernel.org; linux-hyp...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close
> 
> > From: Sunil Muthuswamy 
> > Sent: Wednesday, May 8, 2019 4:11 PM
> >
> > Currently, when a hvsock socket is closed, the socket is shutdown and
> > immediately a RST is sent. There is no wait for the FIN packet to arrive
> > from the other end. This can lead to data loss since the connection is
> > terminated abruptly. This can manifest easily in cases of a fast guest
> > hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> > not following the proper STREAM(TCP) closing handshake mechanism.
> 
> Hi Sunil,
> It looks to me the above description is inaccurate.
> 
> In the upstream Linux kernel, closing a hv_sock file descriptor may hang
> in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
> the connection is closed. This is bad and should be fixed, but I don't think
> the current code can cause data loss: when Linux calls hvs_destruct() ->
> vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
> -> vmbus_close() to close the channel, Linux knows the host app has
> already called close(), and normally that means the host app has
> received all the data from the connection.
> 
> BTW, technically speaking, in hv_sock there is no RST packet, while there
> is indeed a payload_len==0 packet, which is similar to TCP FIN.
> 
> I think by saying "a RST is sent" you mean Linux VM closes the channel.
> 
> > The fix involves adding support for the delayed close of hvsock, which is
> > in-line with other socket providers such as virtio.
> 
> With this "delayed close" patch, Linux's close() won't hang until the host
> also closes the connection. This is good!
> 
The next version of the patch will only focus on implementing the delayed
(or background) close logic. I will update the title and the description of the
next version patch to more accurately reflect the change. 
> > While closing, the
> > socket waits for a constant timeout, for the FIN packet to arrive from the
> > other end. On timeout, it will terminate the connection (i.e a RST).
> 
> As I mentioned above, I suppose the "RST" means Linux closes the channel.
> 
> When Linux closes a connection, the FIN packet is written into the shared
> guest-to-host channel ringbuffer immediately, so the host is able to see it
> immediately, but the real question is: what if the host kernel and/or host app
> can not (timely) receive the data from the ringbuffer, including the FIN?
> 
> Does the host kernel guarantee it *always* timely fetches/caches all the
> data from a connection, even if the host app has not accept()'d the
> connection, or the host app is reading from the connection too slowly?
> 
> If the host doesn't guarantee that, then even with this patch there is still
> a chance Linux can time out, and close the channel before the host
> finishes receiving all the data.
> 
The TCP protocol does not guarantee that all the data gets delivered,
especially during close. Applications are required to mitigate this by using a
combination of shutdown() and subsequent read() on both the client and server
sides.
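
For example, a writer that wants all of its data to be consumed would do
something along these lines (a simplified userspace sketch):

	char buf[4096];
	ssize_t n;

	shutdown(fd, SHUT_WR);		/* send our FIN; no more writes */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;			/* drain until the peer closes too */
	close(fd);			/* peer has seen the end of our stream */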
> I'm curious how Windows guest implements the "async close"?
> Does Windows guest also use the same timeout strategy here? If yes,
> what's the timeout value used?
> 
Windows also implements the delayed close logic in a similar fashion. You can
look up the MSDN article on 'graceful shutdown'.
> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > index a827547..62b986d 100644
> 
> Sorry, I need more time to review the rest of patch. Will try to reply ASAP.
> 
> > -static int hvs_update_recv_data(struct hvsock *hvs)
> > +static int hvs_update_recv_data(struct vsock_sock *vsk)
> >  {
> > struct hvs_recv_buf *recv_buf;
> > u32 payload_len;
> > +   struct hvsock *hvs = vsk->trans;
> >
> > recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> > payload_len = recv_buf->hdr.data_size;
> > @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
> > if (payload_len > HVS_MTU_SIZE)
> > return -EIO;
> >
> > -   if (payload_len == 0)
> > +   /* Peer shutdown */
> > +   if (payl

RE: [PATCH v2 net] hv_sock: Fix hang when a connection is closed

2019-07-31 Thread Sunil Muthuswamy
> -Original Message-
> From: Dexuan Cui 
> Sent: Tuesday, July 30, 2019 6:26 PM
> To: Sunil Muthuswamy ; David Miller 
> ; net...@vger.kernel.org
> Cc: KY Srinivasan ; Haiyang Zhang 
> ; Stephen Hemminger
> ; sas...@kernel.org; Michael Kelley 
> ; linux-hyp...@vger.kernel.org; linux-
> ker...@vger.kernel.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com; vkuznets ;
> marcelo.ce...@canonical.com
> Subject: [PATCH v2 net] hv_sock: Fix hang when a connection is closed
> 
> 
> There is a race condition for an established connection that is being closed
> by the guest: the refcnt is 4 at the end of hvs_release() (Note: here the
> 'remove_sock' is false):
> 
> 1 for the initial value;
> 1 for the sk being in the bound list;
> 1 for the sk being in the connected list;
> 1 for the delayed close_work.
> 
> After hvs_release() finishes, __vsock_release() -> sock_put(sk) *may*
> decrease the refcnt to 3.
> 
> Concurrently, hvs_close_connection() runs in another thread:
>   calls vsock_remove_sock() to decrease the refcnt by 2;
>   call sock_put() to decrease the refcnt to 0, and free the sk;
>   next, the "release_sock(sk)" may hang due to use-after-free.
> 
> In the above, after hvs_release() finishes, if hvs_close_connection() runs
> faster than "__vsock_release() -> sock_put(sk)", then there is not any issue,
> because at the beginning of hvs_close_connection(), the refcnt is still 4.
> 
> The issue can be resolved if an extra reference is taken when the
> connection is established.
> 
> Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
> Signed-off-by: Dexuan Cui 
> Cc: sta...@vger.kernel.org
> 
> ---
> 
> Changes in v2:
>   Changed the location of the sock_hold() call.
>   Updated the changelog accordingly.
> 
>   Thanks Sunil for the suggestion!
> 
> 
> With the proper kernel debugging options enabled, first a warning can
> appear:
> 
> kworker/1:0/4467 is freeing memory ..., with a lock still held there!
> stack backtrace:
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> Call Trace:
>  dump_stack+0x67/0x90
>  debug_check_no_locks_freed.cold.52+0x78/0x7d
>  slab_free_freelist_hook+0x85/0x140
>  kmem_cache_free+0xa5/0x380
>  __sk_destruct+0x150/0x260
>  hvs_close_connection+0x24/0x30 [hv_sock]
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
> and then the following release_sock(sk) can hang:
> 
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
> ...
> irq event stamp: 62890
> CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: GW 5.2.0+ #39
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
> ...
> Call Trace:
>  do_raw_spin_lock+0xab/0xb0
>  release_sock+0x19/0xb0
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
>  net/vmw_vsock/hyperv_transport.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index f2084e3f7aa4..9d864ebeb7b3 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -312,6 +312,11 @@ static void hvs_close_connection(struct vmbus_channel *chan)
>   lock_sock(sk);
>   hvs_do_close_lock_held(vsock_sk(sk), true);
>   release_sock(sk);
> +
> + /* Release the refcnt for the channel that's opened in
> +  * hvs_open_connection().
> +  */
> + sock_put(sk);
>  }
> 
>  static void hvs_open_connection(struct vmbus_channel *chan)
> @@ -407,6 +412,9 @@ static void hvs_open_connection(struct vmbus_channel *chan)
>   }
> 
>   set_per_channel_state(chan, conn_from_host ? new : sk);
> +
> + /* This reference will be dropped by hvs_close_connection(). */
> + sock_hold(conn_from_host ? new : sk);
>   vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
> 
>   /* Set the pending send size to max packet size to always get
> --
> 2.19.1

Reviewed-by: Sunil Muthuswamy 


RE: [PATCH net] hvsock: fix epollout hang from race condition

2019-06-17 Thread Sunil Muthuswamy



> -Original Message-
> From: linux-hyperv-ow...@vger.kernel.org  
> On Behalf Of David Miller
> Sent: Sunday, June 16, 2019 1:55 PM
> To: Dexuan Cui 
> Cc: Sunil Muthuswamy ; KY Srinivasan 
> ; Haiyang Zhang ;
> Stephen Hemminger ; sas...@kernel.org; Michael Kelley 
> ;
> net...@vger.kernel.org; linux-hyp...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net] hvsock: fix epollout hang from race condition
> 
> From: Dexuan Cui 
> Date: Sat, 15 Jun 2019 03:22:32 +
> 
> > These warnings are not introduced by this patch from Sunil.
> >
> > I'm not sure why I didn't notice these warnings before.
> > Probably my gcc version is not new enough?
> >
> > Actually these warnings are bogus, as I checked the related functions,
> > which may confuse the compiler's static analysis.
> >
> > I'm going to make a patch to initialize the pointers to NULL to suppress
> > the warnings. My patch will be based on the latest's net.git + this patch
> > from Sunil.
> 
> Sunil should then resubmit his patch against something that has the
> warning suppression patch applied.

David, Dexuan's patch to suppress the warnings seems to be applied now
to the 'net' branch. Can we please get this patch applied as well?


[PATCH net v2] hvsock: fix epollout hang from race condition

2019-06-17 Thread Sunil Muthuswamy
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
  epoll_wait(fd_out, EPOLLOUT) and write(fd_out)
- start a reader thread that reads data from fd_in with a combination of
  epoll_wait(fd_in, EPOLLIN) and read(fd_in)
- On the host, there are two threads that are reading/writing data to the
  hvsocket

stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll

Race condition:
  check for epollout from ep_poll():
    assume no writable space in the socket
    hvs_stream_has_space() returns 0
  check for epollin from ep_poll():
    assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
    hvs_stream_has_space() will clear the channel pending send size
    host will not notify the guest because the pending send size has
    been cleared, and so the hvsocket will never mark the socket writable

Now, the EPOLLOUT will never return even if the socket write buffer is
empty.

The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not every time.

This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.

Signed-off-by: Sunil Muthuswamy 
Reviewed-by: Dexuan Cui 
---
- Resubmitting the patch after taking care of some spurious warnings.

 net/vmw_vsock/hyperv_transport.c | 39 ---
 1 file changed, 8 insertions(+), 31 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index e4801c7..cd3f47f 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -220,18 +220,6 @@ static void hvs_set_channel_pending_send_size(struct vmbus_channel *chan)
set_channel_pending_send_size(chan,
  HVS_PKT_LEN(HVS_SEND_BUF_SIZE));
 
-   /* See hvs_stream_has_space(): we must make sure the host has seen
-* the new pending send size, before we can re-check the writable
-* bytes.
-*/
-   virt_mb();
-}
-
-static void hvs_clear_channel_pending_send_size(struct vmbus_channel *chan)
-{
-   set_channel_pending_send_size(chan, 0);
-
-   /* Ditto */
virt_mb();
 }
 
@@ -301,9 +289,6 @@ static void hvs_channel_cb(void *ctx)
if (hvs_channel_readable(chan))
sk->sk_data_ready(sk);
 
-   /* See hvs_stream_has_space(): when we reach here, the writable bytes
-* may be already less than HVS_PKT_LEN(HVS_SEND_BUF_SIZE).
-*/
if (hv_get_bytes_to_write(&chan->outbound) > 0)
sk->sk_write_space(sk);
 }
@@ -404,6 +389,13 @@ static void hvs_open_connection(struct vmbus_channel *chan)
set_per_channel_state(chan, conn_from_host ? new : sk);
vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
 
+   /* Set the pending send size to max packet size to always get
+* notifications from the host when there is enough writable space.
+* The host is optimized to send notifications only when the pending
+* size boundary is crossed, and not always.
+*/
+   hvs_set_channel_pending_send_size(chan);
+
if (conn_from_host) {
new->sk_state = TCP_ESTABLISHED;
sk->sk_ack_backlog++;
@@ -697,23 +689,8 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
 static s64 hvs_stream_has_space(struct vsock_sock *vsk)
 {
struct hvsock *hvs = vsk->trans;
-   struct vmbus_channel *chan = hvs->chan;
-   s64 ret;
-
-   ret = hvs_channel_writable_bytes(chan);
-   if (ret > 0)  {
-   hvs_clear_channel_pending_send_size(chan);
-   } else {
-   /* See hvs_channel_cb() */
-   hvs_set_channel_pending_send_size(chan);
-
-   /* Re-check the writable bytes to avoid race */
-   ret = hvs_channel_writable_bytes(chan);
-   if (ret > 0)
-   hvs_clear_channel_pending_send_size(chan);
-   }
 
-   return ret;
+   return hvs_channel_writable_bytes(hvs->chan);
 }
 
 static u64 hvs_stream_rcvhiwat(struct vsock_sock *vsk)
-- 
2.7.4



RE: [PATCH net] hvsock: fix epollout hang from race condition

2019-06-17 Thread Sunil Muthuswamy



> -Original Message-
> From: David Miller 
> Sent: Monday, June 17, 2019 11:56 AM
> To: Sunil Muthuswamy 
> Cc: Dexuan Cui ; KY Srinivasan ; 
> Haiyang Zhang ; Stephen
> Hemminger ; sas...@kernel.org; Michael Kelley 
> ; net...@vger.kernel.org;
> linux-hyp...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net] hvsock: fix epollout hang from race condition
> 
> From: Sunil Muthuswamy 
> Date: Mon, 17 Jun 2019 18:47:08 +
> 
> >
> >
> >> -Original Message-
> >> From: linux-hyperv-ow...@vger.kernel.org 
> >>  On Behalf Of David Miller
> >> Sent: Sunday, June 16, 2019 1:55 PM
> >> To: Dexuan Cui 
> >> Cc: Sunil Muthuswamy ; KY Srinivasan 
> >> ; Haiyang Zhang
> ;
> >> Stephen Hemminger ; sas...@kernel.org; Michael 
> >> Kelley ;
> >> net...@vger.kernel.org; linux-hyp...@vger.kernel.org; 
> >> linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH net] hvsock: fix epollout hang from race condition
> >>
> >> From: Dexuan Cui 
> >> Date: Sat, 15 Jun 2019 03:22:32 +
> >>
> >> > These warnings are not introduced by this patch from Sunil.
> >> >
> >> > I'm not sure why I didn't notice these warnings before.
> >> > Probably my gcc version is not new enough?
> >> >
> >> > Actually these warnings are bogus, as I checked the related functions,
> >> > which may confuse the compiler's static analysis.
> >> >
> >> > I'm going to make a patch to initialize the pointers to NULL to suppress
> >> > the warnings. My patch will be based on the latest's net.git + this patch
> >> > from Sunil.
> >>
> >> Sunil should then resubmit his patch against something that has the
> >> warning suppression patch applied.
> >
> > David, Dexuan's patch to suppress the warnings seems to be applied now
> > to the 'net' branch. Can we please get this patch applied as well?
> 
> I don't know how else to say "Sunil should then resubmit his patch"
> 
> Please just resubmit it!

The patch does not change at all. So, I was hoping we could reapply it. But, I
have resubmitted the patch. Thanks.


RE: [EXTERNAL] Re: [PATCH] x86/Hyper-V: Support for free page reporting

2020-05-22 Thread Sunil Muthuswamy
> > +   if (hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap) ==
> > +   HV_STATUS_SUCCESS)
> 
> You're using the input page as the output parameter. Ideally we should
> introduce hyperv_pcpu_output_arg page, but that would waste one page per
> cpu just for this one call.
> 
> So for now I think this setup is fine, but I would like to add the
> following comment.
> 
> /*
>  * Repurpose the input_arg page to accept output from Hyper-V for
>  * now because this is the only call that needs output from the
>  * hypervisor. It should be fixed properly by introducing an
>  * output_arg page once we have more places that require output.
>  */

Sounds good. Will add it in v2.

> > +#ifdef CONFIG_PAGE_REPORTING
> > +static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
> > +   struct scatterlist *sgl, unsigned int nents)
> > +{
> > +   unsigned long flags;
> > +   struct hv_memory_hint *hint;
> > +   int i;
> > +   u64 status;
> > +   struct scatterlist *sg;
> > +
> > +   WARN_ON(nents > HV_MAX_GPA_PAGE_RANGES);
> 
> Should we return -ENOSPC here?

This is more of an assert because PAGE_REPORTING_CAPACITY is set to 32 and has
already been checked to be < HV_MAX_GPA_PAGE_RANGES in enable_page_reporting().

> > +   hint->type = HV_EXT_MEMORY_HEAT_HINT_TYPE_COLD_DISCARD;
> > +   hint->reserved = 0;
> > +   for (i = 0, sg = sgl; sg; sg = sg_next(sg), i++) {
> > +   int order;
> > +   union hv_gpa_page_range *range;
> > +
> 
> Unfortunately I can't find the semantics of this hypercall in TLFS 6, so
> I have a few questions here.

This structure is not specific to this hypercall.

> 
> > +   order = get_order(sg->length);
> > +   range = &hint->ranges[i];
> > +   range->address_space = 0;
> 
> I guess this means all address spaces?

'address_space' is being used here just as a zero initializer for the union. I
think the use of 'address_space' in the definition of hv_gpa_page_range is not
appropriate. This struct is defined in TLFS 6.0 with the same name, if you want
to look it up.

> 
> > +   range->page.largepage = 1;
> 
> What effect does this have? What if the page is a 4k page?

Page reporting infrastructure doesn't report 4k pages today, but only huge
pages (see PAGE_REPORTING_MIN_ORDER in page_reporting.h). Additionally, the
Hyper-V hypervisor only supports reporting of 2M pages and above. The current
code assumes that the minimum order will be 9, i.e., 2M pages and above.
If we feel that this could change in the future, or is an implementation detail
that should be protected against, I can add some checks in hv_balloon.c. But,
in that case, the balloon driver should be able to query the page reporting
min order somehow, which it currently cannot, since that value is private.
Alexander, do you have any suggestions or feedback here?

> 
> > +   range->page.additional_pages = (1ull << (order - 9)) - 1;
> 
> What is 9 here? Is there a macro name *ORDER that you can use?

It is to determine the count of 2M pages. I will define a macro.
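
Something along these lines, with the caveat that the final macro name is a
placeholder:

	/* A 2M large page covers 4K pages << 9; the name is illustrative. */
	#define HV_4K_TO_2M_PAGE_SHIFT	9

	/* Additional 2M pages in this range, beyond the first one. */
	range->page.additional_pages =
		(1ull << (order - HV_4K_TO_2M_PAGE_SHIFT)) - 1;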

> 
> Wei.


RE: [EXTERNAL] Re: [PATCH] x86/Hyper-V: Support for free page reporting

2020-05-22 Thread Sunil Muthuswamy
> As the only usage of this function looks like
> if (!(hv_query_ext_cap() & HV_EXT_CAPABILITY_MEMORY_COLD_DISCARD_HINT))
> 
> I would've change the interface to
> 
> bool hv_query_ext_cap(u64 cap)
> 
> so the usage would look like
> 
> if (!hv_query_ext_cap(HV_EXT_CAPABILITY_MEMORY_COLD_DISCARD_HINT))

Good idea. Will do in v2.
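
A sketch of the reworked interface; the per-cpu input page is reused for
output as discussed earlier in this thread, and the details here are an
approximation, not the final v2 code:

	bool hv_query_ext_cap(u64 cap_query)
	{
		unsigned long flags;
		u64 *cap, ext_caps = 0;
		u64 status;

		/* Reuse the per-cpu input page to hold the hypercall output. */
		local_irq_save(flags);
		cap = *(u64 **)this_cpu_ptr(hyperv_pcpu_input_arg);
		status = hv_do_hypercall(HV_EXT_CALL_QUERY_CAPABILITIES, NULL, cap);
		if ((status & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS)
			ext_caps = *cap;
		local_irq_restore(flags);

		return !!(ext_caps & cap_query);
	}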

> > ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
> > +   ms_hyperv.b_features = cpuid_ebx(HYPERV_CPUID_FEATURES);
> > ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
> > ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
> >
> > -   pr_info("Hyper-V: features 0x%x, hints 0x%x, misc 0x%x\n",
> > -   ms_hyperv.features, ms_hyperv.hints, ms_hyperv.misc_features);
> > +   pr_info("Hyper-V: features 0x%x, additional features: 0x%x, hints 0x%x, 
> > misc 0x%x\n",
> > +   ms_hyperv.features, ms_hyperv.b_features, ms_hyperv.hints,
> > +   ms_hyperv.misc_features);
> 
> HYPERV_CPUID_FEATURES(0x4003) EAX and EBX correspond to Partition
> Privilege Flags (TLFS), I'd suggest to take the opportunity and rename
> this to something like 'privilege flags low=0x%x high=0x%x'.
> 
> Also, I don't quite like 'ms_hyperv.b_features' as I'll always have to
> look at what it's being assigned to understand what it holds. I'd even
> suggest to rename ms_hyperv.features to ms_hyperv.priv_low and
> ms_hyperv.b_features tp ms_hyperv.priv_high. Or maybe even better, pack
> them to the same 'u64 ms_hyperv.privileges'.

Good idea. I will make the change to rename this to 'priv_high' in v2. I like
the idea of combining 'features' & 'priv_high' into a u64, but that would be a
cleanup change and a separate patch.

> 
> Why is largepage always '1'?
I have responded to a similar question from Wei. Page reporting only supports
huge pages, and so does the Hyper-V hypervisor. Let's follow this discussion
there.

> 
> > +   range->page.additional_pages = (1ull << (order - 9)) - 1;
> > +   range->base_large_pfn = page_to_pfn(sg_page(sg)) >> 9;
> 
> What is '9'? Could you please define it through PAGE_*/HPAGE_* macro?
Yes, I will define a macro. Essentially, it is to get a count of 2M pages.

> Nit: you could've just used
> 
> if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS) {
Sure, coming in v2.

> ...
> 
> > +   pr_err("Cold memory discard hypercall failed with status 
> > %llx\n",
> > +   status);
> > +   return -1;
> 
> -EFAULT or something like it maybe?
Coming in v2.

> > +#ifdef CONFIG_PAGE_REPORTING
> > +   if (enable_page_reporting() < 0)
> > +   goto probe_error;
> 
> Why? The hyperv-balloon driver itself may still be functional and you
> already set dm_device.pr_dev_info.report to NULL.
An error here would reflect an internal error that should not happen; the goal
was to make it easy to catch such errors, which are otherwise difficult to
catch with just a print. But, the code should follow the general spirit. I will
change this in v2.

> 
> >  enum HV_GENERIC_SET_FORMAT {
> > HV_GENERIC_SET_SPARSE_4K,
> > HV_GENERIC_SET_ALL,
> > @@ -371,6 +379,12 @@ union hv_gpa_page_range {
> 
> There is a comment before this structure:
> 
> /* HvFlushGuestPhysicalAddressList hypercall */
> 
> which is now obsolete.

I will note that it also applies to the 'HvExtCallMemoryHeatHint' hypercall.

> 
> > u64 largepage:1;
> > u64 basepfn:52;
> > } page;
> > +   struct {
> > +   u64:12;
> 
> What is this unnamed member? Another 'reserved', 'pad'? Let's name it.

Sure, coming in v2.

> 
> > +   u64 page_size:1;
> > +   u64 reserved:8;
> > +   u64 base_large_pfn:43;
> > +   };
> 
> Please name this structure in the union.

Sure, coming in v2.

> > + */
> > +#define HV_MAX_GPA_PAGE_RANGES ((PAGE_SIZE - sizeof(u64)) / \
> > +   sizeof(union hv_gpa_page_range))
> > +
> 
> The name HV_MAX_GPA_PAGE_RANGES sounds too generic and I think this is
> specific to the HvExtCallMemoryHeatHint hypercall as other hypercalls
> may have a different header length.
> 

Good idea to rename. Coming in v2.

> > +/* HvExtCallMemoryHeatHint hypercall */
> > +#define HV_EXT_MEMORY_HEAT_HINT_TYPE_COLD_DISCARD  2
> > +struct hv_memory_hint {
> > +   u64 type:2;
> > +   u64 reserved:62;
> > +   union hv_gpa_page_range ranges[1];
> 
> Why '[1]' and not '[]'? If it was '[]' you could've used 'sizeof(struct
> hv_memory_hint)' in HV_MAX_GPA_PAGE_RANGES macro definition instead of
> 'sizeof(u64)'.
> 
Good idea, coming in v2.
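
With a flexible array member, the struct and the max-ranges macro would look
roughly like the following; the macro name reflects the renaming feedback
above and is an assumption:

	struct hv_memory_hint {
		u64 type:2;
		u64 reserved:62;
		union hv_gpa_page_range ranges[];	/* flexible array member */
	};

	/* One hint header plus as many ranges as fit in a single page. */
	#define HV_MEMORY_HINT_MAX_GPA_PAGE_RANGES		\
		((PAGE_SIZE - sizeof(struct hv_memory_hint)) /	\
		 sizeof(union hv_gpa_page_range))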

- Sunil


RE: [EXTERNAL] Re: [PATCH RFC v1 08/18] x86/hyperv: handling hypercall page setup for root

2020-09-16 Thread Sunil Muthuswamy
> 
> On Tue, Sep 15, 2020 at 12:32:29PM +0200, Vitaly Kuznetsov wrote:
> > Wei Liu  writes:
> >
> > > When Linux is running as the root partition, the hypercall page will
> > > have already been setup by Hyper-V. Copy the content over to the
> > > allocated page.
> >
> > And we can't setup a new hypercall page by writing something different
> > to HV_X64_MSR_HYPERCALL, right?
> >
> 
> My understanding is that we can't, but Sunil can maybe correct me.

That is correct. For the root partition, the hypervisor has already allocated
the hypercall page. The root is required to query the page, map it into its
address space, and wrmsr to enable it. It cannot change the location of the
page. A guest, by contrast, can allocate and assign the hypercall page itself.
The guest side is covered a bit in the hypervisor TLFS (section 3.13 in TLFS
v6); the root side is not covered there yet.
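
Roughly, the root-side sequence is the following sketch; the mapping step is
elided and the MSR bit layout is per the TLFS:

	u64 hypercall_msr;

	/* The page already exists: discover it rather than allocate it. */
	rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr);

	/* ... map the page indicated by hypercall_msr into our address
	 * space; its location cannot be changed ...
	 */

	hypercall_msr |= HV_X64_MSR_HYPERCALL_ENABLE;
	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr);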
 


[PATCH net] hvsock: fix epollout hang from race condition

2019-06-13 Thread Sunil Muthuswamy
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
  epoll_wait(fd_out, EPOLLOUT) and write(fd_out)
- start a reader thread that reads data from fd_in with a combination of
  epoll_wait(fd_in, EPOLLIN) and read(fd_in)
- On the host, there are two threads that are reading/writing data to the
  hvsocket

stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll

Race condition:
  check for epollout from ep_poll():
    assume no writable space in the socket
    hvs_stream_has_space() returns 0
  check for epollin from ep_poll():
    assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
    hvs_stream_has_space() will clear the channel pending send size
    host will not notify the guest because the pending send size has
    been cleared, and so the hvsocket will never mark the socket writable

Now, the EPOLLOUT will never return even if the socket write buffer is
empty.

The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not every time.

This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.

Signed-off-by: Sunil Muthuswamy 
---
 net/vmw_vsock/hyperv_transport.c | 39 ---
 1 file changed, 8 insertions(+), 31 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 982a8dc..8d1ea9e 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -220,18 +220,6 @@ static void hvs_set_channel_pending_send_size(struct vmbus_channel *chan)
set_channel_pending_send_size(chan,
  HVS_PKT_LEN(HVS_SEND_BUF_SIZE));
 
-   /* See hvs_stream_has_space(): we must make sure the host has seen
-* the new pending send size, before we can re-check the writable
-* bytes.
-*/
-   virt_mb();
-}
-
-static void hvs_clear_channel_pending_send_size(struct vmbus_channel *chan)
-{
-   set_channel_pending_send_size(chan, 0);
-
-   /* Ditto */
virt_mb();
 }
 
@@ -301,9 +289,6 @@ static void hvs_channel_cb(void *ctx)
if (hvs_channel_readable(chan))
sk->sk_data_ready(sk);
 
-   /* See hvs_stream_has_space(): when we reach here, the writable bytes
-* may be already less than HVS_PKT_LEN(HVS_SEND_BUF_SIZE).
-*/
if (hv_get_bytes_to_write(&chan->outbound) > 0)
sk->sk_write_space(sk);
 }
@@ -404,6 +389,13 @@ static void hvs_open_connection(struct vmbus_channel *chan)
set_per_channel_state(chan, conn_from_host ? new : sk);
vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
 
+   /* Set the pending send size to max packet size to always get
+* notifications from the host when there is enough writable space.
+* The host is optimized to send notifications only when the pending
+* size boundary is crossed, and not always.
+*/
+   hvs_set_channel_pending_send_size(chan);
+
if (conn_from_host) {
new->sk_state = TCP_ESTABLISHED;
sk->sk_ack_backlog++;
@@ -697,23 +689,8 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
 static s64 hvs_stream_has_space(struct vsock_sock *vsk)
 {
struct hvsock *hvs = vsk->trans;
-   struct vmbus_channel *chan = hvs->chan;
-   s64 ret;
-
-   ret = hvs_channel_writable_bytes(chan);
-   if (ret > 0)  {
-   hvs_clear_channel_pending_send_size(chan);
-   } else {
-   /* See hvs_channel_cb() */
-   hvs_set_channel_pending_send_size(chan);
-
-   /* Re-check the writable bytes to avoid race */
-   ret = hvs_channel_writable_bytes(chan);
-   if (ret > 0)
-   hvs_clear_channel_pending_send_size(chan);
-   }
 
-   return ret;
+   return hvs_channel_writable_bytes(hvs->chan);
 }
 
 static u64 hvs_stream_rcvhiwat(struct vsock_sock *vsk)
-- 
2.7.4


RE: [PATCH net] hv_sock: Suppress bogus "may be used uninitialized" warnings

2019-06-15 Thread Sunil Muthuswamy


> -Original Message-
> From: linux-hyperv-ow...@vger.kernel.org  
> On Behalf Of Dexuan Cui
> Sent: Friday, June 14, 2019 10:01 PM
> To: net...@vger.kernel.org; da...@davemloft.net; Michael Kelley 
> 
> Cc: linux-hyp...@vger.kernel.org; linux-kernel@vger.kernel.org; KY Srinivasan 
> ; Stephen Hemminger
> ; Haiyang Zhang ; Sasha Levin 
> ;
> o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; vkuznets 
> ; marcelo.ce...@canonical.com;
> Dexuan Cui 
> Subject: [PATCH net] hv_sock: Suppress bogus "may be used uninitialized" 
> warnings
> 
> gcc 8.2.0 may report these bogus warnings under some condition:
> 
> warning: ‘vnew’ may be used uninitialized in this function
> warning: ‘hvs_new’ may be used uninitialized in this function
> 
> Actually, the 2 pointers are only initialized and used if the variable
> "conn_from_host" is true. The code is not buggy here.
> 
> Signed-off-by: Dexuan Cui 
> ---
>  net/vmw_vsock/hyperv_transport.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index 8d1ea9eda8a2..cd3f47f54fa7 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -329,8 +329,8 @@ static void hvs_open_connection(struct vmbus_channel *chan)
> 
>   struct sockaddr_vm addr;
>   struct sock *sk, *new = NULL;
> - struct vsock_sock *vnew;
> - struct hvsock *hvs, *hvs_new;
> + struct vsock_sock *vnew = NULL;
> + struct hvsock *hvs, *hvs_new = NULL;
>   int ret;
> 
These are all already fixed under
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=ac383f58f3c98de37fa67452acc5bd677396e9f3
It's just that that commit hasn't been merged into the 'net' branch yet.
>   if_type = &chan->offermsg.offer.if_type;
> --
> 2.19.1



RE: [PATCH net] hvsock: fix epollout hang from race condition

2019-06-15 Thread Sunil Muthuswamy



> -Original Message-
> From: Dexuan Cui 
> Sent: Friday, June 14, 2019 10:04 PM
> To: David Miller ; Sunil Muthuswamy 
> 
> Cc: KY Srinivasan ; Haiyang Zhang 
> ; Stephen Hemminger
> ; sas...@kernel.org; Michael Kelley 
> ; net...@vger.kernel.org; linux-
> hyp...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH net] hvsock: fix epollout hang from race condition
> 
> > From: linux-hyperv-ow...@vger.kernel.org
> >  On Behalf Of Dexuan Cui
> > Sent: Friday, June 14, 2019 8:23 PM
> > To: David Miller ; Sunil Muthuswamy
> > 
> > Cc: KY Srinivasan ; Haiyang Zhang
> > ; Stephen Hemminger
> > ; sas...@kernel.org; Michael Kelley
> > ; net...@vger.kernel.org;
> > linux-hyp...@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: RE: [PATCH net] hvsock: fix epollout hang from race condition
> >
> > > From: linux-hyperv-ow...@vger.kernel.org
> > >  On Behalf Of David Miller
> > > Sent: Friday, June 14, 2019 7:15 PM
> > > To: Sunil Muthuswamy 
> > >
> > > This adds lots of new warnings:
> > >
> > > net/vmw_vsock/hyperv_transport.c: In function ‘hvs_probe’:
> > > net/vmw_vsock/hyperv_transport.c:205:20: warning: ‘vnew’ may be used
> > > uninitialized in this function [-Wmaybe-uninitialized]
> > >remote->svm_port = host_ephemeral_port++;
> > >~^~~
> > > net/vmw_vsock/hyperv_transport.c:332:21: note: ‘vnew’ was declared
> > here
> > >   struct vsock_sock *vnew;
> > >  ^~~~
> > > net/vmw_vsock/hyperv_transport.c:406:22: warning: ‘hvs_new’ may be
> > > used uninitialized in this function [-Wmaybe-uninitialized]
> > >hvs_new->vm_srv_id = *if_type;
> > >~~~^~
> > > net/vmw_vsock/hyperv_transport.c:333:23: note: ‘hvs_new’ was declared
> > > here
> > >   struct hvsock *hvs, *hvs_new;
> > >^~~
> >
> > Hi David,
> > These warnings are not introduced by this patch from Sunil.
> 
> Well, technically speaking, the warnings are caused by Sunil's patch, but I
> think it should be a bug of gcc (I'm using "gcc (Ubuntu 8.2.0-12ubuntu1)
> 8.2.0"). As you can see, the line numbers given by gcc, i.e. line 205, line
> 406, are not related to the warnings.
> 
> Actually, the same warnings can repro with the below one-line patch on this 
> commit of
> today's net.git:
> 9a33629ba6b2 ("hv_netvsc: Set probe mode to sync")
> 
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -403,6 +403,7 @@ static void hvs_open_connection(struct vmbus_channel 
> *chan)
> 
> set_per_channel_state(chan, conn_from_host ? new : sk);
> vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
> +   asm ("nop");
> 
> if (conn_from_host) {
> new->sk_state = TCP_ESTABLISHED;
> 
> It looks like simple inline assembly code can confuse gcc. I'm not sure if I
> should report a bug for gcc...
> 
> I posted a patch to suppress these bogus warnings just now. The Subject is:
> 
> [PATCH net] hv_sock: Suppress bogus "may be used uninitialized" warnings
> 
> Thanks,
> -- Dexuan

Yes, these warnings are not specific to this patch. Additionally, these should
already be addressed in the commit
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=ac383f58f3c98de37fa67452acc5bd677396e9f3
I was trying to avoid making the same changes here, to avoid merge conflicts
when 'net-next' merges with 'net' next.