Re: [PATCH 2/2] riscv: Fix text patching when IPI are used
On Wed, Feb 28, 2024 at 06:51:49PM +0100, Alexandre Ghiti wrote:
> For now, we use stop_machine() to patch the text and when we use IPIs for
> remote icache flushes (which is emitted in patch_text_nosync()), the system
> hangs.
>
> So instead, make sure every cpu executes the stop_machine() patching
> function and emit a local icache flush there.
>
> Co-developed-by: Björn Töpel
> Signed-off-by: Björn Töpel
> Signed-off-by: Alexandre Ghiti

Modulo the removal of the hunk discussed with Samuel,

Reviewed-by: Andrea Parri

Some nits / amendments to the inline comments below:

> +	/*
> +	 * Make sure the patching store is effective *before* we
> +	 * increment the counter which releases all waiting cpus
> +	 * by using the release version of atomic increment.
> +	 */

s/cpus/CPUs
s/release version/release variant

The comment could be amended with a description of the matching
barrier(s), say, "The release pairs with the call to
local_flush_icache_all() on the waiting CPU".  (Same for the comment in
patch_text_cb().)

  Andrea
Re: [PATCH 1/2] riscv: Remove superfluous smp_mb()
On Wed, Feb 28, 2024 at 06:51:48PM +0100, Alexandre Ghiti wrote:
> This memory barrier is not needed and not documented so simply remove
> it.
>
> Suggested-by: Andrea Parri
> Signed-off-by: Alexandre Ghiti

Reviewed-by: Andrea Parri

  Andrea
Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs
> I did not even think of that, and it actually makes sense so I'll go
> with what you propose: I'll replace atomic_inc() with
> atomic_inc_return_release().  And I'll add the following comment if
> that's ok with you:
>
> "Make sure the patching store is effective *before* we increment the
> counter which releases all waiting cpus"

Yes, this sounds good to me.

> Honestly, I looked at it one minute, did not understand its purpose
> and said to myself "ok that can't hurt anyway, I may be missing
> something".
>
> FWIW, I see that arm64 uses isb() here.  If you don't see its purpose,
> I'll remove it (here and where I copied it).

Removing the smp_mb() (and keeping the local_flush_icache_all()) seems
fine to me; thanks for the confirmation.

> > On a last topic, although somewhat orthogonal to the scope of this
> > patch, I'm not sure the patch_{map,unmap}() dance in our
> > patch_insn_write() is correct: I can see why we may want (need to do)
> > the local TLB flush before returning from patch_{map,unmap}(), but
> > does a local flush suffice?  For comparison, arm64 seems to go through
> > a complete dsb-tlbi-dsb(-isb) sequence in their unmapping stage (and
> > apparently relying on "no caching of invalid ptes" in their mapping
> > stage).  Of course, "broadcasting" our (riscv's) TLB invalidations
> > will necessarily introduce some complexity...
> >
> > Thoughts?
>
> To avoid remote TLBI, could we simply disable preemption before the
> first patch_map()?  arm64 disables the irqs, but that seems overkill
> to me, but maybe I'm missing something again?

Mmh, I'm afraid this will require more thinking/probing on my end (not
really "the expert" of the codebase at stake...).  Maybe the ftrace
reviewers will provide further ideas/suggestions for us to brainstorm.

  Andrea
Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs
> +static int __ftrace_modify_code(void *data)
> +{
> +	struct ftrace_modify_param *param = data;
> +
> +	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> +		ftrace_modify_all_code(param->command);
> +		atomic_inc(&param->cpu_count);

I stared at ftrace_modify_all_code() for a bit, but honestly I don't
see what prevents the ->cpu_count increment from being reordered before
the insn write(s) (architecturally) now that you have removed the IPI
dance: perhaps add an smp_wmb() right before the atomic_inc() (or
promote the latter to a (void)atomic_inc_return_release()) and/or an
inline comment saying why such reordering is not possible?

> +	} else {
> +		while (atomic_read(&param->cpu_count) <= num_online_cpus())
> +			cpu_relax();
> +		smp_mb();

I see that you've lifted/copied the memory barrier from patch_text_cb():
what's its point?  AFAIU, the barrier has no ordering effect on
program-order-later insn fetches; perhaps the code was based on some
legacy/old version of Zifencei?  IAC, comments, comments, ...  or maybe
just remove that memory barrier?

> +	}
> +
> +	local_flush_icache_all();
> +
> +	return 0;
> +}

[...]

> @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
>  	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>  		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
>  			len = GET_INSN_LENGTH(patch->insns[i]);
> -			ret = patch_text_nosync(patch->addr + i * len,
> -						&patch->insns[i], len);
> +			ret = patch_insn_write(patch->addr + i * len,
> +					       &patch->insns[i], len);
>  		}
>  		atomic_inc(&patch->cpu_count);
>  	} else {
> @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
>  		smp_mb();
>  	}
>
> +	local_flush_icache_all();
> +
>  	return ret;
>  }
>  NOKPROBE_SYMBOL(patch_text_cb);

My above remarks/questions also apply to this function.
On a last topic, although somewhat orthogonal to the scope of this
patch, I'm not sure the patch_{map,unmap}() dance in our
patch_insn_write() is correct: I can see why we may want (need to do)
the local TLB flush before returning from patch_{map,unmap}(), but does
a local flush suffice?  For comparison, arm64 seems to go through a
complete dsb-tlbi-dsb(-isb) sequence in their unmapping stage (and
apparently relying on "no caching of invalid ptes" in their mapping
stage).  Of course, "broadcasting" our (riscv's) TLB invalidations will
necessarily introduce some complexity...

Thoughts?

  Andrea
Re: [GIT PULL] Modules changes for v6.7-rc1
On Thu, Nov 02, 2023 at 08:29:17AM +0100, Andrea Righi wrote:
> On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> > On Wed, 1 Nov 2023 at 21:02, Linus Torvalds wrote:
> > >
> > > kmalloc() isn't just about "use physically contiguous allocations".
> > > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > > which has to play VM tricks.
> >
> > I've pulled this, but I think you should do something like the
> > attached (UNTESTED!) patch.
> >
> >                Linus
>
> Looks good to me, I'll give it a try ASAP.
>
> -Andrea

Just tested this both with zstd and gzip module compression, all good.
You can add my:

Tested-by: Andrea Righi

Or, if you need proper paperwork:

--
From: Andrea Righi
Subject: [PATCH] module/decompress: use kvmalloc() consistently

We consistently switched from kmalloc() to vmalloc() in module
decompression to prevent potential memory allocation failures with
large modules; however, vmalloc() is not as memory-efficient and fast
as kmalloc().

Since we don't know in general the size of the workspace required by
the decompression algorithm, it is more reasonable to use kvmalloc()
consistently, also considering that we don't have special memory
requirements here.
Signed-off-by: Linus Torvalds
Tested-by: Andrea Righi
Signed-off-by: Andrea Righi
---
 kernel/module/decompress.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
index 4156d59be440..474e68f0f063 100644
--- a/kernel/module/decompress.c
+++ b/kernel/module/decompress.c
@@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 	s.next_in = buf + gzip_hdr_len;
 	s.avail_in = size - gzip_hdr_len;
 
-	s.workspace = vmalloc(zlib_inflate_workspacesize());
+	s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
 	if (!s.workspace)
 		return -ENOMEM;
 
@@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 out_inflate_end:
 	zlib_inflateEnd(&s);
 out:
-	vfree(s.workspace);
+	kvfree(s.workspace);
 	return retval;
 }
 #elif defined(CONFIG_MODULE_COMPRESS_XZ)
@@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	}
 
 	wksp_size = zstd_dstream_workspace_bound(header.windowSize);
-	wksp = vmalloc(wksp_size);
+	wksp = kvmalloc(wksp_size, GFP_KERNEL);
 	if (!wksp) {
 		retval = -ENOMEM;
 		goto out;
 	}
@@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	retval = new_size;
 
 out:
-	vfree(wksp);
+	kvfree(wksp);
 	return retval;
 }
 #else
-- 
2.40.1
Re: [GIT PULL] Modules changes for v6.7-rc1
On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 21:02, Linus Torvalds wrote:
> >
> > kmalloc() isn't just about "use physically contiguous allocations".
> > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > which has to play VM tricks.
>
> I've pulled this, but I think you should do something like the
> attached (UNTESTED!) patch.
>
>                Linus

Looks good to me, I'll give it a try ASAP.

-Andrea

>  kernel/module/decompress.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
> index 4156d59be440..474e68f0f063 100644
> --- a/kernel/module/decompress.c
> +++ b/kernel/module/decompress.c
> @@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>  	s.next_in = buf + gzip_hdr_len;
>  	s.avail_in = size - gzip_hdr_len;
>
> -	s.workspace = vmalloc(zlib_inflate_workspacesize());
> +	s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
>  	if (!s.workspace)
>  		return -ENOMEM;
>
> @@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>  out_inflate_end:
>  	zlib_inflateEnd(&s);
>  out:
> -	vfree(s.workspace);
> +	kvfree(s.workspace);
>  	return retval;
>  }
>  #elif defined(CONFIG_MODULE_COMPRESS_XZ)
> @@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>  	}
>
>  	wksp_size = zstd_dstream_workspace_bound(header.windowSize);
> -	wksp = vmalloc(wksp_size);
> +	wksp = kvmalloc(wksp_size, GFP_KERNEL);
>  	if (!wksp) {
>  		retval = -ENOMEM;
>  		goto out;
>  	}
> @@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>  	retval = new_size;
>
>  out:
> -	vfree(wksp);
> +	kvfree(wksp);
>  	return retval;
>  }
>  #else
Re: [GIT PULL] Modules changes for v6.7-rc1
On Wed, Nov 01, 2023 at 09:02:51PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 10:13, Luis Chamberlain wrote:
> >
> > The only thing worth highlighting is that gzip moves to use vmalloc()
> > instead of kmalloc, just as we had a fix for this for zstd on v6.6-rc1.
>
> Actually, that's almost certainly entirely the wrong thing to do.
>
> Unless you *know* that the allocation is large, you shouldn't use
> vmalloc().  And since kmalloc() has worked fine, you most definitely
> don't know that.
>
> So we have 'kvmalloc()' *exactly* for this reason, which is a "use
> kmalloc, unless that is too small, then use vmalloc".
>
> kmalloc() isn't just about "use physically contiguous allocations".
> It's also more memory-efficient, and a *lot* faster than vmalloc(),
> which has to play VM tricks.
>
> So this "just switch to vmalloc()" is entirely wrong.
>
>                Linus

I proposed that change mostly for consistency with the zstd case, but I
haven't experienced any issue with gzip-compressed modules (which seem
to require less memory, even with larger modules).  So, yes, it
probably makes sense to drop this change for now, and I can send
another patch to switch to kvmalloc() for all the decompress cases.

Thanks,
-Andrea
[PATCH v2] Drivers: hv: vmbus: Initialize unload_event statically
If a malicious or compromised Hyper-V sends a spurious message of type
CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
call complete() on an uninitialized event, and cause an oops.

Reported-by: Michael Kelley
Signed-off-by: Andrea Parri (Microsoft)
---
Changes since v1[1]:
  - add inline comment in vmbus_unload_response()

[1] https://lkml.kernel.org/r/20210416143932.16512-1-parri.and...@gmail.com

 drivers/hv/channel_mgmt.c | 7 ++++++-
 drivers/hv/connection.c   | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 4c9e45d1f462c..335a10ee03a5e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -826,6 +826,11 @@ static void vmbus_unload_response(struct vmbus_channel_message_header *hdr)
 	/*
 	 * This is a global event; just wakeup the waiting thread.
 	 * Once we successfully unload, we can cleanup the monitor state.
+	 *
+	 * NB. A malicious or compromised Hyper-V could send a spurious
+	 * message of type CHANNELMSG_UNLOAD_RESPONSE, and trigger a call
+	 * of the complete() below.  Make sure that unload_event has been
+	 * initialized by the time this complete() is executed.
 	 */
 	complete(&vmbus_connection.unload_event);
 }
@@ -841,7 +846,7 @@ void vmbus_initiate_unload(bool crash)
 	if (vmbus_proto_version < VERSION_WIN8_1)
 		return;
 
-	init_completion(&vmbus_connection.unload_event);
+	reinit_completion(&vmbus_connection.unload_event);
 	memset(&hdr, 0, sizeof(struct vmbus_channel_message_header));
 	hdr.msgtype = CHANNELMSG_UNLOAD;
 	vmbus_post_msg(&hdr, sizeof(struct vmbus_channel_message_header),
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index dc19d5ae4373c..311cd005b3be6 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -26,6 +26,8 @@
 
 struct vmbus_connection vmbus_connection = {
 	.conn_state		= DISCONNECTED,
+	.unload_event		= COMPLETION_INITIALIZER(
+				  vmbus_connection.unload_event),
 	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
 
 	.ready_for_suspend_event = COMPLETION_INITIALIZER(
-- 
2.25.1
Re: [PATCH] Drivers: hv: vmbus: Initialize unload_event statically
On Fri, Apr 16, 2021 at 03:25:03PM +0000, Michael Kelley wrote:
> From: Andrea Parri (Microsoft)  Sent: Friday, April 16, 2021 7:40 AM
> >
> > If a malicious or compromised Hyper-V sends a spurious message of type
> > CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
> > call complete() on an uninitialized event, and cause an oops.
>
> Please leave a comment somewhere in the code itself that describes this
> scenario so that somebody in the future doesn't decide it's OK to
> "simplify" the initialization of unload_event.  :-)

Yes, will do for v2.

Thanks,
  Andrea
[PATCH] Drivers: hv: vmbus: Initialize unload_event statically
If a malicious or compromised Hyper-V sends a spurious message of type
CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
call complete() on an uninitialized event, and cause an oops.

Reported-by: Michael Kelley
Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/channel_mgmt.c | 2 +-
 drivers/hv/connection.c   | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index f3cf4af01e102..1efb616480a64 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -841,7 +841,7 @@ void vmbus_initiate_unload(bool crash)
 	if (vmbus_proto_version < VERSION_WIN8_1)
 		return;
 
-	init_completion(&vmbus_connection.unload_event);
+	reinit_completion(&vmbus_connection.unload_event);
 	memset(&hdr, 0, sizeof(struct vmbus_channel_message_header));
 	hdr.msgtype = CHANNELMSG_UNLOAD;
 	vmbus_post_msg(&hdr, sizeof(struct vmbus_channel_message_header),
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 350e8c5cafa8c..529dcc47f3e11 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -26,6 +26,8 @@
 
 struct vmbus_connection vmbus_connection = {
 	.conn_state		= DISCONNECTED,
+	.unload_event		= COMPLETION_INITIALIZER(
+				  vmbus_connection.unload_event),
 	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
 
 	.ready_for_suspend_event = COMPLETION_INITIALIZER(
-- 
2.25.1
[PATCH v3 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
Check that enough time has passed such that the modify channel message has been processed before taking a CPU offline. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/hv.c | 56 ++--- 1 file changed, 53 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 3e6ff83adff42..e0c522d143a37 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -292,12 +293,50 @@ void hv_synic_disable_regs(unsigned int cpu) disable_percpu_irq(vmbus_irq); } +#define HV_MAX_TRIES 3 +/* + * Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one + * bit set, then wait for a few milliseconds. Repeat these steps for a maximum of 3 times. + * Return 'true', if there is still any set bit after this operation; 'false', otherwise. + * + * If a bit is set, that means there is a pending channel interrupt. The expectation is + * that the normal interrupt handling mechanism will find and process the channel interrupt + * "very soon", and in the process clear the bit. 
+ */ +static bool hv_synic_event_pending(void) +{ + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); + union hv_synic_event_flags *event = + (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT; + unsigned long *recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ + bool pending; + u32 relid; + int tries = 0; + +retry: + pending = false; + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { + /* Special case - VMBus channel protocol messages */ + if (relid == 0) + continue; + pending = true; + break; + } + if (pending && tries++ < HV_MAX_TRIES) { + usleep_range(1, 2); + goto retry; + } + return pending; +} int hv_synic_cleanup(unsigned int cpu) { struct vmbus_channel *channel, *sc; bool channel_found = false; + if (vmbus_connection.conn_state != CONNECTED) + goto always_cleanup; + /* * Hyper-V does not provide a way to change the connect CPU once * it is set; we must prevent the connect CPU from going offline @@ -305,8 +344,7 @@ int hv_synic_cleanup(unsigned int cpu) * path where the vmbus is already disconnected, the CPU must be * allowed to shut down. */ - if (cpu == VMBUS_CONNECT_CPU && - vmbus_connection.conn_state == CONNECTED) + if (cpu == VMBUS_CONNECT_CPU) return -EBUSY; /* @@ -333,9 +371,21 @@ int hv_synic_cleanup(unsigned int cpu) } mutex_unlock(_connection.channel_mutex); - if (channel_found && vmbus_connection.conn_state == CONNECTED) + if (channel_found) + return -EBUSY; + + /* +* channel_found == false means that any channels that were previously +* assigned to the CPU have been reassigned elsewhere with a call of +* vmbus_send_modifychannel(). Scan the event flags page looking for +* bits that are set and waiting with a timeout for vmbus_chan_sched() +* to process such bits. If bits are still set after this operation +* and VMBus is connected, fail the CPU offlining operation. 
+*/ + if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending()) return -EBUSY; +always_cleanup: hv_stimer_legacy_cleanup(cpu); hv_synic_disable_regs(cpu); -- 2.25.1
[PATCH v3 2/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE message type, and code to receive and process such a message. Signed-off-by: Andrea Parri (Microsoft) Reviewed-by: Michael Kelley --- drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 11 - 5 files changed, 152 insertions(+), 19 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..aa4ef75d8dee2 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -209,31 +209,96 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id, } EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request); +static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel msg; + int ret; + + memset(, 0, sizeof(msg)); + msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg.child_relid = channel->offermsg.child_relid; + msg.target_vp = target_vp; + + ret = vmbus_post_msg(, sizeof(msg), true); + trace_vmbus_send_modifychannel(, ret); + + return ret; +} + +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel *msg; + struct vmbus_channel_msginfo *info; + unsigned long flags; + int ret; + + info = kzalloc(sizeof(struct vmbus_channel_msginfo) + + sizeof(struct vmbus_channel_modifychannel), + GFP_KERNEL); + if (!info) + return -ENOMEM; + + init_completion(>waitevent); + info->waiting_channel = channel; + + msg = (struct vmbus_channel_modifychannel *)info->msg; + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg->child_relid = channel->offermsg.child_relid; + msg->target_vp = target_vp; + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_add_tail(>msglistentry, _connection.chn_msg_list); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + ret = vmbus_post_msg(msg, sizeof(*msg), true); + trace_vmbus_send_modifychannel(msg, ret); + if (ret != 0) { 
+ spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + goto free_info; + } + + /* +* Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on +* the mutex and be unable to signal the completion. +* +* See the caller target_cpu_store() for information about the usage of the +* mutex. +*/ + mutex_unlock(_connection.channel_mutex); + wait_for_completion(>waitevent); + mutex_lock(_connection.channel_mutex); + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + if (info->response.modify_response.status) + ret = -EAGAIN; + +free_info: + kfree(info); + return ret; +} + /* * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt. * - * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not - * ACK such messages. IOW we can't know when the host will stop interrupting - * the "old" vCPU and start interrupting the "new" vCPU for the given channel. + * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. When VMbus version 5.3 + * or later is negotiated, Hyper-V always sends an ACK in response to such a + * message. For VMbus version 5.2 and earlier, it never sends an ACK. With- + * out an ACK, we can not know when the host will stop interrupting the "old" + * vCPU and start interrupting the "new" vCPU for the given channel. * * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version * VERSION_WIN10_V4_1. 
*/ -int vmbus_send_modifychannel(u32 child_relid, u32 target_vp) +int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp) { - struct vmbus_channel_modifychannel conn_msg; - int ret; - - memset(_msg, 0, sizeof(conn_msg)); - conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; - conn_msg.child_relid = child_relid; - conn_msg.target_vp = target_vp; - - ret = vmbus_post_msg(_msg, sizeof(conn_msg), true); - - trace_vmbus_send_modifychannel(_msg, ret); - - return ret; + if (vmbus_proto_version >= VERSION_WIN10_V5_3) + return send_modifychannel_with_ack(channel, target_vp); + return send_modifychannel_without_ack(channel, target_vp); } EXPORT_SYMBOL_GPL(vmbus_send_modifychannel); diff --git a/drivers/hv/ch
[PATCH v3 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
Hyper-V has added VMBus protocol version 5.3.  Allow Linux guests to
negotiate the new version on versions of Hyper-V that support it.

Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/connection.c | 3 ++-
 include/linux/hyperv.h  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 350e8c5cafa8c..dc19d5ae4373c 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -45,6 +45,7 @@ EXPORT_SYMBOL_GPL(vmbus_proto_version);
  * Table of VMBus versions listed from newest to oldest.
  */
 static __u32 vmbus_versions[] = {
+	VERSION_WIN10_V5_3,
 	VERSION_WIN10_V5_2,
 	VERSION_WIN10_V5_1,
 	VERSION_WIN10_V5,
@@ -60,7 +61,7 @@ static __u32 vmbus_versions[] = {
  * Maximal VMBus protocol version guests can negotiate.  Useful to cap the
  * VMBus version for testing and debugging purpose.
  */
-static uint max_version = VERSION_WIN10_V5_2;
+static uint max_version = VERSION_WIN10_V5_3;
 
 module_param(max_version, uint, S_IRUGO);
 MODULE_PARM_DESC(max_version,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2c18c8e768efe..3ce36bbb398e9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent(
  *     5 . 0  (Newer Windows 10)
  *     5 . 1  (Windows 10 RS4)
  *     5 . 2  (Windows Server 2019, RS5)
+ *     5 . 3  (Windows Server 2022)
  */
 
 #define VERSION_WS2008  ((0 << 16) | (13))
@@ -245,6 +246,7 @@ static inline u32 hv_get_avail_to_write_percent(
 #define VERSION_WIN10_V5    ((5 << 16) | (0))
 #define VERSION_WIN10_V5_1  ((5 << 16) | (1))
 #define VERSION_WIN10_V5_2  ((5 << 16) | (2))
+#define VERSION_WIN10_V5_3  ((5 << 16) | (3))
 
 /* Make maximum size of pipe payload of 16K */
 #define MAX_PIPE_DATA_PAYLOAD (sizeof(u8) * 16384)
-- 
2.25.1
[PATCH v3 0/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Changes since v2[1]:
  - fix VMbus protocol version name
  - add Reviewed-by: tag
  - refactor/simplify changes in hv_synic_cleanup()

Changes since v1[2]:
  - rebase on hyperv-next
  - split changes into three patches
  - fix error handling in send_modifychannel_with_ack()
  - remove rescind checks in send_modifychannel_with_ack()
  - remove unneeded test in hv_synic_event_pending()
  - add/amend inline comments
  - style changes

[1] https://lkml.kernel.org/r/20210414150118.2843-1-parri.and...@gmail.com
[2] https://lkml.kernel.org/r/20201126191210.13115-1-parri.and...@gmail.com

Andrea Parri (Microsoft) (3):
  Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
  Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
  Drivers: hv: vmbus: Check for pending channel interrupts before
    taking a CPU offline

 drivers/hv/channel.c      | 99
 drivers/hv/channel_mgmt.c | 42
 drivers/hv/connection.c   |  3
 drivers/hv/hv.c           | 56
 drivers/hv/hv_trace.h     | 15
 drivers/hv/vmbus_drv.c    |  4
 include/linux/hyperv.h    | 13
 7 files changed, 209 insertions(+), 23 deletions(-)

-- 
2.25.1
[RFC net-next v2] seg6: add counters support for SRv6 Behaviors
dev 2956,00 pps As can be observed, throughputs achieved in scenarios (2),(3) did not suffer any observable degradation compared to scenario (1). Comments, suggestions and improvements are very welcome! Thanks, Andrea v2: - improve comments; - guarantee alignment of 64 bit values, thanks to Jakub Kicinski; - pass counters within netlink attributes rather than passing a whole single structure. This is to address compatibility issues that might arise from having different versions of kernel/iproute2. Thanks to David Ahern for bringing attention to this. - include cover letter in the commit message, thanks to David Ahern. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer --- include/uapi/linux/seg6_local.h | 30 + net/ipv6/seg6_local.c | 198 +++- 2 files changed, 226 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h index 3b39ef1dbb46..5ae3ace84de0 100644 --- a/include/uapi/linux/seg6_local.h +++ b/include/uapi/linux/seg6_local.h @@ -27,6 +27,7 @@ enum { SEG6_LOCAL_OIF, SEG6_LOCAL_BPF, SEG6_LOCAL_VRFTABLE, + SEG6_LOCAL_COUNTERS, __SEG6_LOCAL_MAX, }; #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) @@ -78,4 +79,33 @@ enum { #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) +/* SRv6 Behavior counters are encoded as netlink attributes guaranteeing the + * correct alignment. + * Each counter is identified by a different attribute type (i.e. + * SEG6_LOCAL_CNT_PACKETS). + * + * - SEG6_LOCAL_CNT_PACKETS: identifies a counter that counts the number of + * packets that have been CORRECTLY processed by an SRv6 Behavior instance + * (i.e., packets that generate errors or are dropped are NOT counted). + * + * - SEG6_LOCAL_CNT_BYTES: identifies a counter that counts the total amount + * of traffic in bytes of all packets that have been CORRECTLY processed by + * an SRv6 Behavior instance (i.e., packets that generate errors or are + * dropped are NOT counted). 
+ * + * - SEG6_LOCAL_CNT_ERRORS: identifies a counter that counts the number of + * packets that have NOT been properly processed by an SRv6 Behavior instance + * (i.e., packets that generate errors or are dropped). + */ +enum { + SEG6_LOCAL_CNT_UNSPEC, + SEG6_LOCAL_CNT_PAD, /* pad for 64 bits values */ + SEG6_LOCAL_CNT_PACKETS, + SEG6_LOCAL_CNT_BYTES, + SEG6_LOCAL_CNT_ERRORS, + __SEG6_LOCAL_CNT_MAX, +}; + +#define SEG6_LOCAL_CNT_MAX (__SEG6_LOCAL_CNT_MAX - 1) + #endif diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index bd7140885e60..3e627cf9ce2a 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -93,6 +93,35 @@ struct seg6_end_dt_info { int hdrlen; }; +struct pcpu_seg6_local_counters { + u64_stats_t packets; + u64_stats_t bytes; + u64_stats_t errors; + + struct u64_stats_sync syncp; +}; + +/* This struct groups all the SRv6 Behavior counters supported so far. + * + * put_nla_counters() makes use of this data structure to collect all counter + * values after the per-CPU counter evaluation has been performed. + * Finally, each counter value (in seg6_local_counters) is stored in the + * corresponding netlink attribute and sent to user space. + * + * NB: we don't want to expose this structure to user space! 
+ */ +struct seg6_local_counters { + __u64 packets; + __u64 bytes; + __u64 errors; +}; + +#define seg6_local_alloc_pcpu_counters(__gfp) \ + __netdev_alloc_pcpu_stats(struct pcpu_seg6_local_counters, \ + ((__gfp) | __GFP_ZERO)) + +#define SEG6_F_LOCAL_COUNTERS SEG6_F_ATTR(SEG6_LOCAL_COUNTERS) + struct seg6_local_lwt { int action; struct ipv6_sr_hdr *srh; @@ -105,6 +134,7 @@ struct seg6_local_lwt { #ifdef CONFIG_NET_L3_MASTER_DEV struct seg6_end_dt_info dt_info; #endif + struct pcpu_seg6_local_counters __percpu *pcpu_counters; int headroom; struct seg6_action_desc *desc; @@ -878,36 +908,43 @@ static struct seg6_action_desc seg6_action_table[] = { { .action = SEG6_LOCAL_ACTION_END, .attrs = 0, + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end, }, { .action = SEG6_LOCAL_ACTION_END_X, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_t, }, { .action = SE
Re: [PATCH v2 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
> > @@ -336,6 +372,19 @@ int hv_synic_cleanup(unsigned int cpu)
> >  	if (channel_found && vmbus_connection.conn_state == CONNECTED)
> >  		return -EBUSY;
> >
> > +	if (vmbus_proto_version >= VERSION_WIN10_V4_1) {
> > +		/*
> > +		 * channel_found == false means that any channels that were
> > +		 * previously assigned to the CPU have been reassigned
> > +		 * elsewhere with a call of vmbus_send_modifychannel().
> > +		 * Scan the event flags page looking for bits that are set
> > +		 * and waiting with a timeout for vmbus_chan_sched() to
> > +		 * process such bits.  If bits are still set after this
> > +		 * operation and VMBus is connected, fail the CPU offlining
> > +		 * operation.
> > +		 */
> > +		if (hv_synic_event_pending() && vmbus_connection.conn_state == CONNECTED)
> > +			return -EBUSY;
> > +	}
> > +
>
> Perhaps the test for conn_state == CONNECTED could be factored out as
> follows.  If we're not CONNECTED (i.e., in the panic or kexec path), we
> should be able to also skip the search for channels that are bound to
> the CPU, since we will ignore the result anyway.
>
> 	if (vmbus_connection.conn_state != CONNECTED)
> 		goto always_cleanup;
>
> 	if (cpu == VMBUS_CONNECT_CPU)
> 		return -EBUSY;
>
> 	[Code to search for channels that are bound to the CPU we're
> 	 about to clean up]
>
> 	if (channel_found)
> 		return -EBUSY;
>
> 	/*
> 	 * channel_found == false means that any channels that were previously
> 	 * assigned to the CPU have been reassigned elsewhere with a call of
> 	 * vmbus_send_modifychannel().  Scan the event flags page looking for
> 	 * bits that are set and waiting with a timeout for vmbus_chan_sched()
> 	 * to process such bits.  If bits are still set after this operation
> 	 * and VMBus is connected, fail the CPU offlining operation.
> 	 */
> 	if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending())
> 		return -EBUSY;
>
> always_cleanup:

Agreed, applied.  Thank you for the suggestion,

  Andrea
Re: [PATCH v2 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
> > @@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent( > > * 5 . 0 (Newer Windows 10) > > * 5 . 1 (Windows 10 RS4) > > * 5 . 2 (Windows Server 2019, RS5) > > + * 5 . 3 (Windows Server 2021) // FIXME: use proper version number/name > > The official name is now public information as "Windows Server 2022". Thank you, I've updated the name and removed the FIXME. Andrea
[PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding all issues with allocating enough entries in the VMbus requestor. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) --- Changes since RFC: - pass sentinel values for {init,reset}_request in vmbus_sendpacket() - remove/inline the storvsc_request_addr() callback - make storvsc_next_request_id() 'static' - add code to handle the case of an unsolicited message from Hyper-V - other minor/style changes [1] https://lkml.kernel.org/r/20210408161315.341888-1-parri.and...@gmail.com drivers/hv/channel.c | 14 ++--- drivers/hv/ring_buffer.c | 13 +++-- drivers/net/hyperv/netvsc.c | 8 ++- drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 94 +-- include/linux/hyperv.h| 13 - 6 files changed, 95 insertions(+), 49 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..f78e02ace51e8 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); * vmbus_next_request_id - Returns a new request id. It is also * the index at which the guest memory address is stored. * Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @rqst_add: Guest memory address to be stored in the array */ -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 current_id; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); /* * vmbus_request_addr - Returns the memory address stored at @trans_id * in @rqstor. Uses a spin lock to avoid race conditions. 
- * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's * next request id. */ -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 req_addr; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..2bf57677272b5 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, */ if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { - rqst_id = vmbus_next_request_id(>requestor, requestid); - if (rqst_id == VMBUS_RQST_ERROR) { - spin_unlock_irqrestore(_info->ring_lock, flags); - return -EAGAIN; + if (channel->next_request_id_callback != NULL) { + rqst_id = channel->next_request_id_callback(channel, requestid); + if (rqst_id == VMBUS_RQST_ERROR) { + spin_unlock_irqrestore(_info->ring_lock, flags); + return -EAGAIN; + } } } desc = hv_get_ring_buffer(outring_info) + old_write; @@ -341,7 +343,8 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, if (channel->rescind) { if (rqst_id != VMBUS_NO_RQSTOR) { /* Reclaim request ID to avoid leak of IDs */ - vmbus_request_addr(>requestor, rqst_id); + if (channel->request_addr_callback != NULL) + channel->request_addr_callback(channel, rqst_id); } return -ENODEV; } diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c64cc7639c39c..1a221ce2d6fdc 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -730,7 +730,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev, int queue_sends; u64 cmd_rqst; - cmd_rqst = 
vmbus_request_addr(&channel->requestor, (u64)desc->trans_id); + cmd_rqst = channel->request_addr_callback(channel, (u64)desc->trans_id); if (cmd_rqst == VMBUS_RQST_ERROR) { netdev_err(ndev, "Incorrect transaction id\n"); return; } @@ -790,
[PATCH v2 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
Check that enough time has passed such that the modify channel message has been processed before taking a CPU offline. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/hv.c | 49 + 1 file changed, 49 insertions(+) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 3e6ff83adff42..dc9aa1130b22f 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -292,6 +293,41 @@ void hv_synic_disable_regs(unsigned int cpu) disable_percpu_irq(vmbus_irq); } +#define HV_MAX_TRIES 3 +/* + * Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one + * bit set, then wait for a few milliseconds. Repeat these steps for a maximum of 3 times. + * Return 'true', if there is still any set bit after this operation; 'false', otherwise. + * + * If a bit is set, that means there is a pending channel interrupt. The expectation is + * that the normal interrupt handling mechanism will find and process the channel interrupt + * "very soon", and in the process clear the bit. 
+ */ +static bool hv_synic_event_pending(void) +{ + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); + union hv_synic_event_flags *event = + (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT; + unsigned long *recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ + bool pending; + u32 relid; + int tries = 0; + +retry: + pending = false; + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { + /* Special case - VMBus channel protocol messages */ + if (relid == 0) + continue; + pending = true; + break; + } + if (pending && tries++ < HV_MAX_TRIES) { + usleep_range(1, 2); + goto retry; + } + return pending; +} int hv_synic_cleanup(unsigned int cpu) { @@ -336,6 +372,19 @@ int hv_synic_cleanup(unsigned int cpu) if (channel_found && vmbus_connection.conn_state == CONNECTED) return -EBUSY; + if (vmbus_proto_version >= VERSION_WIN10_V4_1) { + /* +* channel_found == false means that any channels that were previously +* assigned to the CPU have been reassigned elsewhere with a call of +* vmbus_send_modifychannel(). Scan the event flags page looking for +* bits that are set and waiting with a timeout for vmbus_chan_sched() +* to process such bits. If bits are still set after this operation +* and VMBus is connected, fail the CPU offlining operation. +*/ + if (hv_synic_event_pending() && vmbus_connection.conn_state == CONNECTED) + return -EBUSY; + } + hv_stimer_legacy_cleanup(cpu); hv_synic_disable_regs(cpu); -- 2.25.1
[PATCH v2 2/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE message type, and code to receive and process such a message. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 11 - 5 files changed, 152 insertions(+), 19 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..aa4ef75d8dee2 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -209,31 +209,96 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id, } EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request); +static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel msg; + int ret; + + memset(, 0, sizeof(msg)); + msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg.child_relid = channel->offermsg.child_relid; + msg.target_vp = target_vp; + + ret = vmbus_post_msg(, sizeof(msg), true); + trace_vmbus_send_modifychannel(, ret); + + return ret; +} + +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel *msg; + struct vmbus_channel_msginfo *info; + unsigned long flags; + int ret; + + info = kzalloc(sizeof(struct vmbus_channel_msginfo) + + sizeof(struct vmbus_channel_modifychannel), + GFP_KERNEL); + if (!info) + return -ENOMEM; + + init_completion(>waitevent); + info->waiting_channel = channel; + + msg = (struct vmbus_channel_modifychannel *)info->msg; + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg->child_relid = channel->offermsg.child_relid; + msg->target_vp = target_vp; + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_add_tail(>msglistentry, _connection.chn_msg_list); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + ret = vmbus_post_msg(msg, sizeof(*msg), true); + trace_vmbus_send_modifychannel(msg, ret); + if (ret != 0) { + 
spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + goto free_info; + } + + /* +* Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on +* the mutex and be unable to signal the completion. +* +* See the caller target_cpu_store() for information about the usage of the +* mutex. +*/ + mutex_unlock(_connection.channel_mutex); + wait_for_completion(>waitevent); + mutex_lock(_connection.channel_mutex); + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + if (info->response.modify_response.status) + ret = -EAGAIN; + +free_info: + kfree(info); + return ret; +} + /* * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt. * - * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not - * ACK such messages. IOW we can't know when the host will stop interrupting - * the "old" vCPU and start interrupting the "new" vCPU for the given channel. + * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. When VMbus version 5.3 + * or later is negotiated, Hyper-V always sends an ACK in response to such a + * message. For VMbus version 5.2 and earlier, it never sends an ACK. With- + * out an ACK, we can not know when the host will stop interrupting the "old" + * vCPU and start interrupting the "new" vCPU for the given channel. * * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version * VERSION_WIN10_V4_1. 
*/ -int vmbus_send_modifychannel(u32 child_relid, u32 target_vp) +int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp) { - struct vmbus_channel_modifychannel conn_msg; - int ret; - - memset(_msg, 0, sizeof(conn_msg)); - conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; - conn_msg.child_relid = child_relid; - conn_msg.target_vp = target_vp; - - ret = vmbus_post_msg(_msg, sizeof(conn_msg), true); - - trace_vmbus_send_modifychannel(_msg, ret); - - return ret; + if (vmbus_proto_version >= VERSION_WIN10_V5_3) + return send_modifychannel_with_ack(channel, target_vp); + return send_modifychannel_without_ack(channel, target_vp); } EXPORT_SYMBOL_GPL(vmbus_send_modifychannel); diff --git a/drivers/hv/channel_mgmt.c b/dr
[PATCH v2 0/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Changes since v1[1]: - rebase on hyperv-next - split changes into three patches - fix error handling in send_modifychannel_with_ack() - remove rescind checks in send_modifychannel_with_ack() - remove unneeded test in hv_synic_event_pending() - add/amend inline comments - style changes [1] https://lkml.kernel.org/r/20201126191210.13115-1-parri.and...@gmail.com Andrea Parri (Microsoft) (3): Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3 Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/connection.c | 3 +- drivers/hv/hv.c | 49 +++ drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 13 - 7 files changed, 205 insertions(+), 20 deletions(-) -- 2.25.1
[PATCH v2 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
Hyper-V has added VMBus protocol version 5.3. Allow Linux guests to negotiate the new version on versions of Hyper-V that support it. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 3 ++- include/linux/hyperv.h | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 350e8c5cafa8c..dc19d5ae4373c 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -45,6 +45,7 @@ EXPORT_SYMBOL_GPL(vmbus_proto_version); * Table of VMBus versions listed from newest to oldest. */ static __u32 vmbus_versions[] = { + VERSION_WIN10_V5_3, VERSION_WIN10_V5_2, VERSION_WIN10_V5_1, VERSION_WIN10_V5, @@ -60,7 +61,7 @@ static __u32 vmbus_versions[] = { * Maximal VMBus protocol version guests can negotiate. Useful to cap the * VMBus version for testing and debugging purpose. */ -static uint max_version = VERSION_WIN10_V5_2; +static uint max_version = VERSION_WIN10_V5_3; module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 2c18c8e768efe..d6a6f76040b5f 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent( * 5 . 0 (Newer Windows 10) * 5 . 1 (Windows 10 RS4) * 5 . 2 (Windows Server 2019, RS5) + * 5 . 3 (Windows Server 2021) // FIXME: use proper version number/name */ #define VERSION_WS2008 ((0 << 16) | (13)) @@ -245,6 +246,7 @@ static inline u32 hv_get_avail_to_write_percent( #define VERSION_WIN10_V5 ((5 << 16) | (0)) #define VERSION_WIN10_V5_1 ((5 << 16) | (1)) #define VERSION_WIN10_V5_2 ((5 << 16) | (2)) +#define VERSION_WIN10_V5_3 ((5 << 16) | (3)) /* Make maximum size of pipe payload of 16K */ #define MAX_PIPE_DATA_PAYLOAD (sizeof(u8) * 16384) -- 2.25.1
Re: [PATCH] Drivers: hv: vmbus: Use after free in __vmbus_open()
On Tue, Apr 13, 2021 at 01:50:04PM +0300, Dan Carpenter wrote: > The "open_info" variable is added to the &vmbus_connection.chn_msg_list, > but the error handling frees "open_info" without removing it from the > list. This will result in a use after free. First remove it from the > list, and then free it. > > Fixes: 6f3d791f3006 ("Drivers: hv: vmbus: Fix rescind handling issues") > Signed-off-by: Dan Carpenter I had this 'queued' in my list, Reviewed-by: Andrea Parri Andrea > --- > From static analysis. Untested etc. There is almost certainly a good > reason to add it to the list before checking "newchannel->rescind" but I > don't know the code well enough to know what the reason is. > > drivers/hv/channel.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/hv/channel.c > b/drivers/hv/channel.c > index db30be8f9cce..1c5a418c1962 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -653,7 +653,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > > if (newchannel->rescind) { > err = -ENODEV; > - goto error_free_info; > + goto error_clean_msglist; > } > > err = vmbus_post_msg(open_msg, > -- > 2.30.2 >
Re: [RFC PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
On Fri, Apr 09, 2021 at 03:38:14PM +, Michael Kelley wrote: > From: Andrea Parri (Microsoft) Sent: Thursday, April > 8, 2021 9:13 AM > > > > Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding > > all issues with allocating enough entries in the VMbus requestor. > > This looks good to me! I'm glad to see that the idea worked without > too much complexity. > > See a few comments inline below. Thank you for these suggestions; I've tried to implement them, cf. the diff at the bottom of this email (on top of this RFC, plus 'change the storvsc callbacks to 'static''). I like the result, however this does not work well yet: I am getting 'Incorrect transaction id' messages at boot time with this diff; I'll dig more tomorrow... hints are welcome! Andrea > > > > > Suggested-by: Michael Kelley > > Signed-off-by: Andrea Parri (Microsoft) > > --- > > drivers/hv/channel.c | 14 +++--- > > drivers/hv/ring_buffer.c | 12 ++--- > > drivers/net/hyperv/netvsc.c | 8 ++-- > > drivers/net/hyperv/rndis_filter.c | 2 + > > drivers/scsi/storvsc_drv.c| 73 ++- > > include/linux/hyperv.h| 13 +- > > 6 files changed, 92 insertions(+), 30 deletions(-) > > > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > > index db30be8f9ccea..f78e02ace51e8 100644 > > --- a/drivers/hv/channel.c > > +++ b/drivers/hv/channel.c > > @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); > > * vmbus_next_request_id - Returns a new request id. It is also > > * the index at which the guest memory address is stored. > > * Uses a spin lock to avoid race conditions. 
> > - * @rqstor: Pointer to the requestor struct > > + * @channel: Pointer to the VMbus channel struct > > * @rqst_add: Guest memory address to be stored in the array > > */ > > -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) > > +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) > > { > > + struct vmbus_requestor *rqstor = >requestor; > > unsigned long flags; > > u64 current_id; > > - const struct vmbus_channel *channel = > > - container_of(rqstor, const struct vmbus_channel, requestor); > > > > /* Check rqstor has been initialized */ > > if (!channel->rqstor_size) > > @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); > > /* > > * vmbus_request_addr - Returns the memory address stored at @trans_id > > * in @rqstor. Uses a spin lock to avoid race conditions. > > - * @rqstor: Pointer to the requestor struct > > + * @channel: Pointer to the VMbus channel struct > > * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's > > * next request id. 
> > */ > > -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) > > +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) > > { > > + struct vmbus_requestor *rqstor = >requestor; > > unsigned long flags; > > u64 req_addr; > > - const struct vmbus_channel *channel = > > - container_of(rqstor, const struct vmbus_channel, requestor); > > > > /* Check rqstor has been initialized */ > > if (!channel->rqstor_size) > > diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c > > index ecd82ebfd5bc4..46d8e038e4ee1 100644 > > --- a/drivers/hv/ring_buffer.c > > +++ b/drivers/hv/ring_buffer.c > > @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, > > */ > > > > if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { > > - rqst_id = vmbus_next_request_id(>requestor, requestid); > > - if (rqst_id == VMBUS_RQST_ERROR) { > > - spin_unlock_irqrestore(_info->ring_lock, flags); > > - return -EAGAIN; > > + if (channel->next_request_id_callback != NULL) { > > + rqst_id = channel->next_request_id_callback(channel, > > requestid); > > + if (rqst_id == VMBUS_RQST_ERROR) { > > + > > spin_unlock_irqrestore(_info->ring_lock, flags); > > + return -EAGAIN; > > + } > > } > > } > > desc = hv_get_ring_buffer(outring_info) + old_write; > > @@ -341,7 +343,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, > > if (channel->rescind) { > >
[PATCH net] net: seg6: trivial fix of a spelling mistake in comment
There is a comment spelling mistake "interfarence" -> "interference" in function parse_nla_action(). Fix it. Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 8936f48570fc..bd7140885e60 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -1475,7 +1475,7 @@ static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt) /* Forcing the desc->optattrs *set* and the desc->attrs *set* to be * disjoined, this allow us to release acquired resources by optional * attributes and by required attributes independently from each other -* without any interfarence. +* without any interference. * In other terms, we are sure that we do not release some the acquired * resources twice. * -- 2.20.1
Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
On Wed, 7 Apr 2021 16:55:41 -0600 David Ahern wrote: > On 4/7/21 12:03 PM, Andrea Mayer wrote: > > diff --git a/include/uapi/linux/seg6_local.h > > b/include/uapi/linux/seg6_local.h > > index 3b39ef1dbb46..ae5e3fd12b73 100644 > > --- a/include/uapi/linux/seg6_local.h > > +++ b/include/uapi/linux/seg6_local.h > > @@ -27,6 +27,7 @@ enum { > > SEG6_LOCAL_OIF, > > SEG6_LOCAL_BPF, > > SEG6_LOCAL_VRFTABLE, > > + SEG6_LOCAL_COUNTERS, > > __SEG6_LOCAL_MAX, > > }; > > #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) > > @@ -78,4 +79,11 @@ enum { > > > > #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) > > > > +/* SRv6 Behavior counters */ > > +struct seg6_local_counters { > > + __u64 rx_packets; > > + __u64 rx_bytes; > > + __u64 rx_errors; > > +}; > > + > > #endif > > It's highly likely that more stats would get added over time. It would > be good to document that here for interested parties and then make sure > iproute2 can handle different sized stats structs. e.g., commit support > to your repo, then add a new one (e.g, rx_drops) and verify the > combinations handle it. e.g., old kernel - new iproute2, new kernel - > old iproute, old - old and new-new. > Hi David, thanks for your review. I totally agree with you: we may want to add other counters in the future, even if they are not considered in RFC8986. With that in mind, the shared struct seg6_local_counters is not the best way to go if we want to add other counters (because it will be difficult to manage different sized structures when considering different kernel/iproute2 versions). To make it easier adding new counters, instead of sharing the struct seg6_local_counters, I would use netlink nested attributes to exchange counters individually. In this way, only recognized (nested) attributes can be processed by both the kernel and iproute2. 
For example: enum { SEG6_LOCAL_CNT_UNSPEC, SEG6_LOCAL_CNT_PAD, /* padding for 64 bits values */ SEG6_LOCAL_CNT_RX_PACKETS, SEG6_LOCAL_CNT_RX_BYTES, SEG6_LOCAL_CNT_RX_ERRORS, __SEG6_LOCAL_CNT_MAX, }; #define SEG6_LOCAL_CNT_MAX (__SEG6_LOCAL_CNT_MAX - 1) updating the policy for SEG6_LOCAL_COUNTERS to NLA_NESTED. Then, I create a new policy for counters which handles each supported counter separately. static const struct nla_policy seg6_local_counters_policy[SEG6_LOCAL_CNT_MAX + 1] = { [SEG6_LOCAL_CNT_RX_PACKETS] = { .type = NLA_U64 }, [SEG6_LOCAL_CNT_RX_BYTES] = { .type = NLA_U64 }, [SEG6_LOCAL_CNT_RX_ERRORS] = { .type = NLA_U64 }, }; At the end, I update the parse_nla_counters(), put_nla_counters(), etc according to the changes, i.e: - nla_parse_nested() in parse_nla_counters(); - nla_nest_{start/end}() and for each supported counter nla_put_u64_64bit() in put_nla_counters(). On the iproute2 side, we have to update the code to reflect the changes discussed above. I plan to issue an RFC v2 in a few days. Andrea
Re: [PATCH hyperv-next] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
On Fri, Apr 09, 2021 at 03:49:00PM +, Michael Kelley wrote: > From: Andrea Parri (Microsoft) Sent: Thursday, April > 8, 2021 9:15 AM > > > > Pointers to ring-buffer packets sent by Hyper-V are used within the > > guest VM. Hyper-V can send packets with erroneous values or modify > > packet fields after they are processed by the guest. To defend > > against these scenarios, return a copy of the incoming VMBus packet > > after validating its length and offset fields in hv_pkt_iter_first(). > > In this way, the packet can no longer be modified by the host. > > Andrea -- has anything changed in this version of this patch, except > the value of NETVSC_MAX_XFER_PAGE_RANGES? It used to be a > fixed 375, but now is NVSP_RSC_MAX, which is 562. > > If that's the only change, then > > Reviewed-by: Michael Kelley The only change here is indeed the value of NETVSC_MAX_XFER_PAGE_RANGES, apologies for the omission of the changelog. Thanks for the review. Andrea
Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
On Wed, 7 Apr 2021 13:24:04 -0700 Jakub Kicinski wrote: > On Wed, 7 Apr 2021 20:03:32 +0200 Andrea Mayer wrote: > > This patch provides counters for SRv6 Behaviors as defined in [1], section > > 6. For each SRv6 Behavior instance, the counters defined in [1] are: > > > > - the total number of packets that have been correctly processed; > > - the total amount of traffic in bytes of all packets that have been > >correctly processed; > > > > In addition, we introduces a new counter that counts the number of packets > > that have NOT been properly processed (i.e. errors) by an SRv6 Behavior > > instance. > > > > Each SRv6 Behavior instance can be configured, at the time of its creation, > > to make use of counters. > > This is done through iproute2 which allows the user to create an SRv6 > > Behavior instance specifying the optional "count" attribute as shown in the > > following example: > > > > $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0 > > > > per-behavior counters can be shown by adding "-s" to the iproute2 command > > line, i.e.: > > > > $ ip -s -6 route show 2001:db8::1 > > 2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0 > > > > [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters > > > > Signed-off-by: Andrea Mayer > > > +static int put_nla_counters(struct sk_buff *skb, struct seg6_local_lwt > > *slwt) > > +{ > > + struct seg6_local_counters counters = { 0, 0, 0 }; > > + struct nlattr *nla; > > + int i; > > + > > + nla = nla_reserve(skb, SEG6_LOCAL_COUNTERS, sizeof(counters)); > > + if (!nla) > > + return -EMSGSIZE; > > nla_reserve_64bit(), IIUC netlink guarantees alignment of 64 bit values. Hi Jakub, thanks for your review! Yes, we should guarantee alignment of 64 bit values. I will definitely follow your advice. Andrea
[PATCH hyperv-next] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
From: Andres Beltran Pointers to ring-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, return a copy of the incoming VMBus packet after validating its length and offset fields in hv_pkt_iter_first(). In this way, the packet can no longer be modified by the host. Signed-off-by: Andres Beltran Co-developed-by: Andrea Parri (Microsoft) Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 9 ++-- drivers/hv/hv_fcopy.c | 1 + drivers/hv/hv_kvp.c | 1 + drivers/hv/hyperv_vmbus.h | 2 +- drivers/hv/ring_buffer.c | 82 ++- drivers/net/hyperv/hyperv_net.h | 7 +++ drivers/net/hyperv/netvsc.c | 2 + drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 10 include/linux/hyperv.h| 48 +++--- net/vmw_vsock/hyperv_transport.c | 4 +- 11 files changed, 143 insertions(+), 25 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..b665db21e120d 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -597,12 +597,15 @@ static int __vmbus_open(struct vmbus_channel *newchannel, newchannel->onchannel_callback = onchannelcallback; newchannel->channel_callback_context = context; - err = hv_ringbuffer_init(>outbound, page, send_pages); + if (!newchannel->max_pkt_size) + newchannel->max_pkt_size = VMBUS_DEFAULT_MAX_PKT_SIZE; + + err = hv_ringbuffer_init(>outbound, page, send_pages, 0); if (err) goto error_clean_ring; - err = hv_ringbuffer_init(>inbound, -[send_pages], recv_pages); + err = hv_ringbuffer_init(>inbound, [send_pages], +recv_pages, newchannel->max_pkt_size); if (err) goto error_clean_ring; diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 59ce85e00a028..660036da74495 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -349,6 +349,7 @@ int hv_fcopy_init(struct hv_util_service *srv) { recv_buffer = srv->recv_buffer; 
fcopy_transaction.recv_channel = srv->channel; + fcopy_transaction.recv_channel->max_pkt_size = HV_HYP_PAGE_SIZE * 2; /* * When this driver loads, the user level daemon that diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c index b49962d312cef..c698592b83e42 100644 --- a/drivers/hv/hv_kvp.c +++ b/drivers/hv/hv_kvp.c @@ -757,6 +757,7 @@ hv_kvp_init(struct hv_util_service *srv) { recv_buffer = srv->recv_buffer; kvp_transaction.recv_channel = srv->channel; + kvp_transaction.recv_channel->max_pkt_size = HV_HYP_PAGE_SIZE * 4; /* * When this driver loads, the user level daemon that diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 9416e09ebd58c..42f3d9d123a12 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -174,7 +174,7 @@ extern int hv_synic_cleanup(unsigned int cpu); void hv_ringbuffer_pre_init(struct vmbus_channel *channel); int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - struct page *pages, u32 pagecnt); + struct page *pages, u32 pagecnt, u32 max_pkt_size); void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..848f3bba83f8b 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -190,7 +190,7 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel) /* Initialize the ring buffer. 
*/ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - struct page *pages, u32 page_cnt) + struct page *pages, u32 page_cnt, u32 max_pkt_size) { int i; struct page **pages_wraparound; @@ -232,6 +232,14 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, sizeof(struct hv_ring_buffer); ring_info->priv_read_index = 0; + /* Initialize buffer that holds copies of incoming packets */ + if (max_pkt_size) { + ring_info->pkt_buffer = kzalloc(max_pkt_size, GFP_KERNEL); + if (!ring_info->pkt_buffer) + return -ENOMEM; + ring_info->pkt_buffer_size = max_pkt_size; + } + spin_lock_init(_info->ring_lock); return 0; @@ -244,6 +252,9 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info) vunmap(ring_info->ring_buffer); ring_info->ring_buffer = NULL; mutex_unlock(_info->ring_buffer_mutex); + + kfree(ring_info->pkt_buffer); + ring_in
[RFC PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding all issues with allocating enough entries in the VMbus requestor. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 14 +++--- drivers/hv/ring_buffer.c | 12 ++--- drivers/net/hyperv/netvsc.c | 8 ++-- drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 73 ++- include/linux/hyperv.h| 13 +- 6 files changed, 92 insertions(+), 30 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..f78e02ace51e8 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); * vmbus_next_request_id - Returns a new request id. It is also * the index at which the guest memory address is stored. * Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @rqst_add: Guest memory address to be stored in the array */ -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 current_id; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); /* * vmbus_request_addr - Returns the memory address stored at @trans_id * in @rqstor. Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's * next request id. 
*/ -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 req_addr; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..46d8e038e4ee1 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, */ if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { - rqst_id = vmbus_next_request_id(>requestor, requestid); - if (rqst_id == VMBUS_RQST_ERROR) { - spin_unlock_irqrestore(_info->ring_lock, flags); - return -EAGAIN; + if (channel->next_request_id_callback != NULL) { + rqst_id = channel->next_request_id_callback(channel, requestid); + if (rqst_id == VMBUS_RQST_ERROR) { + spin_unlock_irqrestore(_info->ring_lock, flags); + return -EAGAIN; + } } } desc = hv_get_ring_buffer(outring_info) + old_write; @@ -341,7 +343,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, if (channel->rescind) { if (rqst_id != VMBUS_NO_RQSTOR) { /* Reclaim request ID to avoid leak of IDs */ - vmbus_request_addr(>requestor, rqst_id); + channel->request_addr_callback(channel, rqst_id); } return -ENODEV; } diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c64cc7639c39c..1a221ce2d6fdc 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -730,7 +730,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev, int queue_sends; u64 cmd_rqst; - cmd_rqst = vmbus_request_addr(>requestor, (u64)desc->trans_id); + cmd_rqst = channel->request_addr_callback(channel, (u64)desc->trans_id); if (cmd_rqst == VMBUS_RQST_ERROR) { netdev_err(ndev, "Incorrect transaction id\n"); return; @@ -790,8 +790,8 @@ 
static void netvsc_send_completion(struct net_device *ndev, /* First check if this is a VMBUS completion without data payload */ if (!msglen) { - cmd_rqst = vmbus_request_addr(_channel->requestor, - (u64)desc->trans_id); + cmd_rqst = incoming_channel->request_addr_callback(incoming_channel, +
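The core idea of this RFC — deriving VMbus request IDs from block-layer tags rather than a separate allocator — leans on blk_mq_unique_tag() packing the hardware-queue index and the per-queue tag into a single u32. A minimal user-space model of that packing (the 16-bit split mirrors the kernel's BLK_MQ_UNIQUE_TAG_BITS; the helper names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Model of blk_mq_unique_tag(): hardware-queue index in the upper
 * 16 bits, per-queue tag in the lower 16. This is a sketch of the
 * packing scheme, not the kernel implementation. */
#define UNIQUE_TAG_BITS 16
#define UNIQUE_TAG_MASK ((1U << UNIQUE_TAG_BITS) - 1)

static uint32_t unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << UNIQUE_TAG_BITS) | (tag & UNIQUE_TAG_MASK);
}

static uint16_t unique_tag_to_hwq(uint32_t utag)
{
	return utag >> UNIQUE_TAG_BITS;
}

static uint16_t unique_tag_to_tag(uint32_t utag)
{
	return utag & UNIQUE_TAG_MASK;
}
```

Because the block layer guarantees tag uniqueness per hardware queue, the packed value is unique per device, which is what lets storvsc drop the requestor array bookkeeping.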
[RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
This patch provides counters for SRv6 Behaviors as defined in [1], section 6. For each SRv6 Behavior instance, the counters defined in [1] are: - the total number of packets that have been correctly processed; - the total amount of traffic in bytes of all packets that have been correctly processed; In addition, we introduces a new counter that counts the number of packets that have NOT been properly processed (i.e. errors) by an SRv6 Behavior instance. Each SRv6 Behavior instance can be configured, at the time of its creation, to make use of counters. This is done through iproute2 which allows the user to create an SRv6 Behavior instance specifying the optional "count" attribute as shown in the following example: $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0 per-behavior counters can be shown by adding "-s" to the iproute2 command line, i.e.: $ ip -s -6 route show 2001:db8::1 2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0 [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters Signed-off-by: Andrea Mayer --- include/uapi/linux/seg6_local.h | 8 ++ net/ipv6/seg6_local.c | 133 +++- 2 files changed, 139 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h index 3b39ef1dbb46..ae5e3fd12b73 100644 --- a/include/uapi/linux/seg6_local.h +++ b/include/uapi/linux/seg6_local.h @@ -27,6 +27,7 @@ enum { SEG6_LOCAL_OIF, SEG6_LOCAL_BPF, SEG6_LOCAL_VRFTABLE, + SEG6_LOCAL_COUNTERS, __SEG6_LOCAL_MAX, }; #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) @@ -78,4 +79,11 @@ enum { #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) +/* SRv6 Behavior counters */ +struct seg6_local_counters { + __u64 rx_packets; + __u64 rx_bytes; + __u64 rx_errors; +}; + #endif diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 8936f48570fc..0f905a4410bd 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -93,6 +93,20 @@ struct seg6_end_dt_info { int 
hdrlen; }; +struct pcpu_seg6_local_counters { + u64_stats_t rx_packets; + u64_stats_t rx_bytes; + u64_stats_t rx_errors; + + struct u64_stats_sync syncp; +}; + +#define seg6_local_alloc_pcpu_counters(__gfp) \ + __netdev_alloc_pcpu_stats(struct pcpu_seg6_local_counters, \ + ((__gfp) | __GFP_ZERO)) + +#define SEG6_F_LOCAL_COUNTERS SEG6_F_ATTR(SEG6_LOCAL_COUNTERS) + struct seg6_local_lwt { int action; struct ipv6_sr_hdr *srh; @@ -105,6 +119,7 @@ struct seg6_local_lwt { #ifdef CONFIG_NET_L3_MASTER_DEV struct seg6_end_dt_info dt_info; #endif + struct pcpu_seg6_local_counters __percpu *pcpu_counters; int headroom; struct seg6_action_desc *desc; @@ -878,36 +893,43 @@ static struct seg6_action_desc seg6_action_table[] = { { .action = SEG6_LOCAL_ACTION_END, .attrs = 0, + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end, }, { .action = SEG6_LOCAL_ACTION_END_X, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -919,30 +941,35 @@ static struct seg6_action_desc seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONF
[RFC net-next 0/1] seg6: Counters for SRv6 Behaviors
pps As can be observed, throughputs achieved in scenarios (2),(3) did not suffer any observable degradation compared to scenario (1). Comments, suggestions and improvements are very welcome! Thanks, Andrea [2] https://www.cloudlab.us Andrea Mayer (1): seg6: add counters support for SRv6 Behaviors include/uapi/linux/seg6_local.h | 8 ++ net/ipv6/seg6_local.c | 133 +++- 2 files changed, 139 insertions(+), 2 deletions(-) -- 2.20.1
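The per-behavior accounting this series adds boils down to three counters updated on every packet. A user-space model of that update path (in the kernel these are per-CPU u64_stats_t fields guarded by u64_stats_sync; plain integers stand in for them here, and the helper name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the SRv6 per-behavior counters: packets/bytes correctly
 * processed, plus an error count for packets that were not. */
struct seg6_counters {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t rx_errors;
};

/* Account one packet of 'len' bytes; 'ok' says whether the behavior
 * instance processed it correctly. */
static void seg6_count_packet(struct seg6_counters *c, uint32_t len, bool ok)
{
	if (ok) {
		c->rx_packets++;
		c->rx_bytes += len;
	} else {
		c->rx_errors++;
	}
}
```

The real patch uses `__netdev_alloc_pcpu_stats()` so each CPU updates its own copy lock-free, which is why the cover letter reports no observable throughput degradation.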
[PATCH v2 1/2] clocksource: arm_global_timer: implement rate compensation whenever source clock changes
This patch adds rate change notification support for the parent clock; should that clock change, we try to adjust our prescaler in order to compensate (i.e. we adjust it so that we still get the same timer frequency). This is loosely based on what is done in timer-cadence-ttc. timer-sun5i, mips-gic-timer and smp_twd.c also seem to look at their parent clock rate and to perform some kind of adjustment whenever needed. In this particular case we have one single counter and prescaler for all of clocksource, clockevent and timer_delay, and we just update it for all of them (i.e. we don't let the rate drift and call clockevents_update_freq() to notify the kernel that our rate has changed). Note that there is apparently no other way to fix things up, because once we call register_current_timer_delay(), specifying the timer rate, that rate is not supposed to ever change. In order for this mechanism to work, we have to make assumptions about how much the parent clock may eventually decrease with respect to its initial rate, and set our initial prescaler to a value that we can eventually decrease enough to compensate. We provide a Kconfig option for this. In case we end up in a situation in which we are not able to compensate for the parent clock change, we fail, returning NOTIFY_BAD. This fixes a real-world problem with the Zynq arch not being able to use this driver and CPU_FREQ at the same time (because the ARM global timer is fed by the CPU clock, which may keep changing when CPU_FREQ is enabled).
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- drivers/clocksource/Kconfig| 13 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 2 files changed, 125 insertions(+), 10 deletions(-) diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 39aa21d01e05..19fc5f8883e0 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -358,6 +358,19 @@ config ARM_GLOBAL_TIMER help This option enables support for the ARM global timer unit. +config ARM_GT_INITIAL_PRESCALER_VAL + int "ARM global timer initial prescaler value" + default 1 + depends on ARM_GLOBAL_TIMER + help + When the ARM global timer initializes, its current rate is declared + to the kernel and maintained forever. Should it's parent clock + change, the driver tries to fix the timer's internal prescaler. + On some machs (i.e. Zynq) the initial prescaler value thus poses + bounds about how much the parent clock is allowed to decrease or + increase wrt the initial clock value. + This affects CPU_FREQ max delta from the initial frequency. 
+ config ARM_TIMER_SP804 bool "Support for Dual Timer SP804 module" if COMPILE_TEST depends on GENERIC_SCHED_CLOCK && CLKDEV_LOOKUP diff --git a/drivers/clocksource/arm_global_timer.c b/drivers/clocksource/arm_global_timer.c index 88b2d38a7a61..60a8047fd32e 100644 --- a/drivers/clocksource/arm_global_timer.c +++ b/drivers/clocksource/arm_global_timer.c @@ -31,6 +31,10 @@ #define GT_CONTROL_COMP_ENABLE BIT(1) /* banked */ #define GT_CONTROL_IRQ_ENABLE BIT(2) /* banked */ #define GT_CONTROL_AUTO_INCBIT(3) /* banked */ +#define GT_CONTROL_PRESCALER_SHIFT 8 +#define GT_CONTROL_PRESCALER_MAX0xF +#define GT_CONTROL_PRESCALER_MASK (GT_CONTROL_PRESCALER_MAX << \ +GT_CONTROL_PRESCALER_SHIFT) #define GT_INT_STATUS 0x0c #define GT_INT_STATUS_EVENT_FLAG BIT(0) @@ -39,6 +43,7 @@ #define GT_COMP1 0x14 #define GT_AUTO_INC0x18 +#define MAX_F_ERR 50 /* * We are expecting to be clocked by the ARM peripheral clock. * @@ -46,7 +51,8 @@ * the units for all operations. */ static void __iomem *gt_base; -static unsigned long gt_clk_rate; +struct notifier_block gt_clk_rate_change_nb; +static u32 gt_psv_new, gt_psv_bck, gt_target_rate; static int gt_ppi; static struct clock_event_device __percpu *gt_evt; @@ -96,7 +102,10 @@ static void gt_compare_set(unsigned long delta, int periodic) unsigned long ctrl; counter += delta; - ctrl = GT_CONTROL_TIMER_ENABLE; + ctrl = readl(gt_base + GT_CONTROL); + ctrl &= ~(GT_CONTROL_COMP_ENABLE | GT_CONTROL_IRQ_ENABLE | + GT_CONTROL_AUTO_INC | GT_CONTROL_AUTO_INC); + ctrl |= GT_CONTROL_TIMER_ENABLE; writel_relaxed(ctrl, gt_base + GT_CONTROL); writel_relaxed(lower_32_bits(counter), gt_base + GT_COMP0); writel_relaxed(upper_32_bits(counter), gt_base + GT_COMP1); @@ -123,7 +132,7 @@ static int gt_clockevent_shutdown(struct clock_event_device *evt) static int gt_clockevent_set_periodic(struct clock_event_device *evt) { - gt_compare_set
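The compensation in this patch reduces to choosing a prescaler value so that parent_rate / (psv + 1) still equals the target timer rate; when no in-range prescaler gets within MAX_F_ERR of the target, the notifier fails with NOTIFY_BAD. A standalone sketch of that computation (a linear search is used here for clarity; the helper name and search strategy are illustrative, not the driver's exact code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GT_CONTROL_PRESCALER_MAX 0xF
#define MAX_F_ERR 50 /* max tolerated rate error, in Hz, as in the driver */

/* Pick the prescaler value psv such that parent_rate / (psv + 1)
 * matches target_rate within MAX_F_ERR; return -1 when no in-range
 * value works (the driver would return NOTIFY_BAD in that case). */
static int gt_pick_prescaler(uint64_t parent_rate, uint64_t target_rate)
{
	for (int psv = 0; psv <= GT_CONTROL_PRESCALER_MAX; psv++) {
		int64_t err = (int64_t)(parent_rate / (uint64_t)(psv + 1)) -
			      (int64_t)target_rate;
		if (llabs(err) <= MAX_F_ERR)
			return psv;
	}
	return -1;
}
```

This also shows why ARM_GT_INITIAL_PRESCALER_VAL matters: starting from a larger prescaler leaves headroom to *decrease* it when CPU_FREQ lowers the parent clock.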
[PATCH v2 2/2] arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
Now the ARM global timer driver can work even if its source clock rate changes, so we don't need to disable that driver when CPU frequency scaling is in use. This causes the Zynq arch to gain support for timer delay and get_cycles(). Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm/mach-zynq/Kconfig b/arch/arm/mach-zynq/Kconfig index 43fb941dcd07..a56748d671c4 100644 --- a/arch/arm/mach-zynq/Kconfig +++ b/arch/arm/mach-zynq/Kconfig @@ -6,7 +6,7 @@ config ARCH_ZYNQ select ARCH_SUPPORTS_BIG_ENDIAN select ARM_AMBA select ARM_GIC - select ARM_GLOBAL_TIMER if !CPU_FREQ + select ARM_GLOBAL_TIMER select CADENCE_TTC_TIMER select HAVE_ARM_SCU if SMP select HAVE_ARM_TWD if SMP diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 19fc5f8883e0..9fa28237715a 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -360,6 +360,7 @@ config ARM_GLOBAL_TIMER config ARM_GT_INITIAL_PRESCALER_VAL int "ARM global timer initial prescaler value" + default 2 if ARCH_ZYNQ default 1 depends on ARM_GLOBAL_TIMER help -- 2.17.1
[PATCH v2 0/2] Fix missing entropy on Zynq arch due to get_cycles() not supported
Changes wrt v1: - rebased on latest kernel tree - fix cover letter: does -> doesn't A real-world problem has been seen with a Zynq-based board running Debian 10, where the ssh daemon takes a really long time to come up at boot. This is due to a lack of random numbers. Since commit 50ee7529ec450 ("random: try to actively add entropy rather than passively wait for it") we try to generate entropy whenever we are short of random numbers and someone needs them. This trick works only when a CPU cycle counter is available. On ARM this means that get_cycles() works and, in turn, read_current_timer() works. Zynq HW includes two "Cadence TTC" timers and the "ARM global timer". All these pieces of HW are fed by the CPU clock, which dynamically changes whenever CPU frequency scaling is enabled. The timer-cadence-ttc driver handles this scenario by looking at parent clock changes and adjusting things when required; it is the only usable clocksource when CPU frequency scaling is in use, because the arm_global_timer driver is disabled in Kconfig when CPU_FREQ is enabled for the Zynq arch. Unfortunately the timer-cadence-ttc driver doesn't register itself via register_current_timer_delay(), and that ultimately causes get_cycles() to always return zero, leading to the aforementioned lack-of-entropy problem. I believe the reason for this is that the Cadence TTC counter on Zynq silicon is only 16 bits wide. This patchset works around this by implementing, in the ARM global timer driver, a mechanism to compensate for parent clock variations, similar to what is done in the Cadence TTC timer driver, so that it can be used together with CPU frequency scaling on the Zynq arch. This proved to finally fix the problem on my Zynq-based Z-turn board.
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann Andrea Merello (2): clocksource: arm_global_timer: implement rate compensation whenever source clock changes arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig| 14 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 3 files changed, 127 insertions(+), 11 deletions(-) -- 2.17.1
Re: [PATCH 3/3] scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()
Hi Olaf, On Mon, Mar 29, 2021 at 06:37:21PM +0200, Olaf Hering wrote: > On Thu, Dec 17, Andrea Parri (Microsoft) wrote: > > > Check that the packet is of the expected size at least, don't copy data > > past the packet. > > > + if (hv_pkt_datalen(desc) < sizeof(struct vstor_packet) - > > + stor_device->vmscsi_size_delta) { > > + dev_err(>device, "Invalid packet len\n"); > > + continue; > > + } > > + > > Sorry for being late: > > It might be just cosmetic, but should this check be done prior the call to > vmbus_request_addr()? TBH, I'm not immediately seeing why it 'should'; it could make sense to move the check on the packet data length. > Unrelated: my copy of vmbus_request_addr() can return 0, which is apparently > not handled by this loop in storvsc_on_channel_callback(). Indeed, IDs of 0 are reserved for so called unsolicited messages; I think we should check that storvsc_on_io_completion() is not called on such messages. Thanks, Andrea
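The check under discussion is a plain minimum-length guard: before the payload is interpreted as a vstor_packet, the packet's data length must cover the structure (minus the negotiated vmscsi_size_delta, which accounts for fields absent on older protocol versions). As a standalone model (the function name and sizes are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the guard in storvsc_on_channel_callback(): reject packets
 * whose payload cannot hold a (possibly shrunk) vstor_packet. */
static bool storvsc_pkt_len_ok(size_t pkt_datalen, size_t vstor_packet_size,
			       size_t vmscsi_size_delta)
{
	/* Assumes vmscsi_size_delta <= vstor_packet_size, which the
	 * protocol negotiation guarantees in the driver. */
	return pkt_datalen >= vstor_packet_size - vmscsi_size_delta;
}
```

Whether this runs before or after the request-ID lookup only affects which malformed packets still consume a lookup; the copy past the packet is prevented either way.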
Re: [PATCH] leds: trigger: fix potential deadlock with libata
On Sun, Mar 07, 2021 at 10:02:32AM +0800, Boqun Feng wrote: > On Sat, Mar 06, 2021 at 09:39:54PM +0100, Marc Kleine-Budde wrote: > > Hello *, > > > > On 02.11.2020 11:41:52, Andrea Righi wrote: > > > We have the following potential deadlock condition: > > > > > > > > > WARNING: possible irq lock inversion dependency detected > > > 5.10.0-rc2+ #25 Not tainted > > > > > > swapper/3/0 just changed the state of lock: > > > 8880063bd618 (>lock){-...}-{2:2}, at: > > > ata_bmdma_interrupt+0x27/0x200 > > > but this lock took another, HARDIRQ-READ-unsafe lock in the past: > > > (>leddev_list_lock){.+.?}-{2:2} > > > > > > and interrupts could create inverse lock ordering between them. > > > > [...] > > > > > --- > > > drivers/leds/led-triggers.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c > > > index 91da90cfb11d..16d1a93a10a8 100644 > > > --- a/drivers/leds/led-triggers.c > > > +++ b/drivers/leds/led-triggers.c > > > @@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig, > > > enum led_brightness brightness) > > > { > > > struct led_classdev *led_cdev; > > > + unsigned long flags; > > > > > > if (!trig) > > > return; > > > > > > - read_lock(>leddev_list_lock); > > > + read_lock_irqsave(>leddev_list_lock, flags); > > > list_for_each_entry(led_cdev, >led_cdevs, trig_list) > > > led_set_brightness(led_cdev, brightness); > > > - read_unlock(>leddev_list_lock); > > > + read_unlock_irqrestore(>leddev_list_lock, flags); > > > } > > > EXPORT_SYMBOL_GPL(led_trigger_event); > > > > meanwhile this patch hit v5.10.x stable and caused a performance > > degradation on our use case: > > > > It's an embedded ARM system, 4x Cortex A53, with an SPI attached CAN > > controller. CAN stands for Controller Area Network and here used to > > connect to some automotive equipment. Over CAN an ISOTP (a CAN-specific > > Transport Protocol) transfer is running. 
With this patch, we see CAN > > frames delayed for ~6ms, the usual gap between CAN frames is 240µs. > > > > Reverting this patch, restores the old performance. > > > > What is the best way to solve this dilemma? Identify the critical path > > in our use case? Is there a way we can get around the irqsave in > > led_trigger_event()? > > > > Probably, we can change from rwlock to rcu here, POC code as follow, > only compile tested. Marc, could you see whether this help the > performance on your platform? Please note that I haven't test it in a > running kernel and I'm not that familir with led subsystem, so use it > with caution ;-) If we don't want to touch the led subsystem at all maybe we could try to fix the problem in libata, we just need to prevent calling led_trigger_blink_oneshot() with >lock held from ata_qc_complete(), maybe doing the led blinking from another context (a workqueue for example)? -Andrea
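The alternative floated at the end — moving the blink out of the completion path into another context such as a workqueue — can be modeled in user space with a worker thread draining a counter, so the hot path only takes a short lock and the slow work never runs with interrupts disabled. A rough pthread-based model of that design (all names are illustrative; this is not libata or LED-core code):

```c
#include <pthread.h>

/* User-space model of deferring LED blinking to a workqueue: the
 * "completion path" only bumps a pending counter under a short lock;
 * a worker thread performs the (potentially slow) blinking. */
struct blink_work {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int pending;	/* blinks requested, not yet performed */
	int blinked;	/* blinks performed by the worker */
	int stop;
};

/* Called from the (modeled) completion path: cheap and non-blocking. */
static void request_blink(struct blink_work *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending++;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

static void *blink_worker(void *arg)
{
	struct blink_work *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->pending && !w->stop)
			pthread_cond_wait(&w->cond, &w->lock);
		if (w->pending) {
			w->pending--;
			w->blinked++;	/* stands in for led_trigger_blink_oneshot() */
		} else {
			break;		/* stop requested and queue drained */
		}
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}
```

The trade-off versus the irqsave (or RCU) approaches is latency: the blink happens slightly later, but the completion path stays short.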
[PATCH] Drivers: hv: vmbus: Drop error message when 'No request id available'
Running out of request IDs on a channel essentially produces the same effect as running out of space in the ring buffer, in that -EAGAIN is returned. The error message in hv_ringbuffer_write() should either be dropped (since we don't output a message when the ring buffer is full) or be made conditional/debug-only. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) Fixes: e8b7db38449ac ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening") --- drivers/hv/ring_buffer.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 35833d4d1a1dc..ecd82ebfd5bc4 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -313,7 +313,6 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, rqst_id = vmbus_next_request_id(&channel->requestor, requestid); if (rqst_id == VMBUS_RQST_ERROR) { spin_unlock_irqrestore(&outring_info->ring_lock, flags); - pr_err("No request id available\n"); return -EAGAIN; } } -- 2.25.1
[PATCH net] hv_netvsc: Fix validation in netvsc_linkstatus_callback()
Contrary to the RNDIS protocol specification, certain (pre-Fe) implementations of Hyper-V's vSwitch did not account for the status buffer field in the length of an RNDIS packet; the bug was fixed in newer implementations. Validate the status buffer fields using the length of the 'vmtransfer_page' packet (all implementations), that is known/validated to be less than or equal to the receive section size and not smaller than the length of the RNDIS message. Reported-by: Dexuan Cui Suggested-by: Haiyang Zhang Signed-off-by: Andrea Parri (Microsoft) Fixes: 505e3f00c3f36 ("hv_netvsc: Add (more) validation for untrusted Hyper-V values") --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc_drv.c | 13 + drivers/net/hyperv/rndis_filter.c | 2 +- 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index e1a497d3c9ba4..59ac04a610adb 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -229,7 +229,7 @@ int netvsc_send(struct net_device *net, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, struct rndis_message *resp, - void *data); + void *data, u32 data_buflen); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8176fa0c8b168..15f262b70489e 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -744,7 +744,7 @@ static netdev_tx_t netvsc_start_xmit(struct sk_buff *skb, */ void netvsc_linkstatus_callback(struct net_device *net, struct rndis_message *resp, - void *data) + void *data, u32 data_buflen) { struct rndis_indicate_status *indicate = >msg.indicate_status; struct net_device_context *ndev_ctx = netdev_priv(net); @@ -765,11 +765,16 @@ void netvsc_linkstatus_callback(struct net_device *net, if (indicate->status == RNDIS_STATUS_LINK_SPEED_CHANGE) { u32 speed; - 
/* Validate status_buf_offset */ + /* Validate status_buf_offset and status_buflen. +* +* Certain (pre-Fe) implementations of Hyper-V's vSwitch didn't account +* for the status buffer field in resp->msg_len; perform the validation +* using data_buflen (>= resp->msg_len). +*/ if (indicate->status_buflen < sizeof(speed) || indicate->status_buf_offset < sizeof(*indicate) || - resp->msg_len - RNDIS_HEADER_SIZE < indicate->status_buf_offset || - resp->msg_len - RNDIS_HEADER_SIZE - indicate->status_buf_offset + data_buflen - RNDIS_HEADER_SIZE < indicate->status_buf_offset || + data_buflen - RNDIS_HEADER_SIZE - indicate->status_buf_offset < indicate->status_buflen) { netdev_err(net, "invalid rndis_indicate_status packet\n"); return; diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 123cc9d25f5ed..c0e89e107d575 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -620,7 +620,7 @@ int rndis_filter_receive(struct net_device *ndev, case RNDIS_MSG_INDICATE: /* notification msgs */ - netvsc_linkstatus_callback(ndev, rndis_msg, data); + netvsc_linkstatus_callback(ndev, rndis_msg, data, buflen); break; default: netdev_err(ndev, -- 2.25.1
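The fix amounts to four ordered bounds checks on host-supplied offsets and lengths, taken against the already-validated data_buflen instead of the untrusted resp->msg_len, and ordered so no subtraction can underflow. A standalone model of the check (the constants are stand-ins for the real RNDIS sizes):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the RNDIS sizes involved (illustrative values). */
#define RNDIS_HEADER_SIZE   8	/* bytes preceding the message body */
#define INDICATE_HDR_SIZE  20	/* sizeof(struct rndis_indicate_status) */
#define SPEED_SIZE          4	/* sizeof(u32 speed) */

/* Model of the fixed validation. Assumes data_buflen >= RNDIS_HEADER_SIZE,
 * which rndis_filter_receive() has already established; the guard order
 * ensures the later subtractions cannot underflow. */
static bool status_buf_ok(uint32_t data_buflen, uint32_t status_buf_offset,
			  uint32_t status_buflen)
{
	if (status_buflen < SPEED_SIZE ||
	    status_buf_offset < INDICATE_HDR_SIZE ||
	    data_buflen - RNDIS_HEADER_SIZE < status_buf_offset ||
	    data_buflen - RNDIS_HEADER_SIZE - status_buf_offset < status_buflen)
		return false;
	return true;
}
```

Using data_buflen (>= resp->msg_len) makes the check robust against pre-Fe vSwitches that under-report msg_len, which is exactly the bug the patch works around.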
Re: [PATCH AUTOSEL 5.11 50/67] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 02:16:00PM +0100, Andrea Parri wrote: > On Wed, Feb 24, 2021 at 07:50:08AM -0500, Sasha Levin wrote: > > From: "Andrea Parri (Microsoft)" > > > > [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ] > > > > __vmbus_open() and vmbus_teardown_gpadl() do not inizialite the memory > > for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown > > objects they allocate respectively. These objects contain padding bytes > > and fields that are left uninitialized and that are later sent to the > > host, potentially leaking guest data. Zero initialize such fields to > > avoid leaking sensitive information to the host. > > > > Reported-by: Juan Vazquez > > Signed-off-by: Andrea Parri (Microsoft) > > Reviewed-by: Michael Kelley > > Link: > > https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com > > Signed-off-by: Wei Liu > > Signed-off-by: Sasha Levin > > Sasha - This patch is one of a group of patches where a Linux guest running on > Hyper-V will start assuming that hypervisor behavior might be malicious, and > guards against such behavior. Because this is a new assumption, these patches > are more properly treated as new functionality rather than as bug fixes. So I > would propose that we *not* bring such patches back to stable branches. For future/similar cases: I'm wondering, is there some way to annotate a patch with "please do not bring it back"? Thanks, Andrea
Re: [PATCH AUTOSEL 4.14 15/16] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:55:12AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 5bf633c15cd4b..6ddda97030628 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -942,6 +942,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 63cd81e5610d1..22e2c2d75361e 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -710,6 +710,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
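The fix quoted above is a claim-the-channel-once idiom: the first vmbus_onoffer_rescind() instance to take the mutex sets rescind_ref and proceeds with the teardown; any later instance (triggered by a duplicate rescind message from a malicious host) sees the flag and backs off. A small user-space model of that idiom (names are illustrative):

```c
#include <pthread.h>
#include <stdbool.h>

/* Model of the rescind_ref idiom: whoever sets the flag first, under
 * the mutex, owns the teardown; every later caller must back off,
 * preventing a double release / use-after-free of the channel. */
struct channel {
	pthread_mutex_t lock;	/* stands in for vmbus_connection.channel_mutex */
	bool rescind_ref;
};

/* Returns true for exactly one caller per channel lifetime. */
static bool claim_rescind(struct channel *ch)
{
	bool claimed = false;

	pthread_mutex_lock(&ch->lock);
	if (!ch->rescind_ref) {
		ch->rescind_ref = true;
		claimed = true;
	}
	pthread_mutex_unlock(&ch->lock);
	return claimed;
}
```

Because the test-and-set happens entirely under the same mutex used by the channel lookup, no second instance can observe the channel between the check and the flag update.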
Re: [PATCH AUTOSEL 5.10 40/56] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 07:51:56AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ] > > __vmbus_open() and vmbus_teardown_gpadl() do not inizialite the memory > for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown > objects they allocate respectively. These objects contain padding bytes > and fields that are left uninitialized and that are later sent to the > host, potentially leaking guest data. Zero initialize such fields to > avoid leaking sensitive information to the host. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > index fbdda9938039a..f9f04b5cd303f 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -548,7 +548,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > goto error_clean_ring; > > /* Create and init the channel open message */ > - open_info = kmalloc(sizeof(*open_info) + > + open_info = kzalloc(sizeof(*open_info) + > sizeof(struct vmbus_channel_open_channel), > GFP_KERNEL); > if (!open_info) { > @@ -674,7 +674,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, > u32 gpadl_handle) > unsigned long flags; > int ret; > > - info = kmalloc(sizeof(*info) + > + info = kzalloc(sizeof(*info) + > sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL); > if (!info) > return -ENOMEM; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.10 41/56] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:51:57AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 1d44bb635bb84..a9f58840f85dc 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1049,6 +1049,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 1ce131f29f3b4..376f0f9e19650 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -786,6 +786,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.4 30/40] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:53:30AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 452307c79e4b9..dd4e890cf1b1d 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1048,6 +1048,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 67d9b5a374600..51e2134b32a21 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -734,6 +734,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 4.19 21/26] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:54:29AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 7920b0d7e35a7..1322e799938af 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -954,6 +954,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 35461d49d3aee..59525fe25abde 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -736,6 +736,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.11 51/67] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:50:09AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Sasha - This patch is one of a group of patches where a Linux guest running on Hyper-V will start assuming that hypervisor behavior might be malicious, and guards against such behavior. Because this is a new assumption, these patches are more properly treated as new functionality rather than as bug fixes. So I would propose that we *not* bring such patches back to stable branches. Thanks, Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 1d44bb635bb84..a9f58840f85dc 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1049,6 +1049,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 5ddb479c4d4cb..ef3573e99d989 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -803,6 +803,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.11 50/67] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 07:50:08AM -0500, Sasha Levin wrote:
> From: "Andrea Parri (Microsoft)"
>
> [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ]
>
> __vmbus_open() and vmbus_teardown_gpadl() do not initialize the memory
> for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown
> objects they allocate respectively. These objects contain padding bytes
> and fields that are left uninitialized and that are later sent to the
> host, potentially leaking guest data. Zero initialize such fields to
> avoid leaking sensitive information to the host.
>
> Reported-by: Juan Vazquez
> Signed-off-by: Andrea Parri (Microsoft)
> Reviewed-by: Michael Kelley
> Link: https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com
> Signed-off-by: Wei Liu
> Signed-off-by: Sasha Levin

Sasha - This patch is one of a group of patches where a Linux guest
running on Hyper-V will start assuming that hypervisor behavior might
be malicious, and guards against such behavior. Because this is a new
assumption, these patches are more properly treated as new
functionality rather than as bug fixes. So I would propose that we
*not* bring such patches back to stable branches.
Thanks, Andrea > --- > drivers/hv/channel.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > index 6fb0c76bfbf81..0bd202de79600 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -618,7 +618,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > goto error_clean_ring; > > /* Create and init the channel open message */ > - open_info = kmalloc(sizeof(*open_info) + > + open_info = kzalloc(sizeof(*open_info) + > sizeof(struct vmbus_channel_open_channel), > GFP_KERNEL); > if (!open_info) { > @@ -745,7 +745,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, > u32 gpadl_handle) > unsigned long flags; > int ret; > > - info = kmalloc(sizeof(*info) + > + info = kzalloc(sizeof(*info) + > sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL); > if (!info) > return -ENOMEM; > -- > 2.27.0 >
[PATCH 2/2] arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
Now the ARM global timer driver can work even if its source clock rate
changes, so we don't need to disable that driver when CPU frequency
scaling is in use. This causes the Zynq arch to get support for timer
delay and get_cycles().

Signed-off-by: Andrea Merello
Cc: Patrice Chotard
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: Michal Simek
Cc: Sören Brinkmann
---
 arch/arm/mach-zynq/Kconfig  | 2 +-
 drivers/clocksource/Kconfig | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mach-zynq/Kconfig b/arch/arm/mach-zynq/Kconfig
index 43fb941dcd07..a56748d671c4 100644
--- a/arch/arm/mach-zynq/Kconfig
+++ b/arch/arm/mach-zynq/Kconfig
@@ -6,7 +6,7 @@ config ARCH_ZYNQ
 	select ARCH_SUPPORTS_BIG_ENDIAN
 	select ARM_AMBA
 	select ARM_GIC
-	select ARM_GLOBAL_TIMER if !CPU_FREQ
+	select ARM_GLOBAL_TIMER
 	select CADENCE_TTC_TIMER
 	select HAVE_ARM_SCU if SMP
 	select HAVE_ARM_TWD if SMP
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 0f1c625275a0..421cd763bdb2 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -387,6 +387,7 @@ config ARM_GLOBAL_TIMER
 
 config ARM_GT_INITIAL_PRESCALER_VAL
 	int "ARM global timer initial prescaler value"
+	default 2 if ARCH_ZYNQ
 	default 1
 	depends on ARM_GLOBAL_TIMER
 	help
-- 
2.17.1
[PATCH 0/2] Fix missing entropy on Zynq arch due to get_cycles() not supported
A real-world problem has been seen with a Zynq-based board running
Debian 10, where the ssh daemon takes a really long time to come up at
boot. This is due to a lack of random numbers. Since commit
50ee7529ec450 ("random: try to actively add entropy rather than
passively wait for it") we try to generate entropy whenever we are
short of random numbers and someone needs them. This trick works only
when a CPU cycle counter is available. On ARM this means that
get_cycles() works and, in turn, read_current_timer() works.

Zynq HW includes two "cadence TTC" timers and the "ARM global timer".
All these pieces of HW are fed by the CPU clock, which dynamically
changes whenever CPU frequency scaling is enabled. In the
timer-cadence-ttc driver this scenario is handled by looking at parent
clock changes and adjusting things when required. This is the only
usable clocksource when CPU frequency scaling is in use; the
arm_global_timer driver is disabled in Kconfig when CPU_FREQ is
enabled for the Zynq arch.

Unfortunately the timer-cadence-ttc driver does not register itself
via register_current_timer_delay(), and that ultimately ends up in
get_cycles() always returning zero, causing the aforementioned lack of
entropy problem. I believe that the reason for this is that the
Cadence TTC counter on Zynq silicon is only 16-bit wide.

This patchset works around this by implementing in the ARM global
timer driver a mechanism to compensate for parent clock variations,
similarly to what is done in the Cadence TTC timer driver, so that it
can be used together with CPU frequency scaling on the Zynq arch. This
proved to finally fix the problem on my Zynq-based Z-turn board.
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann Andrea Merello (2): clocksource: arm_global_timer: implement rate compensation whenever source clock changes arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig| 14 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 3 files changed, 127 insertions(+), 11 deletions(-) -- 2.17.1
[PATCH 1/2] clocksource: arm_global_timer: implement rate compensation whenever source clock changes
This patch adds rate change notification support for the parent clock;
should that clock change, we try to adjust our prescaler in order to
compensate (i.e. we adjust to still get the same timer frequency).

This is loosely based on what is done in timer-cadence-ttc.
timer-sun5i, mips-gic-timer and smp_twd.c also seem to look at their
parent clock rate and to perform some kind of adjustment whenever
needed. In this particular case we have one single counter and
prescaler for all of clocksource, clockevent and timer_delay, and we
just update it for all (i.e. we don't let the rate change and then
call clockevents_update_freq() to notify the kernel that our rate has
changed).

Note that there is apparently no other way to fix things up, because
once we call register_current_timer_delay(), specifying the timer
rate, it seems that that rate is not supposed to ever change.

In order for this mechanism to work, we have to make assumptions about
how much the parent clock may eventually decrease with respect to the
initial rate, and set our initial prescaler to a value that we can
eventually decrease enough to compensate. We provide an option in
Kconfig for this. In case we end up in a situation in which we are not
able to compensate for the parent clock change, we fail, returning
NOTIFY_BAD.

This fixes a real-world problem with the Zynq arch not being able to
use this driver and CPU_FREQ at the same time (because the ARM global
timer is fed by the CPU clock, which may keep changing when CPU_FREQ
is enabled).
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- drivers/clocksource/Kconfig| 13 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 2 files changed, 125 insertions(+), 10 deletions(-) diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 14c7c4712478..0f1c625275a0 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -385,6 +385,19 @@ config ARM_GLOBAL_TIMER help This option enables support for the ARM global timer unit. +config ARM_GT_INITIAL_PRESCALER_VAL + int "ARM global timer initial prescaler value" + default 1 + depends on ARM_GLOBAL_TIMER + help + When the ARM global timer initializes, its current rate is declared + to the kernel and maintained forever. Should it's parent clock + change, the driver tries to fix the timer's internal prescaler. + On some machs (i.e. Zynq) the initial prescaler value thus poses + bounds about how much the parent clock is allowed to decrease or + increase wrt the initial clock value. + This affects CPU_FREQ max delta from the initial frequency. 
+ config ARM_TIMER_SP804 bool "Support for Dual Timer SP804 module" if COMPILE_TEST depends on GENERIC_SCHED_CLOCK && CLKDEV_LOOKUP diff --git a/drivers/clocksource/arm_global_timer.c b/drivers/clocksource/arm_global_timer.c index 88b2d38a7a61..60a8047fd32e 100644 --- a/drivers/clocksource/arm_global_timer.c +++ b/drivers/clocksource/arm_global_timer.c @@ -31,6 +31,10 @@ #define GT_CONTROL_COMP_ENABLE BIT(1) /* banked */ #define GT_CONTROL_IRQ_ENABLE BIT(2) /* banked */ #define GT_CONTROL_AUTO_INCBIT(3) /* banked */ +#define GT_CONTROL_PRESCALER_SHIFT 8 +#define GT_CONTROL_PRESCALER_MAX0xF +#define GT_CONTROL_PRESCALER_MASK (GT_CONTROL_PRESCALER_MAX << \ +GT_CONTROL_PRESCALER_SHIFT) #define GT_INT_STATUS 0x0c #define GT_INT_STATUS_EVENT_FLAG BIT(0) @@ -39,6 +43,7 @@ #define GT_COMP1 0x14 #define GT_AUTO_INC0x18 +#define MAX_F_ERR 50 /* * We are expecting to be clocked by the ARM peripheral clock. * @@ -46,7 +51,8 @@ * the units for all operations. */ static void __iomem *gt_base; -static unsigned long gt_clk_rate; +struct notifier_block gt_clk_rate_change_nb; +static u32 gt_psv_new, gt_psv_bck, gt_target_rate; static int gt_ppi; static struct clock_event_device __percpu *gt_evt; @@ -96,7 +102,10 @@ static void gt_compare_set(unsigned long delta, int periodic) unsigned long ctrl; counter += delta; - ctrl = GT_CONTROL_TIMER_ENABLE; + ctrl = readl(gt_base + GT_CONTROL); + ctrl &= ~(GT_CONTROL_COMP_ENABLE | GT_CONTROL_IRQ_ENABLE | + GT_CONTROL_AUTO_INC | GT_CONTROL_AUTO_INC); + ctrl |= GT_CONTROL_TIMER_ENABLE; writel_relaxed(ctrl, gt_base + GT_CONTROL); writel_relaxed(lower_32_bits(counter), gt_base + GT_COMP0); writel_relaxed(upper_32_bits(counter), gt_base + GT_COMP1); @@ -123,7 +132,7 @@ static int gt_clockevent_shutdown(struct clock_event_device *evt) static int gt_clockevent_set_periodic(struct clock_event_device *evt) { - gt_compare_set
Regressions with VMBus/VSCs hardening changes
Hi all, I'm reporting two regressions following certain VMBus/VSCs hardening changes we've been discussing 'recently', unfortunately the first regression already touched/affects mainline while the second one is in hyperv-next: 1) [mainline] The first regression manifests with the following message (several): hv_vmbus: No request id available I could reliably reproduce such message/behavior by running the command: fio --name=seqwrite --rw=read --direct=1 --ioengine=libaio --bs=32k --numjobs=4 --size=2G --runtime=60 (the message is triggered when files are being created). I've bisected this regression to commit: 453de21c2b8281 ("scsi: storvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening") 2) [hyperv-next] The second regression manifests with various messages including: hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Unknown nvsp packet type received 51966 hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: unhandled packet type 0, tid 0 hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Incorrect transaction id hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Invalid rndis_msg (buflen: 262, msg_len: 1728) The connection was then typically lost/reset by the peer. I could reproduce such behavior/messages by running the test: ntttcp -r -m 8,*, # receiver ntttcp -s -m 8,*, -ns -t 60 # sender I bisected this regression to commit: a8c3209998afb5 ("Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer") --- I am investigating but don't have fixes for these regressions now: given the 'timing' (-rc7 with the next merge window at the door...) 
I would propose to revert/drop the affected changes:

1) 453de21c2b8281 is part of the so-called 'vmbus_requestor' series
   that was applied during the merge window for 5.11:

     e8b7db38449ac5 ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening")
     453de21c2b8281 ("scsi: storvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")
     4d18fcc95f5095 ("hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")

   I could prepare/submit patches to revert such commits (asap but
   likely not before tomorrow/late Saturday - EU time).

2) IIUC a8c3209998afb5 could be dropped (after rebase) without further
   modifications to hyperv-next.

Other suggestions/thoughts?

Thanks,
  Andrea
[PATCH net-next v2] seg6: fool-proof the processing of SRv6 behavior attributes
The set of required attributes for a given SRv6 behavior is identified
using a bitmap stored in an unsigned long, since the initial design of
SRv6 networking in Linux. Recently the same approach has been used for
identifying the optional attributes.

However, the number of attributes supported by SRv6 behaviors depends
on the size of the unsigned long type, which changes with the
architecture. Indeed, on a 64-bit architecture, an SRv6 behavior can
support up to 64 attributes while on a 32-bit architecture it can
support at most 32 attributes.

To fool-proof the processing of SRv6 behaviors we verify, at compile
time, that the set of all supported SRv6 attributes can be encoded
into a bitmap stored in an unsigned long. Otherwise, the kernel build
fails, forcing developers to reconsider adding a new attribute or to
extend the total number of attributes supported by SRv6 behaviors.

Moreover, we replace all patterns (1 << i) with the macro
SEG6_F_ATTR(i) in order to address potential overflow issues caused by
32-bit signed arithmetic.

Thanks to Colin Ian King for catching the overflow problem, providing
a solution and inspiring this patch. Thanks to Jakub Kicinski for his
useful suggestions during the design of this patch.

v2:
 - remove SEG6_LOCAL_MAX_SUPP, which is not strictly needed: it can be
   derived from the unsigned long type. Thanks to David Ahern for
   pointing it out.
Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 67 +-- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index b07f7c1c82a4..c2a0c78e84d4 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -31,6 +31,8 @@ #include #include +#define SEG6_F_ATTR(i) BIT(i) + struct seg6_local_lwt; /* callbacks used for customizing the creation and destruction of a behavior */ @@ -660,8 +662,8 @@ seg6_end_dt_mode seg6_end_dt6_parse_mode(struct seg6_local_lwt *slwt) unsigned long parsed_optattrs = slwt->parsed_optattrs; bool legacy, vrfmode; - legacy = !!(parsed_optattrs & (1 << SEG6_LOCAL_TABLE)); - vrfmode = !!(parsed_optattrs & (1 << SEG6_LOCAL_VRFTABLE)); + legacy = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_TABLE)); + vrfmode = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE)); if (!(legacy ^ vrfmode)) /* both are absent or present: invalid DT6 mode */ @@ -883,32 +885,32 @@ static struct seg6_action_desc seg6_action_table[] = { }, { .action = SEG6_LOCAL_ACTION_END_X, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, - .attrs = (1 << SEG6_LOCAL_OIF), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, - .attrs = (1 << SEG6_LOCAL_NH4), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, - .attrs = (1 << SEG6_LOCAL_VRFTABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -920,30 +922,30 @@ static struct seg6_action_desc 
seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONFIG_NET_L3_MASTER_DEV .attrs = 0, - .optattrs = (1 << SEG6_LOCAL_TABLE) | - (1 << SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE) | + SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), .slwt_ops = { .build_state = seg6_end_dt6_build, }, #else - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SE
Re: [PATCH net-next] seg6: fool-proof the processing of SRv6 behavior attributes
Hi David, thanks for your time. On Wed, 3 Feb 2021 08:59:40 -0700 David Ahern wrote: > On 2/2/21 11:56 AM, Andrea Mayer wrote: > > diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c > > index b07f7c1c82a4..7cc50d506902 100644 > > --- a/net/ipv6/seg6_local.c > > +++ b/net/ipv6/seg6_local.c > > @@ -31,6 +31,9 @@ > > #include > > #include > > > > +#define SEG6_F_ATTR(i) BIT(i) > > +#define SEG6_LOCAL_MAX_SUPP32 > > + > > SEG6_LOCAL_MAX_SUPP should not be needed; it can be derived from the type: > Yes, we can avoid the SEG6_LOCAL_MAX_SUPP. Using the BITS_PER_TYPE macro seems better to me and it avoids nailing the value 32 as the limit. At compile time, the kernel build will fail if the total number of SRv6 attributes will exceed the number of bits available in the unsigned long bitmap (whatever is the reference architecture). Therefore, I'm going to follow your suggestion! > BUILD_BUG_ON(BITS_PER_TYPE(unsigned long) > SEG6_LOCAL_MAX) > I think there is an issue here because BITS_PER_TYPE(unsigned long) is greater than the SEG6_LOCAL_MAX (currently = 9). I think it should be like this: BUILD_BUG_ON(SEG6_LOCAL_MAX + 1 > BITS_PER_TYPE(unsigned long)) I will send a v2 with the changes discussed so far. Thank you, Andrea
[PATCH net-next 2/2] hv_netvsc: Load and store the proper (NBL_HASH_INFO) per-packet info
Fix the typo. Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Fixes: 0ba35fe91ce34f ("hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer") --- drivers/net/hyperv/rndis_filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 6c48a4d627368..0c2ebe7ac6554 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -465,7 +465,7 @@ void rsc_add_data(struct netvsc_channel *nvchan, } nvchan->rsc.pktlen = len; if (hash_info != NULL) { - nvchan->rsc.csum_info = *csum_info; + nvchan->rsc.hash_info = *hash_info; nvchan->rsc.ppi_flags |= NVSC_RSC_HASH_INFO; } else { nvchan->rsc.ppi_flags &= ~NVSC_RSC_HASH_INFO; -- 2.25.1
[PATCH net] hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive()
Commit 44144185951a0f ("hv_netvsc: Add validation for untrusted Hyper-V values") added validation to rndis_filter_receive_data() (and rndis_filter_receive()) which introduced NVSP_STAT_FAIL-scenarios where the count is not updated/reset. Fix this omission, and prevent similar scenarios from occurring in the future. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Fixes: 44144185951a0f ("hv_netvsc: Add validation for untrusted Hyper-V values") --- drivers/net/hyperv/netvsc.c | 5 - drivers/net/hyperv/rndis_filter.c | 2 -- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 2350342b961ff..13bd48a75db76 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1262,8 +1262,11 @@ static int netvsc_receive(struct net_device *ndev, ret = rndis_filter_receive(ndev, net_device, nvchan, data, buflen); - if (unlikely(ret != NVSP_STAT_SUCCESS)) + if (unlikely(ret != NVSP_STAT_SUCCESS)) { + /* Drop incomplete packet */ + nvchan->rsc.cnt = 0; status = NVSP_STAT_FAIL; + } } enq_receive_complete(ndev, net_device, q_idx, diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 598713c0d5a87..3aab2b867fc0d 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -509,8 +509,6 @@ static int rndis_filter_receive_data(struct net_device *ndev, return ret; drop: - /* Drop incomplete packet */ - nvchan->rsc.cnt = 0; return NVSP_STAT_FAIL; } -- 2.25.1
[PATCH net-next 1/2] hv_netvsc: Allocate the recv_buf buffers after NVSP_MSG1_TYPE_SEND_RECV_BUF
The recv_buf buffers are allocated in netvsc_device_add(). Later in
netvsc_init_buf() the response to NVSP_MSG1_TYPE_SEND_RECV_BUF allows
the host to set up a recv_section_size that could be bigger than the
(default) value used for that allocation. The host-controlled value
could be used by a malicious host to bypass the check on the packet's
length in netvsc_receive() and hence to overflow the recv_buf buffer.

Move the allocation of the recv_buf buffers into netvsc_init_buf().

Reported-by: Juan Vazquez
Signed-off-by: Andrea Parri (Microsoft)
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: net...@vger.kernel.org
Fixes: 0ba35fe91ce34f ("hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer")
---
 drivers/net/hyperv/netvsc.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0fba8257fc119..9db1ea3affbb3 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -311,7 +311,7 @@ static int netvsc_init_buf(struct hv_device *device,
 	struct nvsp_message *init_packet;
 	unsigned int buf_size;
 	size_t map_words;
-	int ret = 0;
+	int i, ret = 0;
 
 	/* Get receive buffer area. */
 	buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -405,6 +405,16 @@ static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 	}
 
+	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
+		struct netvsc_channel *nvchan = &net_device->chan_table[i];
+
+		nvchan->recv_buf = kzalloc(net_device->recv_section_size, GFP_KERNEL);
+		if (nvchan->recv_buf == NULL) {
+			ret = -ENOMEM;
+			goto cleanup;
+		}
+	}
+
 	/* Setup receive completion ring.
 	 * Add 1 to the recv_section_cnt because at least one entry in a
 	 * ring buffer has to be empty.
@@ -1549,12 +1559,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device,
 	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
 		struct netvsc_channel *nvchan = &net_device->chan_table[i];
 
-		nvchan->recv_buf = kzalloc(device_info->recv_section_size, GFP_KERNEL);
-		if (nvchan->recv_buf == NULL) {
-			ret = -ENOMEM;
-			goto cleanup2;
-		}
-
 		nvchan->channel = device->channel;
 		nvchan->net_device = net_device;
 		u64_stats_init(&nvchan->tx_stats.syncp);
-- 
2.25.1
[PATCH net-next 0/2] Amend "hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer"
Patch #2 also addresses the Smatch complaint reported here: https://lkml.kernel.org/r/YBp2oVIdMe+G%2FliJ@mwanda/ Thanks, Andrea Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (2): hv_netvsc: Allocate the recv_buf buffers after NVSP_MSG1_TYPE_SEND_RECV_BUF hv_netvsc: Load and store the proper (NBL_HASH_INFO) per-packet info drivers/net/hyperv/netvsc.c | 18 +++--- drivers/net/hyperv/rndis_filter.c | 2 +- 2 files changed, 12 insertions(+), 8 deletions(-) -- 2.25.1
Re: [PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
On Tue, Feb 02, 2021 at 11:45:49AM -0800, Jakub Kicinski wrote: > On Tue, 2 Feb 2021 09:18:43 +0100 Andrea Parri wrote: > > Hi net maintainers, > > > > > > On Sat, Jan 30, 2021 at 12:50:06AM +, > > patchwork-bot+netdev...@kernel.org wrote: > > > Hello: > > > > > > This patch was applied to netdev/net-next.git (refs/heads/master): > > > > > > On Tue, 26 Jan 2021 17:29:07 +0100 you wrote: > > > > Pointers to receive-buffer packets sent by Hyper-V are used within the > > > > guest VM. Hyper-V can send packets with erroneous values or modify > > > > packet fields after they are processed by the guest. To defend against > > > > these scenarios, copy (sections of) the incoming packet after validating > > > > their length and offset fields in netvsc_filter_receive(). In this way, > > > > the packet can no longer be modified by the host. > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [v2,net-next] hv_netvsc: Copy packets sent by Hyper-V out of the > > > receive buffer > > > https://git.kernel.org/netdev/net-next/c/0ba35fe91ce3 > > > > I'd have some fixes on top of this and I'm wondering about the process: > > would > > you consider fixes/patches on top of this commit now? > > Fixes for bugs present in Linus's tree? > > You need to target the net tree, and give us instructions on how to > resolve the conflict which will arise from merging net into net-next. > > > would you rather prefer me to squash these fixes into a v3? other? > > Networking trees are immutable, and v2 was already applied. We could > do a revert, apply fix, apply v3, but we prefer to just handle the > merge conflict. Thanks for the clarification, Jakub. And sorry for the confusion; let me just send out the 'fixes'/patches (I have one targeting the net tree and two targeting the net-next tree, with no conflict between them), so that they can be reviewed and we can agree /discuss any further steps. Thanks, Andrea
[PATCH net-next] seg6: fool-proof the processing of SRv6 behavior attributes
The set of required attributes for a given SRv6 behavior is identified using a bitmap stored in an unsigned long, since the initial design of SRv6 networking in Linux. Recently the same approach has been used for identifying the optional attributes. However, the number of attributes supported by SRv6 behaviors depends on the size of the unsigned long type which changes with the architecture. Indeed, on a 64-bit architecture, an SRv6 behavior can support up to 64 attributes while on a 32-bit architecture it can support at most 32 attributes. To fool-proof the processing of SRv6 behaviors we will support at most 32 attributes independently from the reference architecture. Consequently, this patch aims to: - verify, at compile time, that the total number of attributes does not exceed the fixed value of 32. Otherwise, kernel build fails forcing developers to reconsider adding a new attribute or extend the total number of supported attributes by the SRv6 behaviors. - replace all patterns (1 << i) with the macro SEG6_F_ATTR(i) in order to address potential overflow issues caused by 32-bit signed arithmetic. Thanks to Colin Ian King for catching the overflow problem, providing a solution and inspiring this patch. Thanks to Jakub Kicinski for his useful suggestions during the design of this patch. 
Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 68 +-- 1 file changed, 40 insertions(+), 28 deletions(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index b07f7c1c82a4..7cc50d506902 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -31,6 +31,9 @@ #include #include +#define SEG6_F_ATTR(i) BIT(i) +#define SEG6_LOCAL_MAX_SUPP32 + struct seg6_local_lwt; /* callbacks used for customizing the creation and destruction of a behavior */ @@ -660,8 +663,8 @@ seg6_end_dt_mode seg6_end_dt6_parse_mode(struct seg6_local_lwt *slwt) unsigned long parsed_optattrs = slwt->parsed_optattrs; bool legacy, vrfmode; - legacy = !!(parsed_optattrs & (1 << SEG6_LOCAL_TABLE)); - vrfmode = !!(parsed_optattrs & (1 << SEG6_LOCAL_VRFTABLE)); + legacy = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_TABLE)); + vrfmode = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE)); if (!(legacy ^ vrfmode)) /* both are absent or present: invalid DT6 mode */ @@ -883,32 +886,32 @@ static struct seg6_action_desc seg6_action_table[] = { }, { .action = SEG6_LOCAL_ACTION_END_X, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, - .attrs = (1 << SEG6_LOCAL_OIF), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, - .attrs = (1 << SEG6_LOCAL_NH4), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, - .attrs = (1 << SEG6_LOCAL_VRFTABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -920,30 +923,30 @@ 
static struct seg6_action_desc seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONFIG_NET_L3_MASTER_DEV .attrs = 0, - .optattrs = (1 << SEG6_LOCAL_TABLE) | - (1 << SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE) | + SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), .slwt_ops = { .build_state = seg6_end_dt6_build, }, #else - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), #endif
Re: [PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Hi net maintainers, On Sat, Jan 30, 2021 at 12:50:06AM +, patchwork-bot+netdev...@kernel.org wrote: > Hello: > > This patch was applied to netdev/net-next.git (refs/heads/master): > > On Tue, 26 Jan 2021 17:29:07 +0100 you wrote: > > Pointers to receive-buffer packets sent by Hyper-V are used within the > > guest VM. Hyper-V can send packets with erroneous values or modify > > packet fields after they are processed by the guest. To defend against > > these scenarios, copy (sections of) the incoming packet after validating > > their length and offset fields in netvsc_filter_receive(). In this way, > > the packet can no longer be modified by the host. > > > > [...] > > Here is the summary with links: > - [v2,net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive > buffer > https://git.kernel.org/netdev/net-next/c/0ba35fe91ce3 I'd have some fixes on top of this and I'm wondering about the process: would you consider fixes/patches on top of this commit now? would you rather prefer me to squash these fixes into a v3? other? Thanks, Andrea
[PATCH v3 hyperv-next 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
Only the VSCs or ICs that have been hardened and that are critical for the successful adoption of Confidential VMs should be allowed if the guest is running isolated. This change reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel_mgmt.c | 38 ++ include/linux/hyperv.h| 1 + 2 files changed, 39 insertions(+) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 68950a1e4b638..f0ed730e2e4e4 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -31,101 +31,118 @@ const struct vmbus_device vmbus_devs[] = { { .dev_type = HV_IDE, HV_IDE_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* SCSI */ { .dev_type = HV_SCSI, HV_SCSI_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Fibre Channel */ { .dev_type = HV_FC, HV_SYNTHFC_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* Synthetic NIC */ { .dev_type = HV_NIC, HV_NIC_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Network Direct */ { .dev_type = HV_ND, HV_ND_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* PCIE */ { .dev_type = HV_PCIE, HV_PCIE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Frame Buffer */ { .dev_type = HV_FB, HV_SYNTHVID_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Keyboard */ { .dev_type = HV_KBD, HV_KBD_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic MOUSE */ { .dev_type = HV_MOUSE, HV_MOUSE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* KVP */ { .dev_type = HV_KVP, HV_KVP_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Time Synch */ { .dev_type = HV_TS, HV_TS_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* Heartbeat */ { .dev_type = HV_HB, HV_HEART_BEAT_GUID, .perf_device = false, + .allowed_in_isolated = true, }, 
 	/* Shutdown */
 	{ .dev_type = HV_SHUTDOWN,
 	  HV_SHUTDOWN_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = true,
 	},

 	/* File copy */
 	{ .dev_type = HV_FCOPY,
 	  HV_FCOPY_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Backup */
 	{ .dev_type = HV_BACKUP,
 	  HV_VSS_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Dynamic Memory */
 	{ .dev_type = HV_DM,
 	  HV_DM_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Unknown GUID */
 	{ .dev_type = HV_UNKNOWN,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},
 };

@@ -903,6 +920,20 @@ find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer)
 	return channel;
 }

+static bool vmbus_is_valid_device(const guid_t *guid)
+{
+	u16 i;
+
+	if (!hv_is_isolation_supported())
+		return true;
+
+	for (i = 0; i < ARRAY_SIZE(vmbus_devs); i++) {
+		if (guid_equal(guid, &vmbus_devs[i].guid))
+			return vmbus_devs[i].allowed_in_isolated;
+	}
+	return false;
+}
+
 /*
  * vmbus_onoffer - Handler for channel offers from vmbus in parent partition.
  *
@@ -917,6 +948,13 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)

 	trace_vmbus_onoffer(offer);

+	if (!vmbus_is_valid_device(&offer->offer.if_type)) {
+		pr_err_ratelimited("Invalid offer %d from the host supporting isolation\n",
+				   offer->child_relid);
+		atomic_dec(&vmbus_connection.offer_in_progress);
+		return;
+	}
+
 	oldchannel = find_primary_channel_by_offer(offer);

 	if (oldchannel != NULL) {
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index f0d48a368f131..e3426f8c12db9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -789,6 +789,7 @@ struct vmbus_device {
 	u16  dev_type;
 	guid_t guid;
 	bool perf_device;
+	bool allowed_in_isolated;
 };

 #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
--
2.25.1
[PATCH v3 hyperv-next 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..ccdfc6868cfc8 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.features_b & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..ea7bd8dff171c 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.features_b = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.features_b & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- a/include/asm-generic/hyperv-tlfs.h +++ 
b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c57799684170c..dff58a3db5d5c 100644
[PATCH v3 hyperv-next 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host
to be 5.2 or greater if the guest is running isolated. This reduces the
footprint of the code that will be exercised by Confidential VMs and
hence the exposure to bugs and vulnerabilities.

Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/connection.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 11170d9a2e1a5..c83612cddb995 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -244,6 +244,13 @@ int vmbus_connect(void)
 		break;
 	}

+	if (hv_is_isolation_supported() && version < VERSION_WIN10_V5_2) {
+		pr_err("Invalid VMBus version %d.%d (expected >= %d.%d) from the host supporting isolation\n",
+		       version >> 16, version & 0xFFFF,
+		       VERSION_WIN10_V5_2 >> 16, VERSION_WIN10_V5_2 & 0xFFFF);
+		ret = -EINVAL;
+		goto cleanup;
+	}
+
 	vmbus_proto_version = version;
 	pr_info("Vmbus version:%d.%d\n",
 		version >> 16, version & 0xFFFF);
--
2.25.1
[PATCH v3 hyperv-next 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Acked-by: Jakub Kicinski Reviewed-by: Haiyang Zhang Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..51005f2d4a821 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,10 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (hv_is_isolation_supported()) + netdev_info(ndev, "SR-IOV not advertised by guests on the host supporting isolation\n"); + else + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -591,6 +595,13 @@ static int netvsc_connect_vsp(struct hv_device *device, goto cleanup; } + if (hv_is_isolation_supported() && net_device->nvsp_version < NVSP_PROTOCOL_VERSION_61) { + netdev_err(ndev, "Invalid NVSP version 0x%x (expected >= 0x%x) from the host supporting isolation\n", + net_device->nvsp_version, NVSP_PROTOCOL_VERSION_61); + ret = -EPROTO; + goto cleanup; + } + pr_debug("Negotiated NVSP version:%x\n", 
net_device->nvsp_version); /* Send the ndis version */ @@ -1357,7 +1368,10 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (hv_is_isolation_supported()) + netdev_err(ndev, "Ignore VF_ASSOCIATION msg from the host supporting isolation\n"); + else + netvsc_send_vf(ndev, nvmsg, msglen); break; } } -- 2.25.1
[PATCH v3 hyperv-next 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Changes since v2 [1]: - improve/add logging (Michael Kelley) - rename 'hypercalls_features' to 'features_b' (Michael Kelley) - move VMBus and NVSC version checks after 'for' loop (Michael Kelley) - remove/inline helper functions (Michael Kelley) - other minor changes (Michael Kelley) Changes since v1 [2]: - improve/add logging (Haiyang Zhang) - move NVSC version check after version negotiation (Haiyang Zhang) [1] https://lkml.kernel.org/r/20210126115641.2527-1-parri.and...@gmail.com [1] https://lkml.kernel.org/r/20210119175841.22248-1-parri.and...@gmail.com Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 arch/x86/include/asm/hyperv-tlfs.h | 15 arch/x86/kernel/cpu/mshyperv.c | 9 +++ drivers/hv/channel_mgmt.c | 38 ++ drivers/hv/connection.c| 7 ++ drivers/net/hyperv/netvsc.c| 18 -- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 include/linux/hyperv.h | 1 + 9 files changed, 107 insertions(+), 2 deletions(-) -- 2.25.1
[PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Pointers to receive-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, copy (sections of) the incoming packet after validating their length and offset fields in netvsc_filter_receive(). In this way, the packet can no longer be modified by the host. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- Changes since v1 [1]: - copy certain PPIs into the RSC pkt [1] https://lkml.kernel.org/r/20210126113847.1676-1-parri.and...@gmail.com drivers/net/hyperv/hyperv_net.h | 93 +++-- drivers/net/hyperv/netvsc.c | 20 +++ drivers/net/hyperv/netvsc_drv.c | 24 drivers/net/hyperv/rndis_filter.c | 99 +-- 4 files changed, 150 insertions(+), 86 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 2a87cfa27ac02..e1a497d3c9ba4 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -105,9 +105,43 @@ struct ndis_recv_scale_param { /* NDIS_RECEIVE_SCALE_PARAMETERS */ u32 processor_masks_entry_size; }; -/* Fwd declaration */ -struct ndis_tcp_ip_checksum_info; -struct ndis_pkt_8021q_info; +struct ndis_tcp_ip_checksum_info { + union { + struct { + u32 is_ipv4:1; + u32 is_ipv6:1; + u32 tcp_checksum:1; + u32 udp_checksum:1; + u32 ip_header_checksum:1; + u32 reserved:11; + u32 tcp_header_offset:10; + } transmit; + struct { + u32 tcp_checksum_failed:1; + u32 udp_checksum_failed:1; + u32 ip_checksum_failed:1; + u32 tcp_checksum_succeeded:1; + u32 udp_checksum_succeeded:1; + u32 ip_checksum_succeeded:1; + u32 loopback:1; + u32 tcp_checksum_value_invalid:1; + u32 ip_checksum_value_invalid:1; + } receive; + u32 value; + }; +}; + +struct ndis_pkt_8021q_info { + union { + struct { + u32 pri:3; /* User Priority */ + u32 cfi:1; /* Canonical Format ID */ + u32 vlanid:12; /* VLAN ID */ + u32 
reserved:16; + }; + u32 value; + }; +}; /* * Represent netvsc packet which contains 1 RNDIS and 1 ethernet frame @@ -194,7 +228,8 @@ int netvsc_send(struct net_device *net, struct sk_buff *skb, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp); + struct rndis_message *resp, + void *data); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); @@ -884,9 +919,10 @@ struct multi_recv_comp { #define NVSP_RSC_MAX 562 /* Max #RSC frags in a vmbus xfer page pkt */ struct nvsc_rsc { - const struct ndis_pkt_8021q_info *vlan; - const struct ndis_tcp_ip_checksum_info *csum_info; - const u32 *hash_info; + struct ndis_pkt_8021q_info vlan; + struct ndis_tcp_ip_checksum_info csum_info; + u32 hash_info; + u8 ppi_flags; /* valid/present bits for the above PPIs */ u8 is_last; /* last RNDIS msg in a vmtransfer_page */ u32 cnt; /* #fragments in an RSC packet */ u32 pktlen; /* Full packet length */ @@ -894,6 +930,10 @@ struct nvsc_rsc { u32 len[NVSP_RSC_MAX]; }; +#define NVSC_RSC_VLAN BIT(0) /* valid/present bit for 'vlan' */ +#define NVSC_RSC_CSUM_INFO BIT(1) /* valid/present bit for 'csum_info' */ +#define NVSC_RSC_HASH_INFO BIT(2) /* valid/present bit for 'hash_info' */ + struct netvsc_stats { u64 packets; u64 bytes; @@ -1002,6 +1042,7 @@ struct net_device_context { struct netvsc_channel { struct vmbus_channel *channel; struct netvsc_device *net_device; + void *recv_buf; /* buffer to copy packets out from the receive buffer */ const struct vmpacket_descriptor *desc; struct napi_struct napi; struct multi_send_data msd; @@ -1234,18 +1275,6 @@ struct rndis_pktinfo_id { u16 pkt_id; }; -struct ndis_pkt_8021q_info { - union { - struct { - u32 pri:3; /* User Priority */ - u32 cfi:1; /* Canonical Format ID */ - u32 vlanid:12; /* VLAN ID */ -
Re: [PATCH net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
On Tue, Jan 26, 2021 at 12:38:47PM +0100, Andrea Parri (Microsoft) wrote:
> Pointers to receive-buffer packets sent by Hyper-V are used within the
> guest VM. Hyper-V can send packets with erroneous values or modify
> packet fields after they are processed by the guest. To defend against
> these scenarios, copy (sections of) the incoming packet after validating
> their length and offset fields in netvsc_filter_receive(). In this way,
> the packet can no longer be modified by the host.
>
> Reported-by: Juan Vazquez
> Signed-off-by: Andrea Parri (Microsoft)
> Cc: "David S. Miller"
> Cc: Jakub Kicinski
> Cc: net...@vger.kernel.org

Please ignore this submission, I'm sending a new version shortly...
Sorry for the hassle.

  Andrea
[PATCH v2 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..dc94e95c57b98 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.hypercalls_features & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..0d4aaf6694d01 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.hypercalls_features = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.hypercalls_features & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: GroupA 0x%x, GroupB 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- 
a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c5779968417
[PATCH v2 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Changes since v1 [1]: - improve/add logging (Haiyang Zhang) - move NVSC version check after version negotiation (Haiyang Zhang) [1] https://lkml.kernel.org/r/20210119175841.22248-1-parri.and...@gmail.com Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 + arch/x86/include/asm/hyperv-tlfs.h | 15 + arch/x86/kernel/cpu/mshyperv.c | 9 drivers/hv/channel_mgmt.c | 36 ++ drivers/hv/connection.c| 13 +++ drivers/net/hyperv/netvsc.c| 27 +++--- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + include/linux/hyperv.h | 1 + 9 files changed, 119 insertions(+), 3 deletions(-) -- 2.25.1
[PATCH v2 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Acked-by: Jakub Kicinski Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..afd92b4aa21fe 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,10 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (!hv_is_isolation_supported()) + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + else + netdev_info(ndev, "SR-IOV not advertised by guests on the host supporting isolation\n"); /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -563,6 +567,13 @@ static int negotiate_nvsp_ver(struct hv_device *device, return ret; } +static bool nvsp_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= NVSP_PROTOCOL_VERSION_61; + return true; +} + static int netvsc_connect_vsp(struct hv_device *device, struct netvsc_device *net_device, const struct netvsc_device_info *device_info) @@ -579,12 +590,19 @@ static int netvsc_connect_vsp(struct hv_device *device, init_packet = 
_device->channel_init_pkt; /* Negotiate the latest NVSP protocol supported */ - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { if (negotiate_nvsp_ver(device, net_device, init_packet, ver_list[i]) == 0) { + if (!nvsp_is_valid_version(ver_list[i])) { + netdev_err(ndev, "Invalid NVSP version 0x%x (expected >= 0x%x) from the host with isolation supported\n", + ver_list[i], NVSP_PROTOCOL_VERSION_61); + ret = -EPROTO; + goto cleanup; + } net_device->nvsp_version = ver_list[i]; break; } + } if (i < 0) { ret = -EPROTO; @@ -1357,7 +1375,10 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (!hv_is_isolation_supported()) + netvsc_send_vf(ndev, nvmsg, msglen); + else + netdev_err(ndev, "Ignore VF_ASSOCIATION msg from the host supporting isolation\n"); break; } } -- 2.25.1
[PATCH v2 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
[PATCH v2 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host to be 5.2 or greater if the guest is running isolated. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 11170d9a2e1a5..bcf4d7def6838 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -66,6 +66,13 @@ module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, "Maximal VMBus protocol version which can be negotiated"); +static bool vmbus_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= VERSION_WIN10_V5_2; + return true; +} + int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version) { int ret = 0; @@ -233,6 +240,12 @@ int vmbus_connect(void) goto cleanup; version = vmbus_versions[i]; + + if (!vmbus_is_valid_version(version)) { + ret = -EINVAL; + goto cleanup; + } + if (version > max_version) continue; -- 2.25.1
[PATCH net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Pointers to receive-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, copy (sections of) the incoming packet after validating their length and offset fields in netvsc_filter_receive(). In this way, the packet can no longer be modified by the host. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/hyperv_net.h | 4 +- drivers/net/hyperv/netvsc.c | 20 + drivers/net/hyperv/netvsc_drv.c | 9 ++-- drivers/net/hyperv/rndis_filter.c | 74 +++ 4 files changed, 75 insertions(+), 32 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 2a87cfa27ac02..2c0ab80d8ae4e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -194,7 +194,8 @@ int netvsc_send(struct net_device *net, struct sk_buff *skb, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp); + struct rndis_message *resp, + void *data); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); @@ -1002,6 +1003,7 @@ struct net_device_context { struct netvsc_channel { struct vmbus_channel *channel; struct netvsc_device *net_device; + void *recv_buf; /* buffer to copy packets out from the receive buffer */ const struct vmpacket_descriptor *desc; struct napi_struct napi; struct multi_send_data msd; diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 6184e99c7f31f..0fba8257fc119 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -131,6 +131,7 @@ static void free_netvsc_device(struct rcu_head *head) for (i = 0; i < VRSS_CHANNEL_MAX; i++) { xdp_rxq_info_unreg(&nvdev->chan_table[i].xdp_rxq); + kfree(nvdev->chan_table[i].recv_buf);
vfree(nvdev->chan_table[i].mrc.slots); } @@ -1284,6 +1285,19 @@ static int netvsc_receive(struct net_device *ndev, continue; } + /* We're going to copy (sections of) the packet into nvchan->recv_buf; +* make sure that nvchan->recv_buf is large enough to hold the packet. +*/ + if (unlikely(buflen > net_device->recv_section_size)) { + nvchan->rsc.cnt = 0; + status = NVSP_STAT_FAIL; + netif_err(net_device_ctx, rx_err, ndev, + "Packet too big: buflen=%u recv_section_size=%u\n", + buflen, net_device->recv_section_size); + + continue; + } + data = recv_buf + offset; nvchan->rsc.is_last = (i == count - 1); @@ -1535,6 +1549,12 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, for (i = 0; i < VRSS_CHANNEL_MAX; i++) { struct netvsc_channel *nvchan = &net_device->chan_table[i]; + nvchan->recv_buf = kzalloc(device_info->recv_section_size, GFP_KERNEL); + if (nvchan->recv_buf == NULL) { + ret = -ENOMEM; + goto cleanup2; + } + nvchan->channel = device->channel; nvchan->net_device = net_device; u64_stats_init(&nvchan->tx_stats.syncp); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index ac20c432d4d8f..1ec27bbe267da 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -743,7 +743,8 @@ static netdev_tx_t netvsc_start_xmit(struct sk_buff *skb, * netvsc_linkstatus_callback - Link up/down notification */ void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp) + struct rndis_message *resp, + void *data) { struct rndis_indicate_status *indicate = &resp->msg.indicate_status; struct net_device_context *ndev_ctx = netdev_priv(net); @@ -757,6 +758,9 @@ void netvsc_linkstatus_callback(struct net_device *net, return; } + /* Copy the RNDIS indicate status into nvchan->recv_buf */ + memcpy(indicate, data + RNDIS_HEADER_SIZE, sizeof(*indicate)); + /* Update the physical link speed when changing to another vSwitch */ if (indicate->status
[PATCH v2] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 arch/x86/um/Makefile | 3 ++- 4 files changed, 4 insertions(+), 8 deletions(-) ChangeLog (v1 -> v2): - do not break UML build diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95dbc911..83c98dae74a6 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index f1f96d4d8cd6..5997ec0b4b17 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git 
a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index ccd32877a3c4..c7cf79be7231 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -36,14 +36,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(.L_restore) popq %r11 popq %r10 @@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore) ret _ASM_NOKPROBE(.L_restore) SYM_CODE_END(.L_restore) -#endif diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index 77f70b969d14..3113800da63a 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -27,7 +27,8 @@ else obj-y += syscalls_64.o vdso/ -subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o ../entry/thunk_64.o +subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o +subarch-$(CONFIG_PREEMPTION) += ../entry/thunk_64.o endif -- 2.29.2
Re: [PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
> > > > @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device > > > > *device, > > > > init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = > > > > 1; > > > > > > > > if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { > > > > - > > > > init_packet->msg.v2_msg.send_ndis_config.capability.sriov = > > > > 1; > > > > + if (!hv_is_isolation_supported()) > > > > + init_packet- > > > > >msg.v2_msg.send_ndis_config.capability.sriov = 1; > > > > > > Please also add a log there stating we don't support sriov in this case. > > Otherwise, > > > customers will ask why vf not showing up. > > > > IIUC, you're suggesting that I append something like: > > > > + else > > + netdev_info(ndev, "SR-IOV not advertised: isolation > > supported\n"); > > > > I've added this locally; please let me know if you had something else > > /better in mind. > > This message explains the failure reason better: > "SR-IOV not advertised by guests on the host supporting isolation" Applied. > > > > @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device > > > > *device, > > > > init_packet = _device->channel_init_pkt; > > > > > > > > /* Negotiate the latest NVSP protocol supported */ > > > > - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) > > > > + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { > > > > + if (!nvsp_is_valid_version(ver_list[i])) { > > > > + ret = -EPROTO; > > > > + goto cleanup; > > > > + } > > > > > > This code can catch the invalid, but cannot get the current host nvsp > > version. > > > I'd suggest move this check after version negotiation is done. So we can > > > log > > what's > > > the current host nvsp version, and why we fail it (the expected nvsp ver). > > > > Mmh, invalid versions are not negotiated. How about I simply add the > > following logging right before the above 'ret = -EPROTO' say? 
> > > > + netdev_err(ndev, "Invalid NVSP version %x > > (expected >= %x): isolation supported\n", > > + ver_list[i], NVSP_PROTOCOL_VERSION_61); > > > > (or something along these lines) > > The negotiation process runs from the latest to oldest. If the host is 5, > your code > will fail before trying v6.0, and log: > "Invalid NVSP version 6 (expected >= 60001): isolation supported" > This will make user think the NVSP version is 6.0. > > Since you will let the NIC fail and cleanup, there is no harm to check the > version > after negotiation. And this case is unexpected from a "normal" host. So I > suggest > move the check after negotiation is done, then we know the actual host nvsp > version that causing this issue. And we can bring the accurate info to host > team > for better diagnosability. Fair enough, will do. > Please point out this invalid version is caused by the host side, like this: > "Invalid NVSP version 0x5 (expected >= 0x60001) from the host with > isolation support" > Also please use "0x%x" for hexadecimal numbers. Sure. > > > > @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct > > > > net_device *ndev, > > > > break; > > > > > > > > case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: > > > > - netvsc_send_vf(ndev, nvmsg, msglen); > > > > + if (!hv_is_isolation_supported()) > > > > + netvsc_send_vf(ndev, nvmsg, msglen); > > > > > > When the driver doesn't advertise SRIOV, this message is not expected. > > > Instead of ignore silently, we should log an error. > > > > I've appended: > > > > + else > > + netdev_err(ndev, "Unexpected VF message: > > isolation supported\n"); > > Please log the msg type: > "Ignore VF_ASSOCIATION msg from the host supporting isolation" Applied. Thanks, Andrea
Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
On Thu, Jan 21, 2021 at 09:52:01AM +0100, Andrea Righi wrote: > On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote: > > > > * tip-bot2 for Andrea Righi wrote: > > > > > The following commit has been merged into the x86/entry branch of tip: > > > > > > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > > Gitweb: > > > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > > Author:Andrea Righi > > > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 > > > Committer: Ingo Molnar > > > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 > > > > > > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y > > > > > > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an > > > empty object file. > > > > > > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU > > > assembler doesn't generate a symbol table for empty object files and > > > objtool fails with the following error when a valid symbol table cannot > > > be found: > > > > > > arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table > > > > > > To prevent this from happening, build thunk_$(BITS).o only if > > > CONFIG_PREEMPTION is enabled. 
> > > > > > BugLink: https://bugs.launchpad.net/bugs/1911359 > > > > > > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") > > > Signed-off-by: Andrea Righi > > > Signed-off-by: Ingo Molnar > > > Cc: Borislav Petkov > > > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 > > > > Hm, this fails to build on UML defconfig: > > > > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > > arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk': > > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined > > reference to `preempt_schedule' > > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > > arch/x86/um/../entry/thunk_64.o: in function > > `preempt_schedule_notrace_thunk': > > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined > > reference to `preempt_schedule_notrace' > > > > Thanks, > > > > Ingo > > I've been able to reproduce it, I'm looking at this right now. Thanks! I see, basically UML selects ARCH_NO_PREEMPT, but in arch/x86/um/Makefile it explicitly includes thunk_$(BITS).o regardless. Considering that thunk_$(BITS) only contains preemption code now, we can probably drop it from the Makefile, or, to be more consistent with the x86 change, we could include it only if CONFIG_PREEMPTION is enabled (even if it would never be, because UML has ARCH_NO_PREEMPT). If it's unlikely that preemption will be enabled in UML one day I'd probably go with the former, otherwise I'd go with the latter, because it looks more consistent. Opinions? Thanks, -Andrea
Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote: > > * tip-bot2 for Andrea Righi wrote: > > > The following commit has been merged into the x86/entry branch of tip: > > > > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > Gitweb: > > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > Author:Andrea Righi > > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 > > Committer: Ingo Molnar > > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 > > > > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y > > > > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an > > empty object file. > > > > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU > > assembler doesn't generate a symbol table for empty object files and > > objtool fails with the following error when a valid symbol table cannot > > be found: > > > > arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table > > > > To prevent this from happening, build thunk_$(BITS).o only if > > CONFIG_PREEMPTION is enabled. 
> > > > BugLink: https://bugs.launchpad.net/bugs/1911359 > > > > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") > > Signed-off-by: Andrea Righi > > Signed-off-by: Ingo Molnar > > Cc: Borislav Petkov > > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 > > Hm, this fails to build on UML defconfig: > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk': > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined > reference to `preempt_schedule' > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > arch/x86/um/../entry/thunk_64.o: in function > `preempt_schedule_notrace_thunk': > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined > reference to `preempt_schedule_notrace' > > Thanks, > > Ingo I've been able to reproduce it, I'm looking at this right now. Thanks! -Andrea
[tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
The following commit has been merged into the x86/entry branch of tip: Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 Gitweb: https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 Author:Andrea Righi AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 Committer: Ingo Molnar CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi Signed-off-by: Ingo Molnar Cc: Borislav Petkov Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95d..83c98da 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index 
f1f96d4..5997ec0 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index 496b11e..9d543c4 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -31,14 +31,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore) popq %r11 popq %r10 @@ -53,4 +50,3 @@ SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore) ret _ASM_NOKPROBE(__thunk_restore) SYM_CODE_END(__thunk_restore) -#endif
Re: [PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
> > @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device > > *device, > > init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; > > > > if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { > > - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = > > 1; > > + if (!hv_is_isolation_supported()) > > + init_packet- > > >msg.v2_msg.send_ndis_config.capability.sriov = 1; > > Please also add a log there stating we don't support sriov in this case. > Otherwise, > customers will ask why vf not showing up. IIUC, you're suggesting that I append something like: + else + netdev_info(ndev, "SR-IOV not advertised: isolation supported\n"); I've added this locally; please let me know if you had something else /better in mind. > > @@ -563,6 +565,13 @@ static int negotiate_nvsp_ver(struct hv_device > > *device, > > return ret; > > } > > > > +static bool nvsp_is_valid_version(u32 version) > > +{ > > + if (hv_is_isolation_supported()) > > + return version >= NVSP_PROTOCOL_VERSION_61; > > + return true; > Hosts support isolation should run nvsp 6.1+. This error is not expected. > Instead of fail silently, we should log an error to explain why it's failed, > and the current version and expected version. Please see my next comment below. > > +} > > + > > static int netvsc_connect_vsp(struct hv_device *device, > > struct netvsc_device *net_device, > > const struct netvsc_device_info *device_info) > > @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device > > *device, > > init_packet = _device->channel_init_pkt; > > > > /* Negotiate the latest NVSP protocol supported */ > > - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) > > + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { > > + if (!nvsp_is_valid_version(ver_list[i])) { > > + ret = -EPROTO; > > + goto cleanup; > > + } > > This code can catch the invalid, but cannot get the current host nvsp version. > I'd suggest move this check after version negotiation is done. 
So we can log > what's > the current host nvsp version, and why we fail it (the expected nvsp ver). Mmh, invalid versions are not negotiated. How about I simply add the following logging right before the above 'ret = -EPROTO' say? + netdev_err(ndev, "Invalid NVSP version %x (expected >= %x): isolation supported\n", + ver_list[i], NVSP_PROTOCOL_VERSION_61); (or something along these lines) > > @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct > > net_device *ndev, > > break; > > > > case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: > > - netvsc_send_vf(ndev, nvmsg, msglen); > > + if (!hv_is_isolation_supported()) > > + netvsc_send_vf(ndev, nvmsg, msglen); > > When the driver doesn't advertise SRIOV, this message is not expected. > Instead of ignore silently, we should log an error. I've appended: + else + netdev_err(ndev, "Unexpected VF message: isolation supported\n"); Please let me know if I got this wrong. Thanks, Andrea
[PATCH 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Hi all, To reduce the footprint of the code that will be exercised, and hence the exposure to bugs and vulnerabilities, restrict configurations and devices on 'isolated' VMs. Specs of the Isolation Configuration leaf (cf. patch #1) were derived from internal discussions with the Hyper-V team and, AFAICT, they are not publicly available yet. The series has some minor/naming conflict with on-going work aimed at enabling SNP VMs on Hyper-V[1]; such conflicts can be addressed later at the right time. Applies to hyperv-next. Thanks, Andrea [1] https://github.com/lantianyu/linux # cvm Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 + arch/x86/include/asm/hyperv-tlfs.h | 15 + arch/x86/kernel/cpu/mshyperv.c | 9 drivers/hv/channel_mgmt.c | 36 ++ drivers/hv/connection.c| 13 +++ drivers/net/hyperv/netvsc.c| 21 ++--- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + include/linux/hyperv.h | 1 + 9 files changed, 113 insertions(+), 3 deletions(-) -- 2.25.1
[PATCH 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
Only the VSCs or ICs that have been hardened and that are critical for the successful adoption of Confidential VMs should be allowed if the guest is running isolated. This change reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel_mgmt.c | 36 include/linux/hyperv.h| 1 + 2 files changed, 37 insertions(+) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 68950a1e4b638..774ee19e3e90d 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -31,101 +31,118 @@ const struct vmbus_device vmbus_devs[] = { { .dev_type = HV_IDE, HV_IDE_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* SCSI */ { .dev_type = HV_SCSI, HV_SCSI_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Fibre Channel */ { .dev_type = HV_FC, HV_SYNTHFC_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* Synthetic NIC */ { .dev_type = HV_NIC, HV_NIC_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Network Direct */ { .dev_type = HV_ND, HV_ND_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* PCIE */ { .dev_type = HV_PCIE, HV_PCIE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Frame Buffer */ { .dev_type = HV_FB, HV_SYNTHVID_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Keyboard */ { .dev_type = HV_KBD, HV_KBD_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic MOUSE */ { .dev_type = HV_MOUSE, HV_MOUSE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* KVP */ { .dev_type = HV_KVP, HV_KVP_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Time Synch */ { .dev_type = HV_TS, HV_TS_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* Heartbeat */ { .dev_type = HV_HB, HV_HEART_BEAT_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* 
Shutdown */ { .dev_type = HV_SHUTDOWN, HV_SHUTDOWN_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* File copy */ { .dev_type = HV_FCOPY, HV_FCOPY_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Backup */ { .dev_type = HV_BACKUP, HV_VSS_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Dynamic Memory */ { .dev_type = HV_DM, HV_DM_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Unknown GUID */ { .dev_type = HV_UNKNOWN, .perf_device = false, + .allowed_in_isolated = false, }, }; @@ -903,6 +920,20 @@ find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer) return channel; } +static bool vmbus_is_valid_device(const guid_t *guid) +{ + u16 i; + + if (!hv_is_isolation_supported()) + return true; + + for (i = 0; i < ARRAY_SIZE(vmbus_devs); i++) { + if (guid_equal(guid, &vmbus_devs[i].guid)) + return vmbus_devs[i].allowed_in_isolated; + } + return false; +} + /* * vmbus_onoffer - Handler for channel offers from vmbus in parent partition. * @@ -917,6 +948,11 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr) trace_vmbus_onoffer(offer); + if (!vmbus_is_valid_device(&offer->offer.if_type)) { + atomic_dec(&vmbus_connection.offer_in_progress); + return; + } + oldchannel = find_primary_channel_by_offer(offer); if (oldchannel != NULL) { diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index f0d48a368f131..e3426f8c12db9 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -789,6 +789,7 @@ struct vmbus_device { u16 dev_type; guid_t guid; bool perf_device; + bool allowed_in_isolated; }; #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096 -- 2.25.1
[PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..8027d553cb67d 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (!hv_is_isolation_supported()) + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -563,6 +565,13 @@ static int negotiate_nvsp_ver(struct hv_device *device, return ret; } +static bool nvsp_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= NVSP_PROTOCOL_VERSION_61; + return true; +} + static int netvsc_connect_vsp(struct hv_device *device, struct netvsc_device *net_device, const struct netvsc_device_info *device_info) @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device *device, init_packet = &net_device->channel_init_pkt; /* Negotiate the latest NVSP protocol supported */ - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; 
i--) + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { + if (!nvsp_is_valid_version(ver_list[i])) { + ret = -EPROTO; + goto cleanup; + } if (negotiate_nvsp_ver(device, net_device, init_packet, ver_list[i]) == 0) { net_device->nvsp_version = ver_list[i]; break; } + } if (i < 0) { ret = -EPROTO; @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (!hv_is_isolation_supported()) + netvsc_send_vf(ndev, nvmsg, msglen); break; } } -- 2.25.1
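The negotiation loop above walks ver_list from the newest protocol version downward and now bails out with -EPROTO as soon as an isolated guest would have to fall below NVSP 6.1. The gating can be sketched in self-contained userspace C (the version constants mirror the kernel's hyperv_net.h, but treat the whole model — in particular the "host accepts anything <= host_max" stand-in — as illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

#define NVSP_PROTOCOL_VERSION_2  0x30002
#define NVSP_PROTOCOL_VERSION_5  0x50000
#define NVSP_PROTOCOL_VERSION_6  0x60000
#define NVSP_PROTOCOL_VERSION_61 0x60001

static const unsigned int ver_list[] = {
	NVSP_PROTOCOL_VERSION_2, NVSP_PROTOCOL_VERSION_5,
	NVSP_PROTOCOL_VERSION_6, NVSP_PROTOCOL_VERSION_61,
};

/* Mirrors nvsp_is_valid_version() in the patch. */
static bool nvsp_is_valid_version(unsigned int version, bool isolated)
{
	if (isolated)
		return version >= NVSP_PROTOCOL_VERSION_61;
	return true;
}

/*
 * Walk the table from newest to oldest; return the first version the
 * host accepts (modeled here as "anything <= host_max"), or 0 standing
 * in for -EPROTO when the isolation floor is reached first.
 */
static unsigned int negotiate(unsigned int host_max, bool isolated)
{
	for (size_t i = sizeof(ver_list) / sizeof(ver_list[0]); i-- > 0; ) {
		if (!nvsp_is_valid_version(ver_list[i], isolated))
			return 0;	/* would fall below the floor */
		if (ver_list[i] <= host_max)
			return ver_list[i];
	}
	return 0;
}
```

With this ordering, an isolated guest talking to a host that tops out at NVSP 5 fails the connect instead of silently negotiating a pre-6.1 protocol.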
[PATCH 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..dc94e95c57b98 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.hypercalls_features & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..0d4aaf6694d01 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.hypercalls_features = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.hypercalls_features & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: GroupA 0x%x, GroupB 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- 
a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c5779968417
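The type extraction in hv_get_isolation_type() is a plain bitfield read of EBX from the Isolation Configuration leaf. A self-contained sketch, using simplified userspace stand-ins for the kernel's GENMASK()/FIELD_GET() helpers (the mask values follow the hunk above; the helper definitions themselves are reimplementations for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's GENMASK()/FIELD_GET(). */
#define GENMASK(h, l)		((~0u << (l)) & (~0u >> (31 - (h))))
#define FIELD_GET(mask, reg)	(((reg) & (mask)) / ((mask) & -(mask)))

/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits, as in the patch. */
#define HV_ISOLATION_TYPE		GENMASK(3, 0)
#define HV_SHARED_GPA_BOUNDARY_ACTIVE	(1u << 5)
#define HV_SHARED_GPA_BOUNDARY_BITS	GENMASK(11, 6)

enum hv_isolation_type {
	HV_ISOLATION_TYPE_NONE = 0,
	HV_ISOLATION_TYPE_VBS  = 1,
	HV_ISOLATION_TYPE_SNP  = 2,
};

/*
 * hv_leaf_present models the HV_ISOLATION feature bit (bit 22 of the
 * Group B features); without it the leaf must not be consulted at all.
 */
static enum hv_isolation_type isolation_type(uint32_t config_b,
					     bool hv_leaf_present)
{
	if (!hv_leaf_present)
		return HV_ISOLATION_TYPE_NONE;
	return (enum hv_isolation_type)FIELD_GET(HV_ISOLATION_TYPE, config_b);
}
```

Note that only bits 3:0 are consumed; the shared-GPA-boundary fields in the same register are extracted the same way with their own masks.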
[PATCH 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host to be 5.2 or greater if the guest is running isolated. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 11170d9a2e1a5..bcf4d7def6838 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -66,6 +66,13 @@ module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, "Maximal VMBus protocol version which can be negotiated"); +static bool vmbus_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= VERSION_WIN10_V5_2; + return true; +} + int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version) { int ret = 0; @@ -233,6 +240,12 @@ int vmbus_connect(void) goto cleanup; version = vmbus_versions[i]; + + if (!vmbus_is_valid_version(version)) { + ret = -EINVAL; + goto cleanup; + } + if (version > max_version) continue; -- 2.25.1
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sun, Jan 17, 2021 at 03:10:32PM +, Wei Liu wrote: > On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > > For additional robustness in the face of Hyper-V errors or malicious > > > > behavior, validate all values that originate from packets that Hyper-V > > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > > off the end of an array, or subvert an existing validation via integer > > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > > memory that has not been zeroed out. > > > > > > > > Reported-by: Juan Vazquez > > > > Signed-off-by: Andrea Parri (Microsoft) > > > > Cc: "David S. Miller" > > > > Cc: Jakub Kicinski > > > > Cc: Alexei Starovoitov > > > > Cc: Daniel Borkmann > > > > Cc: Andrii Nakryiko > > > > Cc: Martin KaFai Lau > > > > Cc: Song Liu > > > > Cc: Yonghong Song > > > > Cc: John Fastabend > > > > Cc: KP Singh > > > > Cc: net...@vger.kernel.org > > > > Cc: b...@vger.kernel.org > > > > --- > > > > Applies to 5.11-rc3 (and hyperv-next). > > > > > > So this is for hyperv-next or should we take it via netdev trees? > > > > No preference, either way is good for me. > > To be clear: There is no dependency on any patch in hyperv-next, right? > > That's my understanding, but I would like to confirm it. Well, I wrote that this *applies* to hyperv-next... but that's indeed the only 'dependency' I can think of. Hope this helps. Thanks, Andrea
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > For additional robustness in the face of Hyper-V errors or malicious > > behavior, validate all values that originate from packets that Hyper-V > > has sent to the guest. Ensure that invalid values cannot cause indexing > > off the end of an array, or subvert an existing validation via integer > > overflow. Ensure that outgoing packets do not have any leftover guest > > memory that has not been zeroed out. > > > > Reported-by: Juan Vazquez > > Signed-off-by: Andrea Parri (Microsoft) > > Cc: "David S. Miller" > > Cc: Jakub Kicinski > > Cc: Alexei Starovoitov > > Cc: Daniel Borkmann > > Cc: Andrii Nakryiko > > Cc: Martin KaFai Lau > > Cc: Song Liu > > Cc: Yonghong Song > > Cc: John Fastabend > > Cc: KP Singh > > Cc: net...@vger.kernel.org > > Cc: b...@vger.kernel.org > > --- > > Applies to 5.11-rc3 (and hyperv-next). > > So this is for hyperv-next or should we take it via netdev trees? No preference, either way is good for me. Thanks, Andrea
[PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
For additional robustness in the face of Hyper-V errors or malicious behavior, validate all values that originate from packets that Hyper-V has sent to the guest. Ensure that invalid values cannot cause indexing off the end of an array, or subvert an existing validation via integer overflow. Ensure that outgoing packets do not have any leftover guest memory that has not been zeroed out. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Andrii Nakryiko Cc: Martin KaFai Lau Cc: Song Liu Cc: Yonghong Song Cc: John Fastabend Cc: KP Singh Cc: net...@vger.kernel.org Cc: b...@vger.kernel.org --- Applies to 5.11-rc3 (and hyperv-next). Changes since v1 (Juan Vazquez): - Improve validation in rndis_set_link_state() and rndis_get_ppi() - Remove memory/skb leak in netvsc_alloc_recv_skb() drivers/net/hyperv/netvsc.c | 3 +- drivers/net/hyperv/netvsc_bpf.c | 6 ++ drivers/net/hyperv/netvsc_drv.c | 18 +++- drivers/net/hyperv/rndis_filter.c | 171 +++--- 4 files changed, 136 insertions(+), 62 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..d9324961e0d64 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -887,6 +887,7 @@ static inline int netvsc_send_pkt( int ret; u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound); + memset(, 0, sizeof(struct nvsp_message)); nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; if (skb) rpkt->channel_type = 0; /* 0 is RMC_DATA */ @@ -1306,7 +1307,7 @@ static void netvsc_send_table(struct net_device *ndev, sizeof(union nvsp_6_message_uber); /* Boundary check for all versions */ - if (offset > msglen - count * sizeof(u32)) { + if (msglen < count * sizeof(u32) || offset > msglen - count * sizeof(u32)) { netdev_err(ndev, "Received send-table offset too big:%u\n", offset); return; diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c 
index 440486d9c999e..11f0588a88843 100644 --- a/drivers/net/hyperv/netvsc_bpf.c +++ b/drivers/net/hyperv/netvsc_bpf.c @@ -37,6 +37,12 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan, if (!prog) goto out; + /* Ensure that the below memcpy() won't overflow the page buffer. */ + if (len > ndev->mtu + ETH_HLEN) { + act = XDP_DROP; + goto out; + } + /* allocate page buffer for data */ page = alloc_page(GFP_ATOMIC); if (!page) { diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index f32f28311d573..e5501c1a0cbd4 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -760,6 +760,16 @@ void netvsc_linkstatus_callback(struct net_device *net, if (indicate->status == RNDIS_STATUS_LINK_SPEED_CHANGE) { u32 speed; + /* Validate status_buf_offset */ + if (indicate->status_buflen < sizeof(speed) || + indicate->status_buf_offset < sizeof(*indicate) || + resp->msg_len - RNDIS_HEADER_SIZE < indicate->status_buf_offset || + resp->msg_len - RNDIS_HEADER_SIZE - indicate->status_buf_offset + < indicate->status_buflen) { + netdev_err(net, "invalid rndis_indicate_status packet\n"); + return; + } + speed = *(u32 *)((void *)indicate + indicate->status_buf_offset) / 1; ndev_ctx->speed = speed; @@ -865,8 +875,14 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net, */ if (csum_info && csum_info->receive.ip_checksum_value_invalid && csum_info->receive.ip_checksum_succeeded && - skb->protocol == htons(ETH_P_IP)) + skb->protocol == htons(ETH_P_IP)) { + /* Check that there is enough space to hold the IP header. */ + if (skb_headlen(skb) < sizeof(struct iphdr)) { + kfree_skb(skb); + return NULL; + } netvsc_comp_ipcsum(skb); + } /* Do L4 checksum offload if enabled and present. 
*/ if (csum_info && (net->features & NETIF_F_RXCSUM)) { diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 7e6dee2f02a43..68091a9a5070d 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -131,6
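The netvsc_linkstatus_callback() hunk above orders its comparisons so that no u32 expression can wrap: a naive `status_buf_offset + status_buflen > msg_len` test is defeated by a host that sends an offset near UINT32_MAX. A hedged userspace sketch of the same pattern (the RNDIS_HEADER_SIZE value and the struct-size parameter are illustrative placeholders, not the real layout):

```c
#include <stdbool.h>
#include <stdint.h>

#define RNDIS_HEADER_SIZE 8u	/* illustrative value for the sketch */

/*
 * Check that [buf_offset, buf_offset + buflen) lies inside the payload
 * that follows the RNDIS header of a msg_len-byte message, using only
 * subtractions whose operands were already compared, so nothing wraps.
 */
static bool status_buf_in_bounds(uint32_t msg_len, uint32_t struct_size,
				 uint32_t buf_offset, uint32_t buflen)
{
	if (msg_len < RNDIS_HEADER_SIZE)
		return false;			/* runt message */
	if (buf_offset < struct_size)
		return false;			/* buffer overlaps the struct */
	if (msg_len - RNDIS_HEADER_SIZE < buf_offset)
		return false;			/* offset beyond the payload */
	if (msg_len - RNDIS_HEADER_SIZE - buf_offset < buflen)
		return false;			/* length spills past the end */
	return true;
}

/* The naive check, shown only to demonstrate the wraparound. */
static bool naive_in_bounds(uint32_t msg_len, uint32_t buf_offset,
			    uint32_t buflen)
{
	return buf_offset + buflen <= msg_len;	/* wraps for huge offsets */
}
```

The subtraction-ordered variant rejects a hostile `buf_offset = 0xffffffff`, while the naive sum wraps around to a tiny value and accepts it.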
[PATCH] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95dbc911..83c98dae74a6 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index f1f96d4d8cd6..5997ec0b4b17 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index 
ccd32877a3c4..c7cf79be7231 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -36,14 +36,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(.L_restore) popq %r11 popq %r10 @@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore) ret _ASM_NOKPROBE(.L_restore) SYM_CODE_END(.L_restore) -#endif -- 2.29.2
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
> wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
>
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect. mprotect in parallel with a read or recvmsg may be undefined, so I didn't bring it up, but it was pretty clear. The moment the write bit is cleared (no matter why and by whom) and the PG lock released, if there's any GUP pin, GUP currently loses synchrony.

In any case I intended to help exercise the new page_count logic with the testcase, possibly to make it behave better somehow, no matter how. I admit I'm also wondering myself about the exact semantics of O_DIRECT on clear_refs or uffd-wp tracking, but the point is that losing reads and getting unexpected data in the page still doesn't look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that has become impossible with page_count isn't even clear_refs nor uffd-wp. The useful case in which I can see zero fundamental flaws is an RDMA or some other device computing in pure readonly DMA on the data while a program runs normally and produces it. It could even be a framebuffer that doesn't care about coherency. You may want to occasionally wrprotect the memory under a readonly long term GUP pin for consistency, even against bugs of the program itself. Why should wrprotecting make the device lose synchrony? And what kind of performance do we gain for the normal, useful cases by breaking the special case? Is there a benchmark showing it?
> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree; even then it still looks more difficult than using page_mapcount in do_wp_page in my view, so I also don't see the simplification. And overall the amount of kernel code saw a net increase as a result.

Thanks,
Andrea
Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse
Hello,

On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described

I updated the above partly quoted sentence since in the previous version it didn't have full accuracy:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a

"However since a single transient GUP pin on a tail page, would elevate
 the page_count for all other tail pages (unlike the mapcount which is
 subpage granular), the COW page reuse inaccuracy would then cross
 different vmas and the effect would happen at a distance in vma of
 different processes. A single GUP pin taken on a subpage mapped in a
 different process could trigger 511 false positive COWs copies in the
 local process, after a fork()."

This is a best effort to try to document all side effects, but it'd be great to hear from Kirill too on the above detail to have confirmation.

Thanks and have a great weekend everyone,
Andrea
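The 511-false-COWs figure in the quoted commit text follows directly from the granularity mismatch: a GUP pin is taken on one 4k subpage, but the elevated count is visible through every subpage of the 2M compound page, while mapcounts stay per-subpage. A toy model (deliberately not kernel code) that reproduces the arithmetic:

```c
#define HPAGE_NR 512	/* 4k subpages in a 2M THP */

/* Toy THP: one shared pin count, subpage-granular mapcounts. */
struct toy_thp {
	int compound_pins;	/* elevated by a pin on ANY subpage */
	int pinned[HPAGE_NR];	/* which subpage actually holds the pin */
	int mapcount[HPAGE_NR];
};

static void toy_thp_init(struct toy_thp *t)
{
	t->compound_pins = 0;
	for (int i = 0; i < HPAGE_NR; i++) {
		t->pinned[i] = 0;
		t->mapcount[i] = 1;	/* each subpage mapped exactly once */
	}
}

static void pin_subpage(struct toy_thp *t, int idx)
{
	t->pinned[idx]++;
	t->compound_pins++;	/* the count all 512 subpages will see */
}

/*
 * Subpages that a compound-count-based reuse test would copy on write
 * even though nothing pins them and their own mapcount is 1.
 */
static int false_positive_cows(const struct toy_thp *t)
{
	int n = 0;
	for (int i = 0; i < HPAGE_NR; i++)
		if (t->mapcount[i] == 1 && !t->pinned[i] && t->compound_pins > 0)
			n++;
	return n;
}
```

One transient pin on subpage 0 makes the other 511 subpages look pinned to any check that consults the shared compound count.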
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) ==
> VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long
> addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking page_mapcount above as an optimization, to me it looks much simpler to check it in a single place, in do_wp_page, which, in addition to optimizing away the superfluous copy, would optimize away the above complexity as well.

And I won't comment on whether it's actually safe to skip random pages or not. All I know is that for mprotect and uffd-wp, the above approach definitely wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned  450 include/linux/mm_types.h  * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c             atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c                  atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c                  atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned  819 mm/memory.c               if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits of an atomic_t in mm_struct just to call atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED that can already be set atomically under mmap_read_lock too? There are bits left free there; we haven't run out yet, so that doesn't justify wasting another 31 bits. I hope I'm overlooking something.
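The MMF_HAS_PINNED suggestion above boils down to a set_bit()/test_bit() pair on the existing mm->flags word instead of a dedicated 32-bit atomic_t that only ever holds 0 or 1. A userspace sketch with C11 atomics (the bit number and the struct are hypothetical, picked only for the illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MMF_HAS_PINNED_BIT 28	/* hypothetical free bit in mm->flags */
#define MMF_HAS_PINNED (1ul << MMF_HAS_PINNED_BIT)

struct toy_mm {
	atomic_ulong flags;	/* stand-in for mm_struct.flags */
};

/* set_bit()-style: idempotent, safe to call concurrently under a read lock. */
static void mm_set_has_pinned(struct toy_mm *mm)
{
	atomic_fetch_or_explicit(&mm->flags, MMF_HAS_PINNED,
				 memory_order_relaxed);
}

static bool mm_has_pinned(struct toy_mm *mm)
{
	return atomic_load_explicit(&mm->flags, memory_order_relaxed)
		& MMF_HAS_PINNED;
}
```

The atomic OR leaves the other flag bits untouched, which is exactly why a flags word can absorb the boolean without a new field.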
The existence of FOLL_LONGTERM is good and makes a difference at times for writeback if it's on a MAP_SHARED, or it makes a difference during GUP to do a page_migrate before taking the pin, but for the whole rest of the VM it's irrelevant whether it's long or short term, so I'm also concerned by what Jason mentioned about long term pins being treated differently within the VM. That to me looks fundamentally as flawed as introducing inaccuracies in do_wp_page, which, even ignoring the performance implications caused by the inaccuracy, simply prevents doing some useful things.

I obviously agree all common workloads with no GUP pins and no clear_refs and no uffd are way more important to be optimal, but I haven't seen a single benchmark showing the benefit of not taking the PG_lock before a page copy, or any other runtime benefit coming from page_count in do_wp_page. To the contrary, now I see additional branches in fork and in various other paths.

The only thing, again, where I see page_count providing a tangible benefit is the vmsplice attack from child to parent, but that has not been fully fixed, and if page_count is added to fix it in all COW faults, it'll introduce extra inefficiency to the very common, important workloads, not only to the special GUP/clear_refs/uffd-wp workloads, as your patch above shows.

Thanks,
Andrea
[PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Andrew and everyone,

Once we agree that COW page reuse requires full accuracy, the next step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to return to going in that direction.

Who is going to orthogonally secure vmsplice, Andy, Jason, Jens? Once vmsplice is secured from taking unconstrained unprivileged long term GUP pins, the attack from child to parent can still happen in theory, but statistically speaking, once that huge window is closed, it won't be a practical concern, so it'll give us time to perfect the full solution by closing all windows in the VM core. vmsplice has to be orthogonally fixed anyway, even if all windows were closed in the VM core first.

Unfortunately it's still not clear exactly what failed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f, but the whole point is that we need to discuss that together.

Thanks,
Andrea

// SPDX-License-Identifier: GPL-3.0-or-later
/*
 * reproducer for v5.11 O_DIRECT mm corruption with page_count
 * instead of mapcount in do_wp_page.
 *
 * Copyright (C) 2021 Red Hat, Inc.
 *
 * gcc -O2 -o page_count_do_wp_page page_count_do_wp_page.c -lpthread
 * page_count_do_wp_page ./whateverfile
 *
 * NOTE: CONFIG_SOFT_DIRTY=y is required in the kernel config.
 */

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include

#define PAGE_SIZE (1UL<<12)
#define HARDBLKSIZE 512

static void* writer(void *_mem)
{
	char *mem = (char *)_mem;
	for(;;) {
		usleep(random() % 1000);
		mem[PAGE_SIZE-1] = 0;
	}
}

static void* background_soft_dirty(void *data)
{
	long fd = (long) data;
	for (;;)
		if (write(fd, "4", 1) != 1)
			perror("write soft dirty"), exit(1);
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		printf("%s ", argv[0]), exit(1);
	char path[PAGE_SIZE];
	strcpy(path, "/proc/");
	sprintf(path + strlen(path), "%d", getpid());
	strcat(path, "/clear_refs");
	long soft_dirty_fd = open(path, O_WRONLY);
	if (soft_dirty_fd < 0)
		perror("open clear_refs"), exit(1);
	char *mem;
	if (posix_memalign((void **)&mem, PAGE_SIZE, PAGE_SIZE*3))
		perror("posix_memalign"), exit(1);
	/* THP is not using page_count so it would not corrupt memory */
	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE))
		perror("madvise"), exit(1);
	bzero(mem, PAGE_SIZE * 3);
	memset(mem + PAGE_SIZE * 2, 0xff, HARDBLKSIZE);
	/*
	 * This is not specific to O_DIRECT. Even if O_DIRECT was
	 * forced to use PAGE_SIZE minimum granularity for reads, a
	 * recvmsg would create the same issue since it also uses
	 * iov_iter_get_pages internally to create transient GUP pins
	 * on anon memory.
	 */
	int fd = open(argv[1], O_DIRECT|O_CREAT|O_RDWR|O_TRUNC, 0600);
	if (fd < 0)
		perror("open"), exit(1);
	if (write(fd, mem, PAGE_SIZE) != PAGE_SIZE)
		perror("write"), exit(1);
	pthread_t soft_dirty;
	if (pthread_create(&soft_dirty, NULL, background_soft_dirty,
			   (void *)soft_dirty_fd))
		perror("soft_dirty"), exit(1);
	pthread_t thread;
	if (pthread_create(&thread, NULL, writer, mem))
		perror("pthread_create"), exit(1);
	bool skip_memset = true;
	while (1) {
		if (pread(fd, mem, HARDBLKSIZE, 0) != HARDBLKSIZE)
			perror("read"), exit(1);
		if (memcmp(mem, mem+PAGE_SIZE, HARDBLKSIZE)) {
			if (memcmp(mem, mem+PAGE_SIZE*2, PAGE_SIZE)) {
				if (skip_memset)
					printf("unexpected memory "
					       "corruption detected\n");
				else
					printf("memory corruption detected, "
					       "dumping page\n");
				int end = PAGE_SIZE;
				if (!memcmp(mem+HARDBLKSIZE, mem+PAGE_SIZE,
					    PAGE_SIZE-HARDBLKSIZE))
					end = HARDBLKSIZE;
				for (int i = 0; i < end; i++)
					printf("%x", mem[i]);
				printf("\n");
			} else
				printf("memory corruption detected\n");
		}
		skip_memset = !skip_memset;
		if (!skip_memset)
			memset(me
[PATCH 1/1] mm: restore full accuracy in COW page reuse
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.

The example below uses O_DIRECT, on a process under clear_refs. There's no FOLL_LONGTERM involvement. All GUP pins are transient. fork is never called and, even when clear_refs is cleared by an external program, fork() would not be involved.

  thread0                      thread1              other process (or thread 3)
  =======                      =======              ============================
  read syscall
  O_DIRECT read DMA
  to vaddr+0 len = 512 bytes
  GUP(FOLL_WRITE)
  DMA writing to RAM started
                                                    clear_refs
                                                    pte_wrprotect
                               write vaddrA+512
                               page_count == 2
                               wp_copy_page
  read syscall returns
  read lost

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE granularity is an ABI break by itself, this is of course not just about O_DIRECT. recvmsg, kernel TLS and plenty of other GUP FOLL_WRITE iov_iter_get_pages users would write to the memory with sub-PAGE_SIZE granularity, depending on the msghdr iov tiny buffers. Those recvmsg reads are already silently lost too, just as the above O_DIRECT reads already get lost.

The fact COW must not happen too much is documented in commit 6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

Quote from the Link tagged:

==
In other words, the page_count check in do_wp_page, extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
==

In addition FOLL_LONGTERM breaks for readonly DMA, despite FOLL_LONGTERM pages now being treated specially and copied in fork(). FOLL_LONGTERM is in no way special with regard to the VM core and it can't be.
From the VM standpoint all GUP pins are equal, and breaking one means breaking them all the same. It is not possible to resolve this by looking only at FOLL_LONGTERM: that only hides the problem, but it doesn't fully resolve it, as shown above.

In addition this avoids storms of extra false positive COWs if the THP are shared by different processes, which cause extra overhead, but the several performance considerations are only a secondary concern.

After this commit, the copy in fork() for FOLL_LONGTERM also must be extended to all GUP pins including transient pins, or we shouldn't even copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any differently from a transient GUP pin is a sign that something is fundamentally flawed. FOLL_LONGTERM can be treated differently in writeback and during GUP runtime itself, but not in the VM core when deciding when to COW or not to COW an anonymous page.

This revert causes no theoretical security regression because THP is not yet using page_count in do_huge_pmd_wp_page. The alternative to this patch would be to replace the mapcount with the page_count in do_huge_pmd_wp_page too, in order to really close the vmsplice attack from child to parent that way. However, since a single transient GUP pin on a tail page would elevate the page_count for all other tail pages (unlike the mapcount, which is subpage granular), the above "fork-less" race would span over a HPAGE_PMD_SIZE range; it would even cross different vmas and the effect would happen at a distance in vmas of different processes, allowing a child to corrupt memory in the parent. That's a problem that could happen not-maliciously too. So the scenario described above, if extended to the new THP refcounting, would be more concerning than the vmsplice issue, which can be tackled first in vmsplice itself, and only at a second stage in the COW code.
Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com Cc: sta...@kernel.org Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Signed-off-by: Andrea Arcangeli --- include/linux/ksm.h | 7 ++ mm/ksm.c| 25 +++ mm/memory.c | 59 - 3 files changed, 74 insertions(+), 17 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 161e8164abcf..e48b1e453ff5 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page, void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); void ksm_migrate_page(struct page *newpage, struct page *oldpage);
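The disagreement in the revert text reduces to which counter do_wp_page consults before reusing an anonymous page. A toy contrast of the two policies (deliberately simplified; the real reuse path also takes the page lock, handles swap-cache references, and so on):

```c
#include <stdbool.h>

struct toy_page {
	int refcount;	/* page_count(): mappings + GUP pins + other refs */
	int mapcount;	/* page_mapcount(): page-table mappings only */
};

/* Policy after 09854ba94c6a: reuse only when we hold the sole reference. */
static bool reuse_by_refcount(const struct toy_page *p)
{
	return p->refcount == 1;
}

/* Pre-09854ba94c6a flavor: reuse when this is the only mapping. */
static bool reuse_by_mapcount(const struct toy_page *p)
{
	return p->mapcount == 1;
}
```

With a transient O_DIRECT read pin in flight (refcount 2, mapcount 1), the refcount policy copies the page and the DMA completes into the abandoned copy — the lost-read scenario in the commit message above — while the mapcount policy reuses the page in place.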
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
and mmu notifier, is if the hardware is capable of handling it. There is no real difference other than that.

> mmu notifier users who are using hmm_range_fault() do not ever take any
> page references when doing their work, that seems like the right
> approach, for a shadow mmu?

They can all do it like HMM, or you can take the FOLL_GET as long as you remember put_page. Jerome also intended to optimize the KVM fault like that, but as said above, we were NAKed on that attempt. Whether there is the pin or not makes zero semantic difference: it's purely an optimization when there is no pin, and it's a bugcheck safety feature if there is the pin. By the time it can make a runtime difference whether there is the pin or not, put_page has been called already.

> > Nothing at all can go wrong, unless wp_copy_page suddenly makes the
> > secondary MMU go out of sync the moment you wrprotect the page with
> > clear_refs.
>
> To be honest, I've read most of this discussion, and the prior one,
> between you and Linus carefully, but I still don't understand what
> clear_refs is about or how KVM's use of mmu notifiers got broken. This
> is probably because I'm only a little familiar with those areas :\

KVM's use of mmu notifier is not related to this. clear_refs can simply wrprotect the page, of any process: echo .. >/proc/self/clear_refs. Then you check in /proc/self/pagemap looking for soft dirty (or something like that). The point is that if you do echo ... >/proc/self/clear_refs on your pid, and that pid has any FOLL_LONGTERM pin on its mm, it'll just cause your device driver to go out of sync with the mm. It'll see the old pages, before the spurious COWs. The CPU will use new pages (the spurious COWs).

> Is it actually broken or just inefficient? If wp_copy_page is going
> more often than it should the secondary mmu should still fully track
> that?

It's about the DMA going out of sync and losing the view of the mm. In addition the TLB flush broke with the mmu_read_lock, but that can be fixed somehow.
The TLB flush, still only because of the spurious COWs, has now to cope with the fact that there can be a spurious wp_page_copy right after wrprotecting a read-write page. Before, that couldn't happen: fork() couldn't run since it takes mmap_write_lock, so if the pte was writable and transitioned to non-writable it'd mean it was an exclusive page and it was guaranteed to be re-used, so the stale TLB would keep writing in place.

The stale TLB is the exact equivalent of your FOLL_LONGTERM, except it's the window the CPU has on the old page, while FOLL_LONGTERM is the window the PCI device has on the old page. The spurious COW is what makes the TLB and the PCI device go out of sync, reading and writing the old page while the CPU moved on to a new page. The issue is really similar.

> To be clear, here I am only talking about pin_user_pages. We now have
> logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not,
> and that is what is driving the copy on fork logic.

fork() wrprotects, and like every other wrprotect it was just falling into the above scenario.

> secondary-mmu drivers using mmu notifier should not trigger this logic
> and should not restrict write protect.

That's a great point. I didn't think of the fact that the mmu notifier will invalidate the secondary MMU and ultimately issue a GUP after the wp_copy_page to keep it in sync. The funny thing that doesn't make sense is that wp_copy_page will only be invoked because the PIN was left by KVM on the page for that extra safety I was talking about earlier. Are we forced to drop all the page pins to be able to wrprotect the memory without being flooded by immediate COWs?

So the ultimate breakpoint is the FOLL_LONGTERM with no mmu notifier, which goes out of sync on a wrprotect, and that can happen if the device is doing a readonly access long term.
I quote you earlier:

"A long term page pin must use pin_user_pages(), and either
FOLL_LONGTERM|FOLL_WRITE (write mode)
FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)"

You clearly contemplate the existence of a read mode, long term. That is also completely compatible with wrprotection. Why should we pick a model that forbids this from working? What do we get back from it? I only see unnecessary risk and inefficiencies coming from it.

> > Ultimately, what do we really gain from all this breakage?
>
> Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very
> positive for DMA drivers working in that area.

I was referring to page_count in do_wp_page, not pin_user_pages, sorry for the confusion.

Thanks,
Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
>
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and uffd-wp, so I didn't actually try to fix the hugepage case by adding the page_count checks there too. I could try to do that; at least it'd be consistent, but I'd still try to find an alternate solution later.

> > Irrelevant special case as in: long term GUP pin on the memory?
>
> Irrelevant special case in that
>
> (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
>
> and
>
> (b) it's not a performance issue either unless you can find a real
> load that does it.
>
> Hmm?

For (b), I don't have a hard time imagining `ps` hanging for seconds if clear_refs is touched on a 4T mm, but (b) is not the main concern. Having to rely on (a) is the main concern, and it's not about TLB flushes but the long term GUP pins.

Thanks,
Andrea
Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support
On Fri, Jan 08, 2021 at 10:55:05AM -0800, Andy Lutomirski wrote:
> The implementation was rather buggy. It unconditionally marked PTEs
> read-only, and it released the mmap lock before flushing the TLB, which
> could allow a racing CoW operation to falsely believe that the
> underlying memory was not writable.
>
> I can't find any users at all of this mechanism, so just remove it.

Reviewed-by: Andrea Arcangeli
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote:
> Can we just remove vmsplice() support? We could make it do a normal

The single case I've seen vmsplice used for so far that was really cool is localhost live migration of qemu. However, despite being really cool, it wasn't merged in the end, and I don't recall exactly why.

There are even more efficient (but slightly more complex) ways to do that than vmsplice: use MAP_SHARED gigapages, or MAP_SHARED tmpfs with THP opted-in on the tmpfs mount, as guest physical memory instead of anon memory, and find a way not to have it cleared by kexec, so you can also upgrade the host kernel and not just qemu. That is a far more optimal approach than pinning and moving all pages through the pipe while still having to pay a superfluous copy at the destination.

My guess why it's not popular, and I may be completely wrong on this since I basically never used vmsplice (other than to proof-of-concept DoS my phone to verify the long term GUP pin exploit works), is that vmsplice is a more efficient, but not the most efficient, option. Exactly like in the live migration case, it's always more efficient to share a tmpfs THP backed region and have true zero copy than to go through a pipe that still does one copy at the receiving end. It may also be simpler, and it's not dependent on obscure F_SETPIPE_SIZE tunings. So in the end it's still too slow for apps that require maximum performance, and not worth the extra work for those that don't. I love vmsplice conceptually, I'd just prefer that an luser cannot run it.

> copy, thereby getting rid of a fair amount of nastiness and potential
> attacks. Even ignoring issues relating to the length of time that the
> vmsplice reference is alive, we also have whatever problems could be
> caused by a malicious or misguided user vmsplice()ing some memory and
> then modifying it.

Sorry to ask, but I'm curious: what else goes wrong if the user modifies memory under a GUP pin from vmsplice? That's not obvious to see.
Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > The majority cannot be converted to notifiers because they are DMA
> > > based. Every one of those is an ABI for something, and does not expect
> > > extra privilege to function. It would be a major breaking change to
> > > have pin_user_pages require some cap.
> >
> > ... what makes them safe is to be transient GUP pin and not long
> > term.
> >
> > Please note the "long term" in the underlined line.
>
> Many of them are long term, though only 50 or so have been marked
> specifically with FOLL_LONGTERM. I don't see how we can make such a
> major ABI break.

io_uring is one of those indeed and I already flagged it. This isn't a black and white issue: kernel memory is also pinned, but it's not in movable pageblocks... How do you tell the VM in GUP to migrate memory to a non-movable pageblock before pinning it? Because that's what it should do to create less breakage.

For example the iommu obviously needs to be privileged; if your argument is that it's enough to use the right API to take long term pins unconstrained, that's not the case. Pins are pins and prevent moving or freeing the memory; their effect is the same and again worse than mlock on many levels.

(then I know on preempt-rt it should behave like a pin, and that's fine, you disable all features for such purpose there)

io_uring is fine in comparison to vmsplice, but still not perfect, because it does the RLIMIT_MEMLOCK accounting, but that is unfortunately tangibly unreliable since a pin can cost 2M or 1G (now 1G is basically privileged, so it doesn't hurt to get the accounting wrong in such a case, but it's still technically counting apples as oranges).

Maybe io_uring could keep not doing mmu notifier; I'd need more investigation to be sure, but what's the point of keeping it VM-breaking when it doesn't need to be? Is io_uring required to set up the ring at high frequency?
> Looking at it, vmsplice() is simply wrong. A long term page pin must
> use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode)
> FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)
>
> ie it must COW and it must reject cases that are not longterm safe,
> like DAX and CMA and so on.
>
> These are the well established rules, vmsplice does not get a pass

Where are the established rules written down? pin_user_pages.rst doesn't make any mention of FOLL_FORCE or FOLL_WRITE at all; mm.h, same thing.

In any case, the extra flags required with FOLL_LONGTERM should be implied by FOLL_LONGTERM itself once it enters the gup code, because it's not cool having to set FOLL_WRITE in all drivers for a GUP(write=0), let alone having to specify FOLL_FORCE for just a read. But this is going offtopic.

> simply because it is using the CPU to memory copy as its "DMA".

vmsplice can't find all the put_pages that may release the pages when the pipe is read, or it'd at least be able to do the unreliable RLIMIT_MEMLOCK accounting. I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is open, so if you don't mind, I'll link your review as confirmation.

> > speaking in practice. io_uring has similar concern but it can use mmu
> > notifier, so it can totally fix it and be 100% safe from this.
>
> IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

Right, it's one of those 50. FOLL_WRITE won't magically allow the memory to be swapped or migrated.

To make another example: a single unprivileged pin on the movable zone can break memhotunplug unless you use the mmu notifier. Every other advanced feature falls apart.

So again, if an unprivileged syscall allows only a very limited number of pages, and maybe it also checks whether it got a THP or a gigapage from the pin and sets its own limit, maybe again it's not a big concern.
vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344 13548 132186040  4 172 2052 9997  5  2 93  0  0
    -> vmsplice reproducer started here
 1  0 1074432 8538184 13548 132582000  0   0 1973 8838  4  3 93  0  0
 1  0 1074432 8538436 13548 132552400  0   0 1730 8168  4  2 94  0  0
 1  0 1074432 8539096 13556 132188000  0  72 1811 8640  3  2 95  0  0
 0  0 1074432 8539348 13564 132202800  0  36 1936 8684  4  2 95  0  0
    -> vmsplice killed here
 1  0 1074432 9586720 13564 132224800  0   0 1893 8514  4  2 94  0  0

That's ~1G that goes away for each task, and I didn't even check if it's all THP pages getting in there; the RSS is 3MB despite 1G being taken down in GUP pins with zero privilege:

 1512 pts/25   S    0:00   0   0 133147  3044  0.0 ./vmsplice

Again, memcg is robust so it's not a concern for the host; the attack remains contained in the per-memcg OOM kill.
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
>
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not? Because in current upstream it's not fixed when using the enterprise default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
>
> Instead, what matters is the *NORMAL* VM flow.
>
> Things like COW.
>
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
>
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
>
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
>
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory? Or irrelevant special case as in: causing a secondary MMU to hit silent data loss if a pte is ever wrprotected (arch code, vm86, whatever, all under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
>
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
>
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the code VM work well.
New special cases? Which new cases? There's nothing new here besides the zygote, which wasn't fully fixed with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, and it is actually the only new case I can imagine where page_count actually isn't a regression. All the old cases that you seem to refer to as irrelevant are in production in v4.18; I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with COWs happening in different processes would forever hit wp_copy_page since the page count is always > 1 despite mapcount being 1 for all subpages. A simple app doing fork/exec would forever copy all memory in the parent even after the exec is finished.

Thanks,
Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote:
> > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> > >
> > > > vmsplice syscall API is insecure allowing long term GUP PINs without
                                                  ^
> > > > privilege.
> > >
> > > Lots of places are relying on pin_user_pages long term pins of memory,
> > > and cannot be converted to notifiers.
> > >
> > > I don't think it is reasonable to just declare that insecure and
> > > requires privileges, it is a huge ABI break.
> >
> > Where's that ABI? Are there specs or a code example in kernel besides
> > vmsplice itself?
>
> If I understand you right, you are trying to say that the 193
> pin_user_pages() callers cannot exist as unpriv any more?

193, 1k, 1m, or their number in general, won't just make them safe...

> The majority cannot be converted to notifiers because they are DMA
> based. Every one of those is an ABI for something, and does not expect
> extra privilege to function. It would be a major breaking change to
> have pin_user_pages require some cap.

... what makes them safe is to be a transient GUP pin and not long term.

Please note the "long term" in the underlined line.

O_DIRECT is perfectly ok to be unprivileged, obviously. The VM can wait; eventually it goes away. Even a swapout is not an instant event and can be held off by any number of other things besides a transient GUP pin. It can be held off by PG_lock, just to make an example.

mlock however is long term, persistent; vmsplice takes persistent pins and can pin way too much memory for each mm, and that doesn't feel safe. The more places doing stuff like that, the more likely one causes a safety issue, not the other way around in fact.

> > The whole zygote issue wouldn't even register if the child had the
> > exact same credentials of the parent.
> > Problem is the child dropped
> > privileges and went with a luser id, that clearly cannot ptrace the
> > parent, and so if long term unprivileged GUP pins are gone from the
> > equation, what remains that the child can do is purely theoretical
> > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.
>
> Sorry, I'm not sure I've found a good explanation how ptrace and GUP
> are interacting to become a security problem.

ptrace is not involved. What I meant by mentioning ptrace is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such a case.

With O_DIRECT or any transient pin you will never munmap while the O_DIRECT is in flight; if you munmap, it's undefined what happens in such a case anyway. It is a theoretical security issue made practical by the vmsplice API, which allows enlarging the window to years of time (not guaranteed milliseconds) while waiting for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue in theory remains, but with vmsplice removed it becomes statistically irrelevant in practice. io_uring has a similar concern, but it can use the mmu notifier, so it can totally fix it and be 100% safe from this.

The scheduler disclosure date was 2020-08-25, so I can freely explain the case that motivated all these changes.

case A)

	if (!fork()) {
		// in child
		mmap one page
		vmsplice takes gup pin long term on such page
		munmap one page
		// mapcount == 1 (parent mm)
		// page_count == 2 (gup in child, and parent mm)
	} else {
		parent writes to the page
		// mapcount == 1, wp_page_reuse
	}

The parent did a COW with mapcount == 1, so the parent will take over a page that is still GUP pinned in the child. That's the security issue, because in this case the GUP pin is malicious.
Now imagine this case B)

	mmap one page
	RDMA or any secondary MMU takes a long term GUP pin
	munmap one page
	// mapcount == 1 (parent mm)
	// page_count == 2 (gup in RDMA, and parent mm)

How can the VM tell between the two different cases? It can't.

The current page_count check in do_wp_page treats both cases the same, and because page_count is 2 in both cases, it calls wp_page_copy in both, breaking COW in both. However, you know full well that in the second case it is a feature and not a bug that wp_page_reuse is called instead; in fact it has to be called, or it's a bug (and that's the bug the page_count check in do_wp_page introduces).

So page_count in do_wp_page is breaking all valid users to take care of a purely theoretical security issue that isn't a practical concern if only vmsplice is secured at least as well as mlock. page_count in do_wp_page is fundamentally flawed for all long term GUP pins done by secondary MMUs attach