Re: [PATCH 2/2] riscv: Fix text patching when IPI are used
On Wed, Feb 28, 2024 at 06:51:49PM +0100, Alexandre Ghiti wrote:
> For now, we use stop_machine() to patch the text and when we use IPIs for
> remote icache flushes (which is emitted in patch_text_nosync()), the system
> hangs.
>
> So instead, make sure every cpu executes the stop_machine() patching
> function and emit a local icache flush there.
>
> Co-developed-by: Björn Töpel
> Signed-off-by: Björn Töpel
> Signed-off-by: Alexandre Ghiti

Modulo the removal of the hunk discussed with Samuel,

Reviewed-by: Andrea Parri

Some nits / amendments to the inline comments below:

> +	/*
> +	 * Make sure the patching store is effective *before* we
> +	 * increment the counter which releases all waiting cpus
> +	 * by using the release version of atomic increment.
> +	 */

s/cpus/CPUs
s/release version/release variant

The comment could be amended with a description of the matching
barrier(s), say, "The release pairs with the call to
local_flush_icache_all() on the waiting CPU".  (Same for the comment in
patch_text_cb().)

  Andrea
Re: [PATCH 1/2] riscv: Remove superfluous smp_mb()
On Wed, Feb 28, 2024 at 06:51:48PM +0100, Alexandre Ghiti wrote:
> This memory barrier is not needed and not documented so simply remove
> it.
>
> Suggested-by: Andrea Parri
> Signed-off-by: Alexandre Ghiti

Reviewed-by: Andrea Parri

  Andrea
Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs
> I did not even think of that, and it actually makes sense so I'll go
> with what you propose: I'll replace atomic_inc() with
> atomic_inc_return_release().  And I'll add the following comment if
> that's ok with you:
>
> "Make sure the patching store is effective *before* we increment the
> counter which releases all waiting cpus"

Yes, this sounds good to me.

> Honestly, I looked at it one minute, did not understand its purpose
> and said to myself "ok that can't hurt anyway, I may be missing
> something".
>
> FWIW, I see that arm64 uses isb() here.  If you don't see its purpose,
> I'll remove it (here and where I copied it).

Removing the smp_mb() (and keeping the local_flush_icache_all()) seems
fine to me; thanks for the confirmation.

> > On a last topic, although somewhat orthogonal to the scope of this
> > patch, I'm not sure the patch_{map,unmap}() dance in our
> > patch_insn_write() is correct: I can see why we may want (need to do)
> > the local TLB flush before returning from patch_{map,unmap}(), but
> > does a local flush suffice?  For comparison, arm64 seems to go through
> > a complete dsb-tlbi-dsb(-isb) sequence in their unmapping stage (and
> > apparently relying on "no caching of invalid ptes" in their mapping
> > stage).  Of course, "broadcasting" our (riscv's) TLB invalidations
> > will necessarily introduce some complexity...
> >
> > Thoughts?
>
> To avoid remote TLBI, could we simply disable preemption before the
> first patch_map()?  arm64 disables the irqs, but that seems overkill
> to me, but maybe I'm missing something again?

Mmh, I'm afraid this will require more thinking/probing on my end (not
really "the expert" of the codebase at stake...).  Maybe the ftrace
reviewers will provide further ideas/suggestions for us to brainstorm.

  Andrea
Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs
> +static int __ftrace_modify_code(void *data)
> +{
> +	struct ftrace_modify_param *param = data;
> +
> +	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> +		ftrace_modify_all_code(param->command);
> +		atomic_inc(&param->cpu_count);

I stared at ftrace_modify_all_code() for a bit, but honestly I don't
see what prevents the ->cpu_count increment from being reordered before
the insn write(s) (architecturally) now that you have removed the IPI
dance: perhaps add an smp_wmb() right before the atomic_inc() (or
promote the latter to a (void)atomic_inc_return_release()) and/or an
inline comment saying why such reordering is not possible?

> +	} else {
> +		while (atomic_read(&param->cpu_count) <= num_online_cpus())
> +			cpu_relax();
> +		smp_mb();

I see that you've lifted/copied the memory barrier from patch_text_cb():
what's its point?  AFAIU, the barrier has no ordering effect on
program-order-later insn fetches; perhaps the code was based on some
legacy/old version of Zifencei?  IAC, comments, comments, ...  or maybe
just remove that memory barrier?

> +	}
> +
> +	local_flush_icache_all();
> +
> +	return 0;
> +}

[...]

> @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
>  	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>  		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
>  			len = GET_INSN_LENGTH(patch->insns[i]);
> -			ret = patch_text_nosync(patch->addr + i * len,
> -						&patch->insns[i], len);
> +			ret = patch_insn_write(patch->addr + i * len,
> +					       &patch->insns[i], len);
>  		}
>  		atomic_inc(&patch->cpu_count);
>  	} else {
> @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
>  		smp_mb();
>  	}
>
> +	local_flush_icache_all();
> +
>  	return ret;
>  }
>  NOKPROBE_SYMBOL(patch_text_cb);

My above remarks/questions also apply to this function.
On a last topic, although somewhat orthogonal to the scope of this
patch, I'm not sure the patch_{map,unmap}() dance in our
patch_insn_write() is correct: I can see why we may want (need to do)
the local TLB flush before returning from patch_{map,unmap}(), but does
a local flush suffice?  For comparison, arm64 seems to go through a
complete dsb-tlbi-dsb(-isb) sequence in their unmapping stage (and
apparently relying on "no caching of invalid ptes" in their mapping
stage).  Of course, "broadcasting" our (riscv's) TLB invalidations will
necessarily introduce some complexity...

Thoughts?

  Andrea
Re: [GIT PULL] Modules changes for v6.7-rc1
On Thu, Nov 02, 2023 at 08:29:17AM +0100, Andrea Righi wrote:
> On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> > On Wed, 1 Nov 2023 at 21:02, Linus Torvalds wrote:
> > >
> > > kmalloc() isn't just about "use physically contiguous allocations".
> > > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > > which has to play VM tricks.
> >
> > I've pulled this, but I think you should do something like the
> > attached (UNTESTED!) patch.
> >
> >                Linus
>
> Looks good to me, I'll give it a try ASAP.
>
> -Andrea

Just tested this both with zstd and gzip module compression, all good.
You can add my:

Tested-by: Andrea Righi

Or, if you need proper paperwork:

--
From: Andrea Righi
Subject: [PATCH] module/decompress: use kvmalloc() consistently

We consistently switched from kmalloc() to vmalloc() in module
decompression to prevent potential memory allocation failures with
large modules; however, vmalloc() is not as memory-efficient and fast
as kmalloc().

Since we don't know in general the size of the workspace required by
the decompression algorithm, it is more reasonable to use kvmalloc()
consistently, also considering that we don't have special memory
requirements here.
Signed-off-by: Linus Torvalds
Tested-by: Andrea Righi
Signed-off-by: Andrea Righi
---
 kernel/module/decompress.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
index 4156d59be440..474e68f0f063 100644
--- a/kernel/module/decompress.c
+++ b/kernel/module/decompress.c
@@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 	s.next_in = buf + gzip_hdr_len;
 	s.avail_in = size - gzip_hdr_len;
 
-	s.workspace = vmalloc(zlib_inflate_workspacesize());
+	s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
 	if (!s.workspace)
 		return -ENOMEM;
 
@@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 out_inflate_end:
 	zlib_inflateEnd(&s);
 out:
-	vfree(s.workspace);
+	kvfree(s.workspace);
 	return retval;
 }
 #elif defined(CONFIG_MODULE_COMPRESS_XZ)
@@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	}
 
 	wksp_size = zstd_dstream_workspace_bound(header.windowSize);
-	wksp = vmalloc(wksp_size);
+	wksp = kvmalloc(wksp_size, GFP_KERNEL);
 	if (!wksp) {
 		retval = -ENOMEM;
 		goto out;
 	}
@@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	retval = new_size;
 
 out:
-	vfree(wksp);
+	kvfree(wksp);
 	return retval;
 }
 #else
-- 
2.40.1
Re: [GIT PULL] Modules changes for v6.7-rc1
On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 21:02, Linus Torvalds wrote:
> >
> > kmalloc() isn't just about "use physically contiguous allocations".
> > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > which has to play VM tricks.
>
> I've pulled this, but I think you should do something like the
> attached (UNTESTED!) patch.
>
>                Linus

Looks good to me, I'll give it a try ASAP.

-Andrea

>  kernel/module/decompress.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
> index 4156d59be440..474e68f0f063 100644
> --- a/kernel/module/decompress.c
> +++ b/kernel/module/decompress.c
> @@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>  	s.next_in = buf + gzip_hdr_len;
>  	s.avail_in = size - gzip_hdr_len;
>
> -	s.workspace = vmalloc(zlib_inflate_workspacesize());
> +	s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
>  	if (!s.workspace)
>  		return -ENOMEM;
>
> @@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>  out_inflate_end:
>  	zlib_inflateEnd(&s);
>  out:
> -	vfree(s.workspace);
> +	kvfree(s.workspace);
>  	return retval;
>  }
>  #elif defined(CONFIG_MODULE_COMPRESS_XZ)
> @@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>  	}
>
>  	wksp_size = zstd_dstream_workspace_bound(header.windowSize);
> -	wksp = vmalloc(wksp_size);
> +	wksp = kvmalloc(wksp_size, GFP_KERNEL);
>  	if (!wksp) {
>  		retval = -ENOMEM;
>  		goto out;
>  	}
> @@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>  	retval = new_size;
>
>  out:
> -	vfree(wksp);
> +	kvfree(wksp);
>  	return retval;
>  }
>  #else
Re: [GIT PULL] Modules changes for v6.7-rc1
On Wed, Nov 01, 2023 at 09:02:51PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 10:13, Luis Chamberlain wrote:
> >
> > The only thing worth highlighting is that gzip moves to use vmalloc()
> > instead of kmalloc, just as we had a fix for this for zstd on v6.6-rc1.
>
> Actually, that's almost certainly entirely the wrong thing to do.
>
> Unless you *know* that the allocation is large, you shouldn't use
> vmalloc().  And since kmalloc() has worked fine, you most definitely
> don't know that.
>
> So we have 'kvmalloc()' *exactly* for this reason, which is a "use
> kmalloc, unless that is too small, then use vmalloc".
>
> kmalloc() isn't just about "use physically contiguous allocations".
> It's also more memory-efficient, and a *lot* faster than vmalloc(),
> which has to play VM tricks.
>
> So this "just switch to vmalloc()" is entirely wrong.
>
>                Linus

I proposed that change mostly for consistency with the zstd case, but I
haven't experienced any issue with gzip-compressed modules (which seem
to require less memory, even with larger modules).  So, yes, it
probably makes sense to drop this change for now, and I can send
another patch to switch to kvmalloc() for all the decompress cases.

Thanks,
-Andrea
[PATCH v2] Drivers: hv: vmbus: Initialize unload_event statically
If a malicious or compromised Hyper-V sends a spurious message of type
CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
call complete() on an uninitialized event, and cause an oops.

Reported-by: Michael Kelley
Signed-off-by: Andrea Parri (Microsoft)
---
Changes since v1[1]:
  - add inline comment in vmbus_unload_response()

[1] https://lkml.kernel.org/r/20210416143932.16512-1-parri.and...@gmail.com

 drivers/hv/channel_mgmt.c | 7 ++++++-
 drivers/hv/connection.c   | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 4c9e45d1f462c..335a10ee03a5e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -826,6 +826,11 @@ static void vmbus_unload_response(struct vmbus_channel_message_header *hdr)
 	/*
 	 * This is a global event; just wakeup the waiting thread.
 	 * Once we successfully unload, we can cleanup the monitor state.
+	 *
+	 * NB. A malicious or compromised Hyper-V could send a spurious
+	 * message of type CHANNELMSG_UNLOAD_RESPONSE, and trigger a call
+	 * of the complete() below.  Make sure that unload_event has been
+	 * initialized by the time this complete() is executed.
 	 */
 	complete(&vmbus_connection.unload_event);
 }
@@ -841,7 +846,7 @@ void vmbus_initiate_unload(bool crash)
 	if (vmbus_proto_version < VERSION_WIN8_1)
 		return;
 
-	init_completion(&vmbus_connection.unload_event);
+	reinit_completion(&vmbus_connection.unload_event);
 	memset(&hdr, 0, sizeof(struct vmbus_channel_message_header));
 	hdr.msgtype = CHANNELMSG_UNLOAD;
 	vmbus_post_msg(&hdr, sizeof(struct vmbus_channel_message_header),
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index dc19d5ae4373c..311cd005b3be6 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -26,6 +26,8 @@
 
 struct vmbus_connection vmbus_connection = {
 	.conn_state		= DISCONNECTED,
+	.unload_event		= COMPLETION_INITIALIZER(
+				  vmbus_connection.unload_event),
 	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
 
 	.ready_for_suspend_event = COMPLETION_INITIALIZER(
-- 
2.25.1
Re: [PATCH] Drivers: hv: vmbus: Initialize unload_event statically
On Fri, Apr 16, 2021 at 03:25:03PM +0000, Michael Kelley wrote:
> From: Andrea Parri (Microsoft)  Sent: Friday, April 16, 2021 7:40 AM
> >
> > If a malicious or compromised Hyper-V sends a spurious message of type
> > CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
> > call complete() on an uninitialized event, and cause an oops.
>
> Please leave a comment somewhere in the code itself that describes this
> scenario so that somebody in the future doesn't decide it's OK to
> "simplify" the initialization of unload_event.  :-)

Yes, will do for v2.

Thanks,
  Andrea
[PATCH] Drivers: hv: vmbus: Initialize unload_event statically
If a malicious or compromised Hyper-V sends a spurious message of type
CHANNELMSG_UNLOAD_RESPONSE, the function vmbus_unload_response() will
call complete() on an uninitialized event, and cause an oops.

Reported-by: Michael Kelley
Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/channel_mgmt.c | 2 +-
 drivers/hv/connection.c   | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index f3cf4af01e102..1efb616480a64 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -841,7 +841,7 @@ void vmbus_initiate_unload(bool crash)
 	if (vmbus_proto_version < VERSION_WIN8_1)
 		return;
 
-	init_completion(&vmbus_connection.unload_event);
+	reinit_completion(&vmbus_connection.unload_event);
 	memset(&hdr, 0, sizeof(struct vmbus_channel_message_header));
 	hdr.msgtype = CHANNELMSG_UNLOAD;
 	vmbus_post_msg(&hdr, sizeof(struct vmbus_channel_message_header),
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 350e8c5cafa8c..529dcc47f3e11 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -26,6 +26,8 @@
 
 struct vmbus_connection vmbus_connection = {
 	.conn_state		= DISCONNECTED,
+	.unload_event		= COMPLETION_INITIALIZER(
+				  vmbus_connection.unload_event),
 	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
 
 	.ready_for_suspend_event = COMPLETION_INITIALIZER(
-- 
2.25.1
[PATCH v3 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
Check that enough time has passed such that the modify channel message has been processed before taking a CPU offline. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/hv.c | 56 ++--- 1 file changed, 53 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 3e6ff83adff42..e0c522d143a37 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -292,12 +293,50 @@ void hv_synic_disable_regs(unsigned int cpu) disable_percpu_irq(vmbus_irq); } +#define HV_MAX_TRIES 3 +/* + * Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one + * bit set, then wait for a few milliseconds. Repeat these steps for a maximum of 3 times. + * Return 'true', if there is still any set bit after this operation; 'false', otherwise. + * + * If a bit is set, that means there is a pending channel interrupt. The expectation is + * that the normal interrupt handling mechanism will find and process the channel interrupt + * "very soon", and in the process clear the bit. 
+ */ +static bool hv_synic_event_pending(void) +{ + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); + union hv_synic_event_flags *event = + (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT; + unsigned long *recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ + bool pending; + u32 relid; + int tries = 0; + +retry: + pending = false; + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { + /* Special case - VMBus channel protocol messages */ + if (relid == 0) + continue; + pending = true; + break; + } + if (pending && tries++ < HV_MAX_TRIES) { + usleep_range(1, 2); + goto retry; + } + return pending; +} int hv_synic_cleanup(unsigned int cpu) { struct vmbus_channel *channel, *sc; bool channel_found = false; + if (vmbus_connection.conn_state != CONNECTED) + goto always_cleanup; + /* * Hyper-V does not provide a way to change the connect CPU once * it is set; we must prevent the connect CPU from going offline @@ -305,8 +344,7 @@ int hv_synic_cleanup(unsigned int cpu) * path where the vmbus is already disconnected, the CPU must be * allowed to shut down. */ - if (cpu == VMBUS_CONNECT_CPU && - vmbus_connection.conn_state == CONNECTED) + if (cpu == VMBUS_CONNECT_CPU) return -EBUSY; /* @@ -333,9 +371,21 @@ int hv_synic_cleanup(unsigned int cpu) } mutex_unlock(_connection.channel_mutex); - if (channel_found && vmbus_connection.conn_state == CONNECTED) + if (channel_found) + return -EBUSY; + + /* +* channel_found == false means that any channels that were previously +* assigned to the CPU have been reassigned elsewhere with a call of +* vmbus_send_modifychannel(). Scan the event flags page looking for +* bits that are set and waiting with a timeout for vmbus_chan_sched() +* to process such bits. If bits are still set after this operation +* and VMBus is connected, fail the CPU offlining operation. 
+*/ + if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending()) return -EBUSY; +always_cleanup: hv_stimer_legacy_cleanup(cpu); hv_synic_disable_regs(cpu); -- 2.25.1
[PATCH v3 2/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE message type, and code to receive and process such a message. Signed-off-by: Andrea Parri (Microsoft) Reviewed-by: Michael Kelley --- drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 11 - 5 files changed, 152 insertions(+), 19 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..aa4ef75d8dee2 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -209,31 +209,96 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id, } EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request); +static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel msg; + int ret; + + memset(, 0, sizeof(msg)); + msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg.child_relid = channel->offermsg.child_relid; + msg.target_vp = target_vp; + + ret = vmbus_post_msg(, sizeof(msg), true); + trace_vmbus_send_modifychannel(, ret); + + return ret; +} + +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel *msg; + struct vmbus_channel_msginfo *info; + unsigned long flags; + int ret; + + info = kzalloc(sizeof(struct vmbus_channel_msginfo) + + sizeof(struct vmbus_channel_modifychannel), + GFP_KERNEL); + if (!info) + return -ENOMEM; + + init_completion(>waitevent); + info->waiting_channel = channel; + + msg = (struct vmbus_channel_modifychannel *)info->msg; + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg->child_relid = channel->offermsg.child_relid; + msg->target_vp = target_vp; + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_add_tail(>msglistentry, _connection.chn_msg_list); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + ret = vmbus_post_msg(msg, sizeof(*msg), true); + trace_vmbus_send_modifychannel(msg, ret); + if (ret != 0) { 
+ spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + goto free_info; + } + + /* +* Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on +* the mutex and be unable to signal the completion. +* +* See the caller target_cpu_store() for information about the usage of the +* mutex. +*/ + mutex_unlock(_connection.channel_mutex); + wait_for_completion(>waitevent); + mutex_lock(_connection.channel_mutex); + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + if (info->response.modify_response.status) + ret = -EAGAIN; + +free_info: + kfree(info); + return ret; +} + /* * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt. * - * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not - * ACK such messages. IOW we can't know when the host will stop interrupting - * the "old" vCPU and start interrupting the "new" vCPU for the given channel. + * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. When VMbus version 5.3 + * or later is negotiated, Hyper-V always sends an ACK in response to such a + * message. For VMbus version 5.2 and earlier, it never sends an ACK. With- + * out an ACK, we can not know when the host will stop interrupting the "old" + * vCPU and start interrupting the "new" vCPU for the given channel. * * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version * VERSION_WIN10_V4_1. 
*/ -int vmbus_send_modifychannel(u32 child_relid, u32 target_vp) +int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp) { - struct vmbus_channel_modifychannel conn_msg; - int ret; - - memset(_msg, 0, sizeof(conn_msg)); - conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; - conn_msg.child_relid = child_relid; - conn_msg.target_vp = target_vp; - - ret = vmbus_post_msg(_msg, sizeof(conn_msg), true); - - trace_vmbus_send_modifychannel(_msg, ret); - - return ret; + if (vmbus_proto_version >= VERSION_WIN10_V5_3) + return send_modifychannel_with_ack(channel, target_vp); + return send_modifychannel_without_ack(channel, target_vp); } EXPORT_SYMBOL_GPL(vmbus_send_modifychannel); diff --git a/drivers/hv/ch
[PATCH v3 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
Hyper-V has added VMBus protocol version 5.3.  Allow Linux guests to
negotiate the new version on versions of Hyper-V that support it.

Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/connection.c | 3 ++-
 include/linux/hyperv.h  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 350e8c5cafa8c..dc19d5ae4373c 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -45,6 +45,7 @@ EXPORT_SYMBOL_GPL(vmbus_proto_version);
  * Table of VMBus versions listed from newest to oldest.
  */
 static __u32 vmbus_versions[] = {
+	VERSION_WIN10_V5_3,
 	VERSION_WIN10_V5_2,
 	VERSION_WIN10_V5_1,
 	VERSION_WIN10_V5,
@@ -60,7 +61,7 @@ static __u32 vmbus_versions[] = {
  * Maximal VMBus protocol version guests can negotiate.  Useful to cap the
  * VMBus version for testing and debugging purpose.
  */
-static uint max_version = VERSION_WIN10_V5_2;
+static uint max_version = VERSION_WIN10_V5_3;
 
 module_param(max_version, uint, S_IRUGO);
 MODULE_PARM_DESC(max_version,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2c18c8e768efe..3ce36bbb398e9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent(
  *     5 . 0  (Newer Windows 10)
  *     5 . 1  (Windows 10 RS4)
  *     5 . 2  (Windows Server 2019, RS5)
+ *     5 . 3  (Windows Server 2022)
  */
 
 #define VERSION_WS2008  ((0 << 16) | (13))
@@ -245,6 +246,7 @@ static inline u32 hv_get_avail_to_write_percent(
 #define VERSION_WIN10_V5    ((5 << 16) | (0))
 #define VERSION_WIN10_V5_1  ((5 << 16) | (1))
 #define VERSION_WIN10_V5_2  ((5 << 16) | (2))
+#define VERSION_WIN10_V5_3  ((5 << 16) | (3))
 
 /* Make maximum size of pipe payload of 16K */
 #define MAX_PIPE_DATA_PAYLOAD (sizeof(u8) * 16384)
-- 
2.25.1
[PATCH v3 0/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Changes since v2[1]:
  - fix VMbus protocol version name
  - add Reviewed-by: tag
  - refactor/simplify changes in hv_synic_cleanup()

Changes since v1[2]:
  - rebase on hyperv-next
  - split changes into three patches
  - fix error handling in send_modifychannel_with_ack()
  - remove rescind checks in send_modifychannel_with_ack()
  - remove unneeded test in hv_synic_event_pending()
  - add/amend inline comments
  - style changes

[1] https://lkml.kernel.org/r/20210414150118.2843-1-parri.and...@gmail.com
[2] https://lkml.kernel.org/r/20201126191210.13115-1-parri.and...@gmail.com

Andrea Parri (Microsoft) (3):
  Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
  Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
  Drivers: hv: vmbus: Check for pending channel interrupts before
    taking a CPU offline

 drivers/hv/channel.c      | 99
 drivers/hv/channel_mgmt.c | 42
 drivers/hv/connection.c   |  3
 drivers/hv/hv.c           | 56
 drivers/hv/hv_trace.h     | 15
 drivers/hv/vmbus_drv.c    |  4
 include/linux/hyperv.h    | 13
 7 files changed, 209 insertions(+), 23 deletions(-)

-- 
2.25.1
[RFC net-next v2] seg6: add counters support for SRv6 Behaviors
dev 2956,00 pps As can be observed, throughputs achieved in scenarios (2),(3) did not suffer any observable degradation compared to scenario (1). Comments, suggestions and improvements are very welcome! Thanks, Andrea v2: - improve comments; - guarantee alignment of 64 bit values, thanks to Jakub Kicinski; - pass counters within netlink attributes rather than passing a whole single structure. This is to address compatibility issues that might arise from having different versions of kernel/iproute2. Thanks to David Ahern for bringing attention to this. - include cover letter in the commit message, thanks to David Ahern. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer --- include/uapi/linux/seg6_local.h | 30 + net/ipv6/seg6_local.c | 198 +++- 2 files changed, 226 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h index 3b39ef1dbb46..5ae3ace84de0 100644 --- a/include/uapi/linux/seg6_local.h +++ b/include/uapi/linux/seg6_local.h @@ -27,6 +27,7 @@ enum { SEG6_LOCAL_OIF, SEG6_LOCAL_BPF, SEG6_LOCAL_VRFTABLE, + SEG6_LOCAL_COUNTERS, __SEG6_LOCAL_MAX, }; #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) @@ -78,4 +79,33 @@ enum { #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) +/* SRv6 Behavior counters are encoded as netlink attributes guaranteeing the + * correct alignment. + * Each counter is identified by a different attribute type (i.e. + * SEG6_LOCAL_CNT_PACKETS). + * + * - SEG6_LOCAL_CNT_PACKETS: identifies a counter that counts the number of + * packets that have been CORRECTLY processed by an SRv6 Behavior instance + * (i.e., packets that generate errors or are dropped are NOT counted). + * + * - SEG6_LOCAL_CNT_BYTES: identifies a counter that counts the total amount + * of traffic in bytes of all packets that have been CORRECTLY processed by + * an SRv6 Behavior instance (i.e., packets that generate errors or are + * dropped are NOT counted). 
+ * + * - SEG6_LOCAL_CNT_ERRORS: identifies a counter that counts the number of + * packets that have NOT been properly processed by an SRv6 Behavior instance + * (i.e., packets that generate errors or are dropped). + */ +enum { + SEG6_LOCAL_CNT_UNSPEC, + SEG6_LOCAL_CNT_PAD, /* pad for 64 bits values */ + SEG6_LOCAL_CNT_PACKETS, + SEG6_LOCAL_CNT_BYTES, + SEG6_LOCAL_CNT_ERRORS, + __SEG6_LOCAL_CNT_MAX, +}; + +#define SEG6_LOCAL_CNT_MAX (__SEG6_LOCAL_CNT_MAX - 1) + #endif diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index bd7140885e60..3e627cf9ce2a 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -93,6 +93,35 @@ struct seg6_end_dt_info { int hdrlen; }; +struct pcpu_seg6_local_counters { + u64_stats_t packets; + u64_stats_t bytes; + u64_stats_t errors; + + struct u64_stats_sync syncp; +}; + +/* This struct groups all the SRv6 Behavior counters supported so far. + * + * put_nla_counters() makes use of this data structure to collect all counter + * values after the per-CPU counter evaluation has been performed. + * Finally, each counter value (in seg6_local_counters) is stored in the + * corresponding netlink attribute and sent to user space. + * + * NB: we don't want to expose this structure to user space! 
+ */ +struct seg6_local_counters { + __u64 packets; + __u64 bytes; + __u64 errors; +}; + +#define seg6_local_alloc_pcpu_counters(__gfp) \ + __netdev_alloc_pcpu_stats(struct pcpu_seg6_local_counters, \ + ((__gfp) | __GFP_ZERO)) + +#define SEG6_F_LOCAL_COUNTERS SEG6_F_ATTR(SEG6_LOCAL_COUNTERS) + struct seg6_local_lwt { int action; struct ipv6_sr_hdr *srh; @@ -105,6 +134,7 @@ struct seg6_local_lwt { #ifdef CONFIG_NET_L3_MASTER_DEV struct seg6_end_dt_info dt_info; #endif + struct pcpu_seg6_local_counters __percpu *pcpu_counters; int headroom; struct seg6_action_desc *desc; @@ -878,36 +908,43 @@ static struct seg6_action_desc seg6_action_table[] = { { .action = SEG6_LOCAL_ACTION_END, .attrs = 0, + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end, }, { .action = SEG6_LOCAL_ACTION_END_X, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_t, }, { .action = SE
Re: [PATCH v2 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
> > @@ -336,6 +372,19 @@ int hv_synic_cleanup(unsigned int cpu)
> >  	if (channel_found && vmbus_connection.conn_state == CONNECTED)
> >  		return -EBUSY;
> >
> > +	if (vmbus_proto_version >= VERSION_WIN10_V4_1) {
> > +		/*
> > +		 * channel_found == false means that any channels that were
> > +		 * previously assigned to the CPU have been reassigned
> > +		 * elsewhere with a call of vmbus_send_modifychannel().
> > +		 * Scan the event flags page looking for bits that are set
> > +		 * and waiting with a timeout for vmbus_chan_sched() to
> > +		 * process such bits.  If bits are still set after this
> > +		 * operation and VMBus is connected, fail the CPU offlining
> > +		 * operation.
> > +		 */
> > +		if (hv_synic_event_pending() && vmbus_connection.conn_state == CONNECTED)
> > +			return -EBUSY;
> > +	}
> > +
>
> Perhaps the test for conn_state == CONNECTED could be factored out as
> follows.  If we're not CONNECTED (i.e., in the panic or kexec path), we
> should be able to also skip the search for channels that are bound to
> the CPU, since we will ignore the result anyway.
>
> 	if (vmbus_connection.conn_state != CONNECTED)
> 		goto always_cleanup;
>
> 	if (cpu == VMBUS_CONNECT_CPU)
> 		return -EBUSY;
>
> 	[Code to search for channels that are bound to the CPU we're
> 	 about to clean up]
>
> 	if (channel_found)
> 		return -EBUSY;
>
> 	/*
> 	 * channel_found == false means that any channels that were previously
> 	 * assigned to the CPU have been reassigned elsewhere with a call of
> 	 * vmbus_send_modifychannel().  Scan the event flags page looking for
> 	 * bits that are set and waiting with a timeout for vmbus_chan_sched()
> 	 * to process such bits.  If bits are still set after this operation
> 	 * and VMBus is connected, fail the CPU offlining operation.
> 	 */
> 	if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending())
> 		return -EBUSY;
>
> always_cleanup:

Agreed, applied.  Thank you for the suggestion,

  Andrea
Re: [PATCH v2 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
> > @@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent( > > * 5 . 0 (Newer Windows 10) > > * 5 . 1 (Windows 10 RS4) > > * 5 . 2 (Windows Server 2019, RS5) > > + * 5 . 3 (Windows Server 2021) // FIXME: use proper version number/name > > The official name is now public information as "Windows Server 2022". Thank you, I've updated the name and removed the FIXME. Andrea
[PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding all issues with allocating enough entries in the VMbus requestor. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) --- Changes since RFC: - pass sentinel values for {init,reset}_request in vmbus_sendpacket() - remove/inline the storvsc_request_addr() callback - make storvsc_next_request_id() 'static' - add code to handle the case of an unsolicited message from Hyper-V - other minor/style changes [1] https://lkml.kernel.org/r/20210408161315.341888-1-parri.and...@gmail.com drivers/hv/channel.c | 14 ++--- drivers/hv/ring_buffer.c | 13 +++-- drivers/net/hyperv/netvsc.c | 8 ++- drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 94 +-- include/linux/hyperv.h| 13 - 6 files changed, 95 insertions(+), 49 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..f78e02ace51e8 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); * vmbus_next_request_id - Returns a new request id. It is also * the index at which the guest memory address is stored. * Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @rqst_add: Guest memory address to be stored in the array */ -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 current_id; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); /* * vmbus_request_addr - Returns the memory address stored at @trans_id * in @rqstor. Uses a spin lock to avoid race conditions. 
- * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's * next request id. */ -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 req_addr; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..2bf57677272b5 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, */ if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { - rqst_id = vmbus_next_request_id(>requestor, requestid); - if (rqst_id == VMBUS_RQST_ERROR) { - spin_unlock_irqrestore(_info->ring_lock, flags); - return -EAGAIN; + if (channel->next_request_id_callback != NULL) { + rqst_id = channel->next_request_id_callback(channel, requestid); + if (rqst_id == VMBUS_RQST_ERROR) { + spin_unlock_irqrestore(_info->ring_lock, flags); + return -EAGAIN; + } } } desc = hv_get_ring_buffer(outring_info) + old_write; @@ -341,7 +343,8 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, if (channel->rescind) { if (rqst_id != VMBUS_NO_RQSTOR) { /* Reclaim request ID to avoid leak of IDs */ - vmbus_request_addr(>requestor, rqst_id); + if (channel->request_addr_callback != NULL) + channel->request_addr_callback(channel, rqst_id); } return -ENODEV; } diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c64cc7639c39c..1a221ce2d6fdc 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -730,7 +730,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev, int queue_sends; u64 cmd_rqst; - cmd_rqst = 
vmbus_request_addr(&channel->requestor, (u64)desc->trans_id); + cmd_rqst = channel->request_addr_callback(channel, (u64)desc->trans_id); if (cmd_rqst == VMBUS_RQST_ERROR) { netdev_err(ndev, "Incorrect transaction id\n"); return; } @@ -790,
[PATCH v2 3/3] Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline
Check that enough time has passed such that the modify channel message has been processed before taking a CPU offline. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/hv.c | 49 + 1 file changed, 49 insertions(+) diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 3e6ff83adff42..dc9aa1130b22f 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -292,6 +293,41 @@ void hv_synic_disable_regs(unsigned int cpu) disable_percpu_irq(vmbus_irq); } +#define HV_MAX_TRIES 3 +/* + * Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one + * bit set, then wait for a few milliseconds. Repeat these steps for a maximum of 3 times. + * Return 'true', if there is still any set bit after this operation; 'false', otherwise. + * + * If a bit is set, that means there is a pending channel interrupt. The expectation is + * that the normal interrupt handling mechanism will find and process the channel interrupt + * "very soon", and in the process clear the bit. 
+ */ +static bool hv_synic_event_pending(void) +{ + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); + union hv_synic_event_flags *event = + (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT; + unsigned long *recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ + bool pending; + u32 relid; + int tries = 0; + +retry: + pending = false; + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { + /* Special case - VMBus channel protocol messages */ + if (relid == 0) + continue; + pending = true; + break; + } + if (pending && tries++ < HV_MAX_TRIES) { + usleep_range(1, 2); + goto retry; + } + return pending; +} int hv_synic_cleanup(unsigned int cpu) { @@ -336,6 +372,19 @@ int hv_synic_cleanup(unsigned int cpu) if (channel_found && vmbus_connection.conn_state == CONNECTED) return -EBUSY; + if (vmbus_proto_version >= VERSION_WIN10_V4_1) { + /* +* channel_found == false means that any channels that were previously +* assigned to the CPU have been reassigned elsewhere with a call of +* vmbus_send_modifychannel(). Scan the event flags page looking for +* bits that are set and waiting with a timeout for vmbus_chan_sched() +* to process such bits. If bits are still set after this operation +* and VMBus is connected, fail the CPU offlining operation. +*/ + if (hv_synic_event_pending() && vmbus_connection.conn_state == CONNECTED) + return -EBUSY; + } + hv_stimer_legacy_cleanup(cpu); hv_synic_disable_regs(cpu); -- 2.25.1
[PATCH v2 2/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE message type, and code to receive and process such a message. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 11 - 5 files changed, 152 insertions(+), 19 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..aa4ef75d8dee2 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -209,31 +209,96 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id, } EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request); +static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel msg; + int ret; + + memset(, 0, sizeof(msg)); + msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg.child_relid = channel->offermsg.child_relid; + msg.target_vp = target_vp; + + ret = vmbus_post_msg(, sizeof(msg), true); + trace_vmbus_send_modifychannel(, ret); + + return ret; +} + +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp) +{ + struct vmbus_channel_modifychannel *msg; + struct vmbus_channel_msginfo *info; + unsigned long flags; + int ret; + + info = kzalloc(sizeof(struct vmbus_channel_msginfo) + + sizeof(struct vmbus_channel_modifychannel), + GFP_KERNEL); + if (!info) + return -ENOMEM; + + init_completion(>waitevent); + info->waiting_channel = channel; + + msg = (struct vmbus_channel_modifychannel *)info->msg; + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL; + msg->child_relid = channel->offermsg.child_relid; + msg->target_vp = target_vp; + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_add_tail(>msglistentry, _connection.chn_msg_list); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + ret = vmbus_post_msg(msg, sizeof(*msg), true); + trace_vmbus_send_modifychannel(msg, ret); + if (ret != 0) { + 
spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + goto free_info; + } + + /* +* Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on +* the mutex and be unable to signal the completion. +* +* See the caller target_cpu_store() for information about the usage of the +* mutex. +*/ + mutex_unlock(_connection.channel_mutex); + wait_for_completion(>waitevent); + mutex_lock(_connection.channel_mutex); + + spin_lock_irqsave(_connection.channelmsg_lock, flags); + list_del(>msglistentry); + spin_unlock_irqrestore(_connection.channelmsg_lock, flags); + + if (info->response.modify_response.status) + ret = -EAGAIN; + +free_info: + kfree(info); + return ret; +} + /* * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt. * - * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not - * ACK such messages. IOW we can't know when the host will stop interrupting - * the "old" vCPU and start interrupting the "new" vCPU for the given channel. + * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. When VMbus version 5.3 + * or later is negotiated, Hyper-V always sends an ACK in response to such a + * message. For VMbus version 5.2 and earlier, it never sends an ACK. With- + * out an ACK, we can not know when the host will stop interrupting the "old" + * vCPU and start interrupting the "new" vCPU for the given channel. * * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version * VERSION_WIN10_V4_1. 
*/ -int vmbus_send_modifychannel(u32 child_relid, u32 target_vp) +int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp) { - struct vmbus_channel_modifychannel conn_msg; - int ret; - - memset(_msg, 0, sizeof(conn_msg)); - conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL; - conn_msg.child_relid = child_relid; - conn_msg.target_vp = target_vp; - - ret = vmbus_post_msg(_msg, sizeof(conn_msg), true); - - trace_vmbus_send_modifychannel(_msg, ret); - - return ret; + if (vmbus_proto_version >= VERSION_WIN10_V5_3) + return send_modifychannel_with_ack(channel, target_vp); + return send_modifychannel_without_ack(channel, target_vp); } EXPORT_SYMBOL_GPL(vmbus_send_modifychannel); diff --git a/drivers/hv/channel_mgmt.c b/dr
[PATCH v2 0/3] Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE
Changes since v1[1]: - rebase on hyperv-next - split changes into three patches - fix error handling in send_modifychannel_with_ack() - remove rescind checks in send_modifychannel_with_ack() - remove unneeded test in hv_synic_event_pending() - add/amend inline comments - style changes [1] https://lkml.kernel.org/r/20201126191210.13115-1-parri.and...@gmail.com Andrea Parri (Microsoft) (3): Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3 Drivers: hv: vmbus: Introduce CHANNELMSG_MODIFYCHANNEL_RESPONSE Drivers: hv: vmbus: Check for pending channel interrupts before taking a CPU offline drivers/hv/channel.c | 99 --- drivers/hv/channel_mgmt.c | 42 + drivers/hv/connection.c | 3 +- drivers/hv/hv.c | 49 +++ drivers/hv/hv_trace.h | 15 ++ drivers/hv/vmbus_drv.c| 4 +- include/linux/hyperv.h| 13 - 7 files changed, 205 insertions(+), 20 deletions(-) -- 2.25.1
[PATCH v2 1/3] Drivers: hv: vmbus: Introduce and negotiate VMBus protocol version 5.3
Hyper-V has added VMBus protocol version 5.3. Allow Linux guests to negotiate the new version on versions of Hyper-V that support it. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 3 ++- include/linux/hyperv.h | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 350e8c5cafa8c..dc19d5ae4373c 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -45,6 +45,7 @@ EXPORT_SYMBOL_GPL(vmbus_proto_version); * Table of VMBus versions listed from newest to oldest. */ static __u32 vmbus_versions[] = { + VERSION_WIN10_V5_3, VERSION_WIN10_V5_2, VERSION_WIN10_V5_1, VERSION_WIN10_V5, @@ -60,7 +61,7 @@ static __u32 vmbus_versions[] = { * Maximal VMBus protocol version guests can negotiate. Useful to cap the * VMBus version for testing and debugging purpose. */ -static uint max_version = VERSION_WIN10_V5_2; +static uint max_version = VERSION_WIN10_V5_3; module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 2c18c8e768efe..d6a6f76040b5f 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -234,6 +234,7 @@ static inline u32 hv_get_avail_to_write_percent( * 5 . 0 (Newer Windows 10) * 5 . 1 (Windows 10 RS4) * 5 . 2 (Windows Server 2019, RS5) + * 5 . 3 (Windows Server 2021) // FIXME: use proper version number/name */ #define VERSION_WS2008 ((0 << 16) | (13)) @@ -245,6 +246,7 @@ static inline u32 hv_get_avail_to_write_percent( #define VERSION_WIN10_V5 ((5 << 16) | (0)) #define VERSION_WIN10_V5_1 ((5 << 16) | (1)) #define VERSION_WIN10_V5_2 ((5 << 16) | (2)) +#define VERSION_WIN10_V5_3 ((5 << 16) | (3)) /* Make maximum size of pipe payload of 16K */ #define MAX_PIPE_DATA_PAYLOAD (sizeof(u8) * 16384) -- 2.25.1
Re: [PATCH] Drivers: hv: vmbus: Use after free in __vmbus_open()
On Tue, Apr 13, 2021 at 01:50:04PM +0300, Dan Carpenter wrote: > The "open_info" variable is added to the &vmbus_connection.chn_msg_list, > but the error handling frees "open_info" without removing it from the > list. This will result in a use after free. First remove it from the > list, and then free it. > > Fixes: 6f3d791f3006 ("Drivers: hv: vmbus: Fix rescind handling issues") > Signed-off-by: Dan Carpenter I had this 'queued' in my list, Reviewed-by: Andrea Parri Andrea > --- > From static analysis. Untested etc. There is almost certainly a good > reason to add it to the list before checking "newchannel->rescind" but I > don't know the code well enough to know what the reason is. > > drivers/hv/channel.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/hv/channel.c > b/drivers/hv/channel.c > index db30be8f9cce..1c5a418c1962 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -653,7 +653,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > > if (newchannel->rescind) { > err = -ENODEV; > - goto error_free_info; > + goto error_clean_msglist; > } > > err = vmbus_post_msg(open_msg, > -- > 2.30.2 >
Re: [RFC PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
On Fri, Apr 09, 2021 at 03:38:14PM +, Michael Kelley wrote: > From: Andrea Parri (Microsoft) Sent: Thursday, April > 8, 2021 9:13 AM > > > > Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding > > all issues with allocating enough entries in the VMbus requestor. > > This looks good to me! I'm glad to see that the idea worked without > too much complexity. > > See a few comments inline below. Thank you for these suggestions; I've tried to implement them, cf. the diff at the bottom of this email (on top of this RFC, plus 'change the storvsc callbacks to 'static''). I like the result, however this does not work well yet: I am getting 'Incorrect transaction id' messages at boot time with this diff; I'll dig more tomorrow... hints are welcome! Andrea > > > > > Suggested-by: Michael Kelley > > Signed-off-by: Andrea Parri (Microsoft) > > --- > > drivers/hv/channel.c | 14 +++--- > > drivers/hv/ring_buffer.c | 12 ++--- > > drivers/net/hyperv/netvsc.c | 8 ++-- > > drivers/net/hyperv/rndis_filter.c | 2 + > > drivers/scsi/storvsc_drv.c| 73 ++- > > include/linux/hyperv.h| 13 +- > > 6 files changed, 92 insertions(+), 30 deletions(-) > > > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > > index db30be8f9ccea..f78e02ace51e8 100644 > > --- a/drivers/hv/channel.c > > +++ b/drivers/hv/channel.c > > @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); > > * vmbus_next_request_id - Returns a new request id. It is also > > * the index at which the guest memory address is stored. > > * Uses a spin lock to avoid race conditions. 
> > - * @rqstor: Pointer to the requestor struct > > + * @channel: Pointer to the VMbus channel struct > > * @rqst_add: Guest memory address to be stored in the array > > */ > > -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) > > +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) > > { > > + struct vmbus_requestor *rqstor = >requestor; > > unsigned long flags; > > u64 current_id; > > - const struct vmbus_channel *channel = > > - container_of(rqstor, const struct vmbus_channel, requestor); > > > > /* Check rqstor has been initialized */ > > if (!channel->rqstor_size) > > @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); > > /* > > * vmbus_request_addr - Returns the memory address stored at @trans_id > > * in @rqstor. Uses a spin lock to avoid race conditions. > > - * @rqstor: Pointer to the requestor struct > > + * @channel: Pointer to the VMbus channel struct > > * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's > > * next request id. 
> > */ > > -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) > > +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) > > { > > + struct vmbus_requestor *rqstor = >requestor; > > unsigned long flags; > > u64 req_addr; > > - const struct vmbus_channel *channel = > > - container_of(rqstor, const struct vmbus_channel, requestor); > > > > /* Check rqstor has been initialized */ > > if (!channel->rqstor_size) > > diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c > > index ecd82ebfd5bc4..46d8e038e4ee1 100644 > > --- a/drivers/hv/ring_buffer.c > > +++ b/drivers/hv/ring_buffer.c > > @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, > > */ > > > > if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { > > - rqst_id = vmbus_next_request_id(>requestor, requestid); > > - if (rqst_id == VMBUS_RQST_ERROR) { > > - spin_unlock_irqrestore(_info->ring_lock, flags); > > - return -EAGAIN; > > + if (channel->next_request_id_callback != NULL) { > > + rqst_id = channel->next_request_id_callback(channel, > > requestid); > > + if (rqst_id == VMBUS_RQST_ERROR) { > > + > > spin_unlock_irqrestore(_info->ring_lock, flags); > > + return -EAGAIN; > > + } > > } > > } > > desc = hv_get_ring_buffer(outring_info) + old_write; > > @@ -341,7 +343,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, > > if (channel->rescind) { > >
[PATCH net] net: seg6: trivial fix of a spelling mistake in comment
There is a comment spelling mistake "interfarence" -> "interference" in function parse_nla_action(). Fix it. Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 8936f48570fc..bd7140885e60 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -1475,7 +1475,7 @@ static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt) /* Forcing the desc->optattrs *set* and the desc->attrs *set* to be * disjoined, this allow us to release acquired resources by optional * attributes and by required attributes independently from each other -* without any interfarence. +* without any interference. * In other terms, we are sure that we do not release some the acquired * resources twice. * -- 2.20.1
Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
On Wed, 7 Apr 2021 16:55:41 -0600 David Ahern wrote: > On 4/7/21 12:03 PM, Andrea Mayer wrote: > > diff --git a/include/uapi/linux/seg6_local.h > > b/include/uapi/linux/seg6_local.h > > index 3b39ef1dbb46..ae5e3fd12b73 100644 > > --- a/include/uapi/linux/seg6_local.h > > +++ b/include/uapi/linux/seg6_local.h > > @@ -27,6 +27,7 @@ enum { > > SEG6_LOCAL_OIF, > > SEG6_LOCAL_BPF, > > SEG6_LOCAL_VRFTABLE, > > + SEG6_LOCAL_COUNTERS, > > __SEG6_LOCAL_MAX, > > }; > > #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) > > @@ -78,4 +79,11 @@ enum { > > > > #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) > > > > +/* SRv6 Behavior counters */ > > +struct seg6_local_counters { > > + __u64 rx_packets; > > + __u64 rx_bytes; > > + __u64 rx_errors; > > +}; > > + > > #endif > > It's highly likely that more stats would get added over time. It would > be good to document that here for interested parties and then make sure > iproute2 can handle different sized stats structs. e.g., commit support > to your repo, then add a new one (e.g, rx_drops) and verify the > combinations handle it. e.g., old kernel - new iproute2, new kernel - > old iproute, old - old and new-new. > Hi David, thanks for your review. I totally agree with you: we may want to add other counters in the future, even if they are not considered in RFC8986. With that in mind, the shared struct seg6_local_counters is not the best way to go if we want to add other counters (because it will be difficult to manage different sized structures when considering different kernel/iproute2 versions). To make it easier adding new counters, instead of sharing the struct seg6_local_counters, I would use netlink nested attributes to exchange counters individually. In this way, only recognized (nested) attributes can be processed by both the kernel and iproute2. 
For example: enum { SEG6_LOCAL_CNT_UNSPEC, SEG6_LOCAL_CNT_PAD, /* padding for 64 bits values */ SEG6_LOCAL_CNT_RX_PACKETS, SEG6_LOCAL_CNT_RX_BYTES, SEG6_LOCAL_CNT_RX_ERRORS, __SEG6_LOCAL_CNT_MAX, }; #define SEG6_LOCAL_CNT_MAX (__SEG6_LOCAL_CNT_MAX - 1) updating the policy for SEG6_LOCAL_COUNTERS to NLA_NESTED. Then, I create a new policy for counters which handles each supported counter separately. static const struct nla_policy seg6_local_counters_policy[SEG6_LOCAL_CNT_MAX + 1] = { [SEG6_LOCAL_CNT_RX_PACKETS] = { .type = NLA_U64 }, [SEG6_LOCAL_CNT_RX_BYTES] = { .type = NLA_U64 }, [SEG6_LOCAL_CNT_RX_ERRORS] = { .type = NLA_U64 }, }; At the end, I update the parse_nla_counters(), put_nla_counters(), etc according to the changes, i.e: - nla_parse_nested() in parse_nla_counters(); - nla_nest_{start/end}() and for each supported counter nla_put_u64_64bit() in put_nla_counters(). On the iproute2 side, we have to update the code to reflect the changes discussed above. I plan to issue an RFC v2 in a few days. Andrea
Re: [PATCH hyperv-next] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
On Fri, Apr 09, 2021 at 03:49:00PM +, Michael Kelley wrote: > From: Andrea Parri (Microsoft) Sent: Thursday, April > 8, 2021 9:15 AM > > > > Pointers to ring-buffer packets sent by Hyper-V are used within the > > guest VM. Hyper-V can send packets with erroneous values or modify > > packet fields after they are processed by the guest. To defend > > against these scenarios, return a copy of the incoming VMBus packet > > after validating its length and offset fields in hv_pkt_iter_first(). > > In this way, the packet can no longer be modified by the host. > > Andrea -- has anything changed in this version of this patch, except > the value of NETVSC_MAX_XFER_PAGE_RANGES? It used to be a > fixed 375, but now is NVSP_RSC_MAX, which is 562. > > If that's the only change, then > > Reviewed-by: Michael Kelley The only change here is indeed the value of NETVSC_MAX_XFER_PAGE_RANGES, apologies for the omission of the changelog. Thanks for the review. Andrea
Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
On Wed, 7 Apr 2021 13:24:04 -0700 Jakub Kicinski wrote: > On Wed, 7 Apr 2021 20:03:32 +0200 Andrea Mayer wrote: > > This patch provides counters for SRv6 Behaviors as defined in [1], section > > 6. For each SRv6 Behavior instance, the counters defined in [1] are: > > > > - the total number of packets that have been correctly processed; > > - the total amount of traffic in bytes of all packets that have been > >correctly processed; > > > > In addition, we introduces a new counter that counts the number of packets > > that have NOT been properly processed (i.e. errors) by an SRv6 Behavior > > instance. > > > > Each SRv6 Behavior instance can be configured, at the time of its creation, > > to make use of counters. > > This is done through iproute2 which allows the user to create an SRv6 > > Behavior instance specifying the optional "count" attribute as shown in the > > following example: > > > > $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0 > > > > per-behavior counters can be shown by adding "-s" to the iproute2 command > > line, i.e.: > > > > $ ip -s -6 route show 2001:db8::1 > > 2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0 > > > > [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters > > > > Signed-off-by: Andrea Mayer > > > +static int put_nla_counters(struct sk_buff *skb, struct seg6_local_lwt > > *slwt) > > +{ > > + struct seg6_local_counters counters = { 0, 0, 0 }; > > + struct nlattr *nla; > > + int i; > > + > > + nla = nla_reserve(skb, SEG6_LOCAL_COUNTERS, sizeof(counters)); > > + if (!nla) > > + return -EMSGSIZE; > > nla_reserve_64bit(), IIUC netlink guarantees alignment of 64 bit values. Hi Jakub, thanks for your review! Yes, we should guarantee alignment of 64 bit values. I will definitely follow your advice. Andrea
[PATCH hyperv-next] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
From: Andres Beltran Pointers to ring-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, return a copy of the incoming VMBus packet after validating its length and offset fields in hv_pkt_iter_first(). In this way, the packet can no longer be modified by the host. Signed-off-by: Andres Beltran Co-developed-by: Andrea Parri (Microsoft) Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 9 ++-- drivers/hv/hv_fcopy.c | 1 + drivers/hv/hv_kvp.c | 1 + drivers/hv/hyperv_vmbus.h | 2 +- drivers/hv/ring_buffer.c | 82 ++- drivers/net/hyperv/hyperv_net.h | 7 +++ drivers/net/hyperv/netvsc.c | 2 + drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 10 include/linux/hyperv.h| 48 +++--- net/vmw_vsock/hyperv_transport.c | 4 +- 11 files changed, 143 insertions(+), 25 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..b665db21e120d 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -597,12 +597,15 @@ static int __vmbus_open(struct vmbus_channel *newchannel, newchannel->onchannel_callback = onchannelcallback; newchannel->channel_callback_context = context; - err = hv_ringbuffer_init(>outbound, page, send_pages); + if (!newchannel->max_pkt_size) + newchannel->max_pkt_size = VMBUS_DEFAULT_MAX_PKT_SIZE; + + err = hv_ringbuffer_init(>outbound, page, send_pages, 0); if (err) goto error_clean_ring; - err = hv_ringbuffer_init(>inbound, -[send_pages], recv_pages); + err = hv_ringbuffer_init(>inbound, [send_pages], +recv_pages, newchannel->max_pkt_size); if (err) goto error_clean_ring; diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 59ce85e00a028..660036da74495 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -349,6 +349,7 @@ int hv_fcopy_init(struct hv_util_service *srv) { recv_buffer = srv->recv_buffer; 
fcopy_transaction.recv_channel = srv->channel; + fcopy_transaction.recv_channel->max_pkt_size = HV_HYP_PAGE_SIZE * 2; /* * When this driver loads, the user level daemon that diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c index b49962d312cef..c698592b83e42 100644 --- a/drivers/hv/hv_kvp.c +++ b/drivers/hv/hv_kvp.c @@ -757,6 +757,7 @@ hv_kvp_init(struct hv_util_service *srv) { recv_buffer = srv->recv_buffer; kvp_transaction.recv_channel = srv->channel; + kvp_transaction.recv_channel->max_pkt_size = HV_HYP_PAGE_SIZE * 4; /* * When this driver loads, the user level daemon that diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 9416e09ebd58c..42f3d9d123a12 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -174,7 +174,7 @@ extern int hv_synic_cleanup(unsigned int cpu); void hv_ringbuffer_pre_init(struct vmbus_channel *channel); int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - struct page *pages, u32 pagecnt); + struct page *pages, u32 pagecnt, u32 max_pkt_size); void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..848f3bba83f8b 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -190,7 +190,7 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel) /* Initialize the ring buffer. 
*/ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, - struct page *pages, u32 page_cnt) + struct page *pages, u32 page_cnt, u32 max_pkt_size) { int i; struct page **pages_wraparound; @@ -232,6 +232,14 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, sizeof(struct hv_ring_buffer); ring_info->priv_read_index = 0; + /* Initialize buffer that holds copies of incoming packets */ + if (max_pkt_size) { + ring_info->pkt_buffer = kzalloc(max_pkt_size, GFP_KERNEL); + if (!ring_info->pkt_buffer) + return -ENOMEM; + ring_info->pkt_buffer_size = max_pkt_size; + } + spin_lock_init(_info->ring_lock); return 0; @@ -244,6 +252,9 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info) vunmap(ring_info->ring_buffer); ring_info->ring_buffer = NULL; mutex_unlock(_info->ring_buffer_mutex); + + kfree(ring_info->pkt_buffer); + ring_in
[RFC PATCH hyperv-next] scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
Use blk_mq_unique_tag() to generate requestIDs for StorVSC, avoiding all issues with allocating enough entries in the VMbus requestor. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel.c | 14 +++--- drivers/hv/ring_buffer.c | 12 ++--- drivers/net/hyperv/netvsc.c | 8 ++-- drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 73 ++- include/linux/hyperv.h| 13 +- 6 files changed, 92 insertions(+), 30 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index db30be8f9ccea..f78e02ace51e8 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -1121,15 +1121,14 @@ EXPORT_SYMBOL_GPL(vmbus_recvpacket_raw); * vmbus_next_request_id - Returns a new request id. It is also * the index at which the guest memory address is stored. * Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @rqst_add: Guest memory address to be stored in the array */ -u64 vmbus_next_request_id(struct vmbus_requestor *rqstor, u64 rqst_addr) +u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 current_id; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) @@ -1163,16 +1162,15 @@ EXPORT_SYMBOL_GPL(vmbus_next_request_id); /* * vmbus_request_addr - Returns the memory address stored at @trans_id * in @rqstor. Uses a spin lock to avoid race conditions. - * @rqstor: Pointer to the requestor struct + * @channel: Pointer to the VMbus channel struct * @trans_id: Request id sent back from Hyper-V. Becomes the requestor's * next request id. 
*/ -u64 vmbus_request_addr(struct vmbus_requestor *rqstor, u64 trans_id) +u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id) { + struct vmbus_requestor *rqstor = >requestor; unsigned long flags; u64 req_addr; - const struct vmbus_channel *channel = - container_of(rqstor, const struct vmbus_channel, requestor); /* Check rqstor has been initialized */ if (!channel->rqstor_size) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ecd82ebfd5bc4..46d8e038e4ee1 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -310,10 +310,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, */ if (desc->flags == VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED) { - rqst_id = vmbus_next_request_id(>requestor, requestid); - if (rqst_id == VMBUS_RQST_ERROR) { - spin_unlock_irqrestore(_info->ring_lock, flags); - return -EAGAIN; + if (channel->next_request_id_callback != NULL) { + rqst_id = channel->next_request_id_callback(channel, requestid); + if (rqst_id == VMBUS_RQST_ERROR) { + spin_unlock_irqrestore(_info->ring_lock, flags); + return -EAGAIN; + } } } desc = hv_get_ring_buffer(outring_info) + old_write; @@ -341,7 +343,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, if (channel->rescind) { if (rqst_id != VMBUS_NO_RQSTOR) { /* Reclaim request ID to avoid leak of IDs */ - vmbus_request_addr(>requestor, rqst_id); + channel->request_addr_callback(channel, rqst_id); } return -ENODEV; } diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c64cc7639c39c..1a221ce2d6fdc 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -730,7 +730,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev, int queue_sends; u64 cmd_rqst; - cmd_rqst = vmbus_request_addr(>requestor, (u64)desc->trans_id); + cmd_rqst = channel->request_addr_callback(channel, (u64)desc->trans_id); if (cmd_rqst == VMBUS_RQST_ERROR) { netdev_err(ndev, "Incorrect transaction id\n"); return; @@ -790,8 +790,8 @@ 
static void netvsc_send_completion(struct net_device *ndev, /* First check if this is a VMBUS completion without data payload */ if (!msglen) { - cmd_rqst = vmbus_request_addr(_channel->requestor, - (u64)desc->trans_id); + cmd_rqst = incoming_channel->request_addr_callback(incoming_channel, +
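The core idea of this RFC — deriving VMbus request IDs from block-layer tags rather than a separate allocator — leans on blk_mq_unique_tag() packing the hardware-queue index and the per-queue tag into a single u32. A minimal user-space model of that packing (the 16-bit split mirrors the kernel's BLK_MQ_UNIQUE_TAG_BITS; the helper names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Model of blk_mq_unique_tag(): hardware-queue index in the upper
 * 16 bits, per-queue tag in the lower 16. This is a sketch of the
 * packing scheme, not the kernel implementation. */
#define UNIQUE_TAG_BITS 16
#define UNIQUE_TAG_MASK ((1U << UNIQUE_TAG_BITS) - 1)

static uint32_t unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << UNIQUE_TAG_BITS) | (tag & UNIQUE_TAG_MASK);
}

static uint16_t unique_tag_to_hwq(uint32_t utag)
{
	return utag >> UNIQUE_TAG_BITS;
}

static uint16_t unique_tag_to_tag(uint32_t utag)
{
	return utag & UNIQUE_TAG_MASK;
}
```

Because the block layer guarantees tag uniqueness per hardware queue, the packed value is unique per device, which is what lets storvsc drop the requestor array bookkeeping.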
[RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
This patch provides counters for SRv6 Behaviors as defined in [1], section 6. For each SRv6 Behavior instance, the counters defined in [1] are: - the total number of packets that have been correctly processed; - the total amount of traffic in bytes of all packets that have been correctly processed; In addition, we introduces a new counter that counts the number of packets that have NOT been properly processed (i.e. errors) by an SRv6 Behavior instance. Each SRv6 Behavior instance can be configured, at the time of its creation, to make use of counters. This is done through iproute2 which allows the user to create an SRv6 Behavior instance specifying the optional "count" attribute as shown in the following example: $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0 per-behavior counters can be shown by adding "-s" to the iproute2 command line, i.e.: $ ip -s -6 route show 2001:db8::1 2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0 [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters Signed-off-by: Andrea Mayer --- include/uapi/linux/seg6_local.h | 8 ++ net/ipv6/seg6_local.c | 133 +++- 2 files changed, 139 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h index 3b39ef1dbb46..ae5e3fd12b73 100644 --- a/include/uapi/linux/seg6_local.h +++ b/include/uapi/linux/seg6_local.h @@ -27,6 +27,7 @@ enum { SEG6_LOCAL_OIF, SEG6_LOCAL_BPF, SEG6_LOCAL_VRFTABLE, + SEG6_LOCAL_COUNTERS, __SEG6_LOCAL_MAX, }; #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) @@ -78,4 +79,11 @@ enum { #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) +/* SRv6 Behavior counters */ +struct seg6_local_counters { + __u64 rx_packets; + __u64 rx_bytes; + __u64 rx_errors; +}; + #endif diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 8936f48570fc..0f905a4410bd 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -93,6 +93,20 @@ struct seg6_end_dt_info { int 
hdrlen; }; +struct pcpu_seg6_local_counters { + u64_stats_t rx_packets; + u64_stats_t rx_bytes; + u64_stats_t rx_errors; + + struct u64_stats_sync syncp; +}; + +#define seg6_local_alloc_pcpu_counters(__gfp) \ + __netdev_alloc_pcpu_stats(struct pcpu_seg6_local_counters, \ + ((__gfp) | __GFP_ZERO)) + +#define SEG6_F_LOCAL_COUNTERS SEG6_F_ATTR(SEG6_LOCAL_COUNTERS) + struct seg6_local_lwt { int action; struct ipv6_sr_hdr *srh; @@ -105,6 +119,7 @@ struct seg6_local_lwt { #ifdef CONFIG_NET_L3_MASTER_DEV struct seg6_end_dt_info dt_info; #endif + struct pcpu_seg6_local_counters __percpu *pcpu_counters; int headroom; struct seg6_action_desc *desc; @@ -878,36 +893,43 @@ static struct seg6_action_desc seg6_action_table[] = { { .action = SEG6_LOCAL_ACTION_END, .attrs = 0, + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end, }, { .action = SEG6_LOCAL_ACTION_END_X, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), + .optattrs = SEG6_F_LOCAL_COUNTERS, .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_LOCAL_COUNTERS, #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -919,30 +941,35 @@ static struct seg6_action_desc seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONF
[RFC net-next 0/1] seg6: Counters for SRv6 Behaviors
pps As can be observed, throughputs achieved in scenarios (2),(3) did not suffer any observable degradation compared to scenario (1). Comments, suggestions and improvements are very welcome! Thanks, Andrea [2] https://www.cloudlab.us Andrea Mayer (1): seg6: add counters support for SRv6 Behaviors include/uapi/linux/seg6_local.h | 8 ++ net/ipv6/seg6_local.c | 133 +++- 2 files changed, 139 insertions(+), 2 deletions(-) -- 2.20.1
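The per-behavior accounting this series adds boils down to three counters updated on every packet. A user-space model of that update path (in the kernel these are per-CPU u64_stats_t fields guarded by u64_stats_sync; plain integers stand in for them here, and the helper name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the SRv6 per-behavior counters: packets/bytes correctly
 * processed, plus an error count for packets that were not. */
struct seg6_counters {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t rx_errors;
};

/* Account one packet of 'len' bytes; 'ok' says whether the behavior
 * instance processed it correctly. */
static void seg6_count_packet(struct seg6_counters *c, uint32_t len, bool ok)
{
	if (ok) {
		c->rx_packets++;
		c->rx_bytes += len;
	} else {
		c->rx_errors++;
	}
}
```

The real patch uses `__netdev_alloc_pcpu_stats()` so each CPU updates its own copy lock-free, which is why the cover letter reports no observable throughput degradation.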
[PATCH v2 1/2] clocksource: arm_global_timer: implement rate compensation whenever source clock changes
This patch adds rate change notification support for the parent clock; should that clock change, we try to adjust our prescaler in order to compensate (i.e. we adjust it so that we still get the same timer frequency). This is loosely based on what is done in timer-cadence-ttc. timer-sun5i, mips-gic-timer and smp_twd.c also seem to look at their parent clock rate and to perform some kind of adjustment whenever needed. In this particular case we have one single counter and prescaler for all of clocksource, clockevent and timer_delay, and we just update it for all of them (i.e. we don't let the rate drift and call clockevents_update_freq() to notify the kernel that our rate has changed). Note that there is apparently no other way to fix things up, because once we call register_current_timer_delay(), specifying the timer rate, that rate is not supposed to ever change. In order for this mechanism to work, we have to make assumptions about how much the parent clock may eventually decrease with respect to its initial rate, and set our initial prescaler to a value that we can eventually decrease enough to compensate. We provide a Kconfig option for this. In case we end up in a situation in which we are not able to compensate for the parent clock change, we fail, returning NOTIFY_BAD. This fixes a real-world problem with the Zynq arch not being able to use this driver and CPU_FREQ at the same time (because the ARM global timer is fed by the CPU clock, which may keep changing when CPU_FREQ is enabled).
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- drivers/clocksource/Kconfig| 13 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 2 files changed, 125 insertions(+), 10 deletions(-) diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 39aa21d01e05..19fc5f8883e0 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -358,6 +358,19 @@ config ARM_GLOBAL_TIMER help This option enables support for the ARM global timer unit. +config ARM_GT_INITIAL_PRESCALER_VAL + int "ARM global timer initial prescaler value" + default 1 + depends on ARM_GLOBAL_TIMER + help + When the ARM global timer initializes, its current rate is declared + to the kernel and maintained forever. Should it's parent clock + change, the driver tries to fix the timer's internal prescaler. + On some machs (i.e. Zynq) the initial prescaler value thus poses + bounds about how much the parent clock is allowed to decrease or + increase wrt the initial clock value. + This affects CPU_FREQ max delta from the initial frequency. 
+ config ARM_TIMER_SP804 bool "Support for Dual Timer SP804 module" if COMPILE_TEST depends on GENERIC_SCHED_CLOCK && CLKDEV_LOOKUP diff --git a/drivers/clocksource/arm_global_timer.c b/drivers/clocksource/arm_global_timer.c index 88b2d38a7a61..60a8047fd32e 100644 --- a/drivers/clocksource/arm_global_timer.c +++ b/drivers/clocksource/arm_global_timer.c @@ -31,6 +31,10 @@ #define GT_CONTROL_COMP_ENABLE BIT(1) /* banked */ #define GT_CONTROL_IRQ_ENABLE BIT(2) /* banked */ #define GT_CONTROL_AUTO_INCBIT(3) /* banked */ +#define GT_CONTROL_PRESCALER_SHIFT 8 +#define GT_CONTROL_PRESCALER_MAX0xF +#define GT_CONTROL_PRESCALER_MASK (GT_CONTROL_PRESCALER_MAX << \ +GT_CONTROL_PRESCALER_SHIFT) #define GT_INT_STATUS 0x0c #define GT_INT_STATUS_EVENT_FLAG BIT(0) @@ -39,6 +43,7 @@ #define GT_COMP1 0x14 #define GT_AUTO_INC0x18 +#define MAX_F_ERR 50 /* * We are expecting to be clocked by the ARM peripheral clock. * @@ -46,7 +51,8 @@ * the units for all operations. */ static void __iomem *gt_base; -static unsigned long gt_clk_rate; +struct notifier_block gt_clk_rate_change_nb; +static u32 gt_psv_new, gt_psv_bck, gt_target_rate; static int gt_ppi; static struct clock_event_device __percpu *gt_evt; @@ -96,7 +102,10 @@ static void gt_compare_set(unsigned long delta, int periodic) unsigned long ctrl; counter += delta; - ctrl = GT_CONTROL_TIMER_ENABLE; + ctrl = readl(gt_base + GT_CONTROL); + ctrl &= ~(GT_CONTROL_COMP_ENABLE | GT_CONTROL_IRQ_ENABLE | + GT_CONTROL_AUTO_INC | GT_CONTROL_AUTO_INC); + ctrl |= GT_CONTROL_TIMER_ENABLE; writel_relaxed(ctrl, gt_base + GT_CONTROL); writel_relaxed(lower_32_bits(counter), gt_base + GT_COMP0); writel_relaxed(upper_32_bits(counter), gt_base + GT_COMP1); @@ -123,7 +132,7 @@ static int gt_clockevent_shutdown(struct clock_event_device *evt) static int gt_clockevent_set_periodic(struct clock_event_device *evt) { - gt_compare_set
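The compensation in this patch reduces to choosing a prescaler value so that parent_rate / (psv + 1) still equals the target timer rate; when no in-range prescaler gets within MAX_F_ERR of the target, the notifier fails with NOTIFY_BAD. A standalone sketch of that computation (a linear search is used here for clarity; the helper name and search strategy are illustrative, not the driver's exact code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GT_CONTROL_PRESCALER_MAX 0xF
#define MAX_F_ERR 50 /* max tolerated rate error, in Hz, as in the driver */

/* Pick the prescaler value psv such that parent_rate / (psv + 1)
 * matches target_rate within MAX_F_ERR; return -1 when no in-range
 * value works (the driver would return NOTIFY_BAD in that case). */
static int gt_pick_prescaler(uint64_t parent_rate, uint64_t target_rate)
{
	for (int psv = 0; psv <= GT_CONTROL_PRESCALER_MAX; psv++) {
		int64_t err = (int64_t)(parent_rate / (uint64_t)(psv + 1)) -
			      (int64_t)target_rate;
		if (llabs(err) <= MAX_F_ERR)
			return psv;
	}
	return -1;
}
```

This also shows why ARM_GT_INITIAL_PRESCALER_VAL matters: starting from a larger prescaler leaves headroom to *decrease* it when CPU_FREQ lowers the parent clock.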
[PATCH v2 2/2] arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
Now the ARM global timer driver can work even if its source clock rate changes, so we don't need to disable that driver when CPU frequency scaling is in use. This causes the Zynq arch to gain support for timer delay and get_cycles(). Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm/mach-zynq/Kconfig b/arch/arm/mach-zynq/Kconfig index 43fb941dcd07..a56748d671c4 100644 --- a/arch/arm/mach-zynq/Kconfig +++ b/arch/arm/mach-zynq/Kconfig @@ -6,7 +6,7 @@ config ARCH_ZYNQ select ARCH_SUPPORTS_BIG_ENDIAN select ARM_AMBA select ARM_GIC - select ARM_GLOBAL_TIMER if !CPU_FREQ + select ARM_GLOBAL_TIMER select CADENCE_TTC_TIMER select HAVE_ARM_SCU if SMP select HAVE_ARM_TWD if SMP diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 19fc5f8883e0..9fa28237715a 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -360,6 +360,7 @@ config ARM_GLOBAL_TIMER config ARM_GT_INITIAL_PRESCALER_VAL int "ARM global timer initial prescaler value" + default 2 if ARCH_ZYNQ default 1 depends on ARM_GLOBAL_TIMER help -- 2.17.1
[PATCH v2 0/2] Fix missing entropy on Zynq arch due to get_cycles() not supported
Changes wrt v1: - rebased on latest kernel tree - fix cover letter: does -> doesn't A real-world problem has been seen with a Zynq-based board running Debian 10, where the ssh daemon takes a really long time to come up at boot. This is due to a lack of random numbers. Since commit 50ee7529ec450 ("random: try to actively add entropy rather than passively wait for it") we try to generate entropy whenever we are short of random numbers and someone needs them. This trick works only when a CPU cycle counter is available. On ARM this means that get_cycles() works and, in turn, read_current_timer() works. Zynq HW includes two "Cadence TTC" timers and the "ARM global timer". All these pieces of HW are fed by the CPU clock, which dynamically changes whenever CPU frequency scaling is enabled. The timer-cadence-ttc driver handles this scenario by looking at parent clock changes and adjusting things when required; it is the only usable clocksource when CPU frequency scaling is in use, because the arm_global_timer driver is disabled in Kconfig when CPU_FREQ is enabled for the Zynq arch. Unfortunately the timer-cadence-ttc driver doesn't register itself via register_current_timer_delay(), and that ultimately causes get_cycles() to always return zero, leading to the aforementioned lack-of-entropy problem. I believe the reason for this is that the Cadence TTC counter on Zynq silicon is only 16 bits wide. This patchset works around this by implementing, in the ARM global timer driver, a mechanism to compensate for parent clock variations, similar to what is done in the Cadence TTC timer driver, so that it can be used together with CPU frequency scaling on the Zynq arch. This proved to finally fix the problem on my Zynq-based Z-turn board.
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann Andrea Merello (2): clocksource: arm_global_timer: implement rate compensation whenever source clock changes arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig| 14 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 3 files changed, 127 insertions(+), 11 deletions(-) -- 2.17.1
Re: [PATCH 3/3] scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()
Hi Olaf, On Mon, Mar 29, 2021 at 06:37:21PM +0200, Olaf Hering wrote: > On Thu, Dec 17, Andrea Parri (Microsoft) wrote: > > > Check that the packet is of the expected size at least, don't copy data > > past the packet. > > > + if (hv_pkt_datalen(desc) < sizeof(struct vstor_packet) - > > + stor_device->vmscsi_size_delta) { > > + dev_err(>device, "Invalid packet len\n"); > > + continue; > > + } > > + > > Sorry for being late: > > It might be just cosmetic, but should this check be done prior the call to > vmbus_request_addr()? TBH, I'm not immediately seeing why it 'should'; it could make sense to move the check on the packet data length. > Unrelated: my copy of vmbus_request_addr() can return 0, which is apparently > not handled by this loop in storvsc_on_channel_callback(). Indeed, IDs of 0 are reserved for so called unsolicited messages; I think we should check that storvsc_on_io_completion() is not called on such messages. Thanks, Andrea
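The check under discussion is a plain minimum-length guard: before the payload is interpreted as a vstor_packet, the packet's data length must cover the structure (minus the negotiated vmscsi_size_delta, which accounts for fields absent on older protocol versions). As a standalone model (the function name and sizes are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the guard in storvsc_on_channel_callback(): reject packets
 * whose payload cannot hold a (possibly shrunk) vstor_packet. */
static bool storvsc_pkt_len_ok(size_t pkt_datalen, size_t vstor_packet_size,
			       size_t vmscsi_size_delta)
{
	/* Assumes vmscsi_size_delta <= vstor_packet_size, which the
	 * protocol negotiation guarantees in the driver. */
	return pkt_datalen >= vstor_packet_size - vmscsi_size_delta;
}
```

Whether this runs before or after the request-ID lookup only affects which malformed packets still consume a lookup; the copy past the packet is prevented either way.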
Re: [PATCH] leds: trigger: fix potential deadlock with libata
On Sun, Mar 07, 2021 at 10:02:32AM +0800, Boqun Feng wrote: > On Sat, Mar 06, 2021 at 09:39:54PM +0100, Marc Kleine-Budde wrote: > > Hello *, > > > > On 02.11.2020 11:41:52, Andrea Righi wrote: > > > We have the following potential deadlock condition: > > > > > > > > > WARNING: possible irq lock inversion dependency detected > > > 5.10.0-rc2+ #25 Not tainted > > > > > > swapper/3/0 just changed the state of lock: > > > 8880063bd618 (>lock){-...}-{2:2}, at: > > > ata_bmdma_interrupt+0x27/0x200 > > > but this lock took another, HARDIRQ-READ-unsafe lock in the past: > > > (>leddev_list_lock){.+.?}-{2:2} > > > > > > and interrupts could create inverse lock ordering between them. > > > > [...] > > > > > --- > > > drivers/leds/led-triggers.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c > > > index 91da90cfb11d..16d1a93a10a8 100644 > > > --- a/drivers/leds/led-triggers.c > > > +++ b/drivers/leds/led-triggers.c > > > @@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig, > > > enum led_brightness brightness) > > > { > > > struct led_classdev *led_cdev; > > > + unsigned long flags; > > > > > > if (!trig) > > > return; > > > > > > - read_lock(>leddev_list_lock); > > > + read_lock_irqsave(>leddev_list_lock, flags); > > > list_for_each_entry(led_cdev, >led_cdevs, trig_list) > > > led_set_brightness(led_cdev, brightness); > > > - read_unlock(>leddev_list_lock); > > > + read_unlock_irqrestore(>leddev_list_lock, flags); > > > } > > > EXPORT_SYMBOL_GPL(led_trigger_event); > > > > meanwhile this patch hit v5.10.x stable and caused a performance > > degradation on our use case: > > > > It's an embedded ARM system, 4x Cortex A53, with an SPI attached CAN > > controller. CAN stands for Controller Area Network and here used to > > connect to some automotive equipment. Over CAN an ISOTP (a CAN-specific > > Transport Protocol) transfer is running. 
With this patch, we see CAN > > frames delayed for ~6ms, the usual gap between CAN frames is 240µs. > > > > Reverting this patch, restores the old performance. > > > > What is the best way to solve this dilemma? Identify the critical path > > in our use case? Is there a way we can get around the irqsave in > > led_trigger_event()? > > > > Probably, we can change from rwlock to rcu here, POC code as follow, > only compile tested. Marc, could you see whether this help the > performance on your platform? Please note that I haven't test it in a > running kernel and I'm not that familir with led subsystem, so use it > with caution ;-) If we don't want to touch the led subsystem at all maybe we could try to fix the problem in libata, we just need to prevent calling led_trigger_blink_oneshot() with >lock held from ata_qc_complete(), maybe doing the led blinking from another context (a workqueue for example)? -Andrea
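The alternative floated at the end — moving the blink out of the completion path into another context such as a workqueue — can be modeled in user space with a worker thread draining a counter, so the hot path only takes a short lock and the slow work never runs with interrupts disabled. A rough pthread-based model of that design (all names are illustrative; this is not libata or LED-core code):

```c
#include <pthread.h>

/* User-space model of deferring LED blinking to a workqueue: the
 * "completion path" only bumps a pending counter under a short lock;
 * a worker thread performs the (potentially slow) blinking. */
struct blink_work {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int pending;	/* blinks requested, not yet performed */
	int blinked;	/* blinks performed by the worker */
	int stop;
};

/* Called from the (modeled) completion path: cheap and non-blocking. */
static void request_blink(struct blink_work *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending++;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

static void *blink_worker(void *arg)
{
	struct blink_work *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->pending && !w->stop)
			pthread_cond_wait(&w->cond, &w->lock);
		if (w->pending) {
			w->pending--;
			w->blinked++;	/* stands in for led_trigger_blink_oneshot() */
		} else {
			break;		/* stop requested and queue drained */
		}
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}
```

The trade-off versus the irqsave (or RCU) approaches is latency: the blink happens slightly later, but the completion path stays short.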
[PATCH] Drivers: hv: vmbus: Drop error message when 'No request id available'
Running out of request IDs on a channel essentially produces the same effect as running out of space in the ring buffer, in that -EAGAIN is returned. The error message in hv_ringbuffer_write() should either be dropped (since we don't output a message when the ring buffer is full) or be made conditional/debug-only. Suggested-by: Michael Kelley Signed-off-by: Andrea Parri (Microsoft) Fixes: e8b7db38449ac ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening") --- drivers/hv/ring_buffer.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 35833d4d1a1dc..ecd82ebfd5bc4 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -313,7 +313,6 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, rqst_id = vmbus_next_request_id(&channel->requestor, requestid); if (rqst_id == VMBUS_RQST_ERROR) { spin_unlock_irqrestore(&outring_info->ring_lock, flags); - pr_err("No request id available\n"); return -EAGAIN; } } -- 2.25.1
[PATCH net] hv_netvsc: Fix validation in netvsc_linkstatus_callback()
Contrary to the RNDIS protocol specification, certain (pre-Fe) implementations of Hyper-V's vSwitch did not account for the status buffer field in the length of an RNDIS packet; the bug was fixed in newer implementations. Validate the status buffer fields using the length of the 'vmtransfer_page' packet (all implementations), that is known/validated to be less than or equal to the receive section size and not smaller than the length of the RNDIS message. Reported-by: Dexuan Cui Suggested-by: Haiyang Zhang Signed-off-by: Andrea Parri (Microsoft) Fixes: 505e3f00c3f36 ("hv_netvsc: Add (more) validation for untrusted Hyper-V values") --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc_drv.c | 13 + drivers/net/hyperv/rndis_filter.c | 2 +- 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index e1a497d3c9ba4..59ac04a610adb 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -229,7 +229,7 @@ int netvsc_send(struct net_device *net, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, struct rndis_message *resp, - void *data); + void *data, u32 data_buflen); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8176fa0c8b168..15f262b70489e 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -744,7 +744,7 @@ static netdev_tx_t netvsc_start_xmit(struct sk_buff *skb, */ void netvsc_linkstatus_callback(struct net_device *net, struct rndis_message *resp, - void *data) + void *data, u32 data_buflen) { struct rndis_indicate_status *indicate = >msg.indicate_status; struct net_device_context *ndev_ctx = netdev_priv(net); @@ -765,11 +765,16 @@ void netvsc_linkstatus_callback(struct net_device *net, if (indicate->status == RNDIS_STATUS_LINK_SPEED_CHANGE) { u32 speed; - 
/* Validate status_buf_offset */ + /* Validate status_buf_offset and status_buflen. +* +* Certain (pre-Fe) implementations of Hyper-V's vSwitch didn't account +* for the status buffer field in resp->msg_len; perform the validation +* using data_buflen (>= resp->msg_len). +*/ if (indicate->status_buflen < sizeof(speed) || indicate->status_buf_offset < sizeof(*indicate) || - resp->msg_len - RNDIS_HEADER_SIZE < indicate->status_buf_offset || - resp->msg_len - RNDIS_HEADER_SIZE - indicate->status_buf_offset + data_buflen - RNDIS_HEADER_SIZE < indicate->status_buf_offset || + data_buflen - RNDIS_HEADER_SIZE - indicate->status_buf_offset < indicate->status_buflen) { netdev_err(net, "invalid rndis_indicate_status packet\n"); return; diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 123cc9d25f5ed..c0e89e107d575 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -620,7 +620,7 @@ int rndis_filter_receive(struct net_device *ndev, case RNDIS_MSG_INDICATE: /* notification msgs */ - netvsc_linkstatus_callback(ndev, rndis_msg, data); + netvsc_linkstatus_callback(ndev, rndis_msg, data, buflen); break; default: netdev_err(ndev, -- 2.25.1
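The fix amounts to four ordered bounds checks on host-supplied offsets and lengths, taken against the already-validated data_buflen instead of the untrusted resp->msg_len, and ordered so no subtraction can underflow. A standalone model of the check (the constants are stand-ins for the real RNDIS sizes):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the RNDIS sizes involved (illustrative values). */
#define RNDIS_HEADER_SIZE   8	/* bytes preceding the message body */
#define INDICATE_HDR_SIZE  20	/* sizeof(struct rndis_indicate_status) */
#define SPEED_SIZE          4	/* sizeof(u32 speed) */

/* Model of the fixed validation. Assumes data_buflen >= RNDIS_HEADER_SIZE,
 * which rndis_filter_receive() has already established; the guard order
 * ensures the later subtractions cannot underflow. */
static bool status_buf_ok(uint32_t data_buflen, uint32_t status_buf_offset,
			  uint32_t status_buflen)
{
	if (status_buflen < SPEED_SIZE ||
	    status_buf_offset < INDICATE_HDR_SIZE ||
	    data_buflen - RNDIS_HEADER_SIZE < status_buf_offset ||
	    data_buflen - RNDIS_HEADER_SIZE - status_buf_offset < status_buflen)
		return false;
	return true;
}
```

Using data_buflen (>= resp->msg_len) makes the check robust against pre-Fe vSwitches that under-report msg_len, which is exactly the bug the patch works around.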
Re: [PATCH AUTOSEL 5.11 50/67] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 02:16:00PM +0100, Andrea Parri wrote: > On Wed, Feb 24, 2021 at 07:50:08AM -0500, Sasha Levin wrote: > > From: "Andrea Parri (Microsoft)" > > > > [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ] > > > > __vmbus_open() and vmbus_teardown_gpadl() do not inizialite the memory > > for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown > > objects they allocate respectively. These objects contain padding bytes > > and fields that are left uninitialized and that are later sent to the > > host, potentially leaking guest data. Zero initialize such fields to > > avoid leaking sensitive information to the host. > > > > Reported-by: Juan Vazquez > > Signed-off-by: Andrea Parri (Microsoft) > > Reviewed-by: Michael Kelley > > Link: > > https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com > > Signed-off-by: Wei Liu > > Signed-off-by: Sasha Levin > > Sasha - This patch is one of a group of patches where a Linux guest running on > Hyper-V will start assuming that hypervisor behavior might be malicious, and > guards against such behavior. Because this is a new assumption, these patches > are more properly treated as new functionality rather than as bug fixes. So I > would propose that we *not* bring such patches back to stable branches. For future/similar cases: I'm wondering, is there some way to annotate a patch with "please do not bring it back"? Thanks, Andrea
Re: [PATCH AUTOSEL 4.14 15/16] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:55:12AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 5bf633c15cd4b..6ddda97030628 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -942,6 +942,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 63cd81e5610d1..22e2c2d75361e 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -710,6 +710,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
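The fix quoted above is a claim-the-channel-once idiom: the first vmbus_onoffer_rescind() instance to take the mutex sets rescind_ref and proceeds with the teardown; any later instance (triggered by a duplicate rescind message from a malicious host) sees the flag and backs off. A small user-space model of that idiom (names are illustrative):

```c
#include <pthread.h>
#include <stdbool.h>

/* Model of the rescind_ref idiom: whoever sets the flag first, under
 * the mutex, owns the teardown; every later caller must back off,
 * preventing a double release / use-after-free of the channel. */
struct channel {
	pthread_mutex_t lock;	/* stands in for vmbus_connection.channel_mutex */
	bool rescind_ref;
};

/* Returns true for exactly one caller per channel lifetime. */
static bool claim_rescind(struct channel *ch)
{
	bool claimed = false;

	pthread_mutex_lock(&ch->lock);
	if (!ch->rescind_ref) {
		ch->rescind_ref = true;
		claimed = true;
	}
	pthread_mutex_unlock(&ch->lock);
	return claimed;
}
```

Because the test-and-set happens entirely under the same mutex used by the channel lookup, no second instance can observe the channel between the check and the flag update.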
Re: [PATCH AUTOSEL 5.10 40/56] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 07:51:56AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ] > > __vmbus_open() and vmbus_teardown_gpadl() do not inizialite the memory > for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown > objects they allocate respectively. These objects contain padding bytes > and fields that are left uninitialized and that are later sent to the > host, potentially leaking guest data. Zero initialize such fields to > avoid leaking sensitive information to the host. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > index fbdda9938039a..f9f04b5cd303f 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -548,7 +548,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > goto error_clean_ring; > > /* Create and init the channel open message */ > - open_info = kmalloc(sizeof(*open_info) + > + open_info = kzalloc(sizeof(*open_info) + > sizeof(struct vmbus_channel_open_channel), > GFP_KERNEL); > if (!open_info) { > @@ -674,7 +674,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, > u32 gpadl_handle) > unsigned long flags; > int ret; > > - info = kmalloc(sizeof(*info) + > + info = kzalloc(sizeof(*info) + > sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL); > if (!info) > return -ENOMEM; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.10 41/56] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:51:57AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 1d44bb635bb84..a9f58840f85dc 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1049,6 +1049,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 1ce131f29f3b4..376f0f9e19650 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -786,6 +786,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.4 30/40] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:53:30AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 452307c79e4b9..dd4e890cf1b1d 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1048,6 +1048,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 67d9b5a374600..51e2134b32a21 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -734,6 +734,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 4.19 21/26] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:54:29AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Same here. Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 7920b0d7e35a7..1322e799938af 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -954,6 +954,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 35461d49d3aee..59525fe25abde 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -736,6 +736,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.11 51/67] Drivers: hv: vmbus: Resolve race condition in vmbus_onoffer_rescind()
On Wed, Feb 24, 2021 at 07:50:09AM -0500, Sasha Levin wrote: > From: "Andrea Parri (Microsoft)" > > [ Upstream commit e4d221b42354b2e2ddb9187a806afb651eee2cda ] > > An erroneous or malicious host could send multiple rescind messages for > a same channel. In vmbus_onoffer_rescind(), the guest maps the channel > ID to obtain a pointer to the channel object and it eventually releases > such object and associated data. The host could time rescind messages > and lead to an use-after-free. Add a new flag to the channel structure > to make sure that only one instance of vmbus_onoffer_rescind() can get > the reference to the channel object. > > Reported-by: Juan Vazquez > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Link: https://lore.kernel.org/r/20201209070827.29335-6-parri.and...@gmail.com > Signed-off-by: Wei Liu > Signed-off-by: Sasha Levin Sasha - This patch is one of a group of patches where a Linux guest running on Hyper-V will start assuming that hypervisor behavior might be malicious, and guards against such behavior. Because this is a new assumption, these patches are more properly treated as new functionality rather than as bug fixes. So I would propose that we *not* bring such patches back to stable branches. Thanks, Andrea > --- > drivers/hv/channel_mgmt.c | 12 > include/linux/hyperv.h| 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 1d44bb635bb84..a9f58840f85dc 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -1049,6 +1049,18 @@ static void vmbus_onoffer_rescind(struct > vmbus_channel_message_header *hdr) > > mutex_lock(_connection.channel_mutex); > channel = relid2channel(rescind->child_relid); > + if (channel != NULL) { > + /* > + * Guarantee that no other instance of vmbus_onoffer_rescind() > + * has got a reference to the channel object. Synchronize on > + * _connection.channel_mutex. 
> + */ > + if (channel->rescind_ref) { > + mutex_unlock(_connection.channel_mutex); > + return; > + } > + channel->rescind_ref = true; > + } > mutex_unlock(_connection.channel_mutex); > > if (channel == NULL) { > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index 5ddb479c4d4cb..ef3573e99d989 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -803,6 +803,7 @@ struct vmbus_channel { > u8 monitor_bit; > > bool rescind; /* got rescind msg */ > + bool rescind_ref; /* got rescind msg, got channel reference */ > struct completion rescind_event; > > u32 ringbuffer_gpadlhandle; > -- > 2.27.0 >
Re: [PATCH AUTOSEL 5.11 50/67] Drivers: hv: vmbus: Initialize memory to be sent to the host
On Wed, Feb 24, 2021 at 07:50:08AM -0500, Sasha Levin wrote:
> From: "Andrea Parri (Microsoft)"
>
> [ Upstream commit e99c4afbee07e9323e9191a20b24d74dbf815bdf ]
>
> __vmbus_open() and vmbus_teardown_gpadl() do not initialize the memory
> for the vmbus_channel_open_channel and the vmbus_channel_gpadl_teardown
> objects they allocate respectively. These objects contain padding bytes
> and fields that are left uninitialized and that are later sent to the
> host, potentially leaking guest data. Zero initialize such fields to
> avoid leaking sensitive information to the host.
>
> Reported-by: Juan Vazquez
> Signed-off-by: Andrea Parri (Microsoft)
> Reviewed-by: Michael Kelley
> Link: https://lore.kernel.org/r/20201209070827.29335-2-parri.and...@gmail.com
> Signed-off-by: Wei Liu
> Signed-off-by: Sasha Levin

Sasha - This patch is one of a group of patches where a Linux guest
running on Hyper-V will start assuming that hypervisor behavior might
be malicious, and guards against such behavior. Because this is a new
assumption, these patches are more properly treated as new
functionality rather than as bug fixes. So I would propose that we
*not* bring such patches back to stable branches.
Thanks, Andrea > --- > drivers/hv/channel.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c > index 6fb0c76bfbf81..0bd202de79600 100644 > --- a/drivers/hv/channel.c > +++ b/drivers/hv/channel.c > @@ -618,7 +618,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel, > goto error_clean_ring; > > /* Create and init the channel open message */ > - open_info = kmalloc(sizeof(*open_info) + > + open_info = kzalloc(sizeof(*open_info) + > sizeof(struct vmbus_channel_open_channel), > GFP_KERNEL); > if (!open_info) { > @@ -745,7 +745,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, > u32 gpadl_handle) > unsigned long flags; > int ret; > > - info = kmalloc(sizeof(*info) + > + info = kzalloc(sizeof(*info) + > sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL); > if (!info) > return -ENOMEM; > -- > 2.27.0 >
[PATCH 2/2] arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
Now the ARM global timer driver can work even if its source clock rate
changes, so we don't need to disable that driver when CPU frequency
scaling is in use. This causes the Zynq arch to get support for timer
delay and get_cycles().

Signed-off-by: Andrea Merello
Cc: Patrice Chotard
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: Michal Simek
Cc: Sören Brinkmann
---
 arch/arm/mach-zynq/Kconfig  | 2 +-
 drivers/clocksource/Kconfig | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mach-zynq/Kconfig b/arch/arm/mach-zynq/Kconfig
index 43fb941dcd07..a56748d671c4 100644
--- a/arch/arm/mach-zynq/Kconfig
+++ b/arch/arm/mach-zynq/Kconfig
@@ -6,7 +6,7 @@ config ARCH_ZYNQ
 	select ARCH_SUPPORTS_BIG_ENDIAN
 	select ARM_AMBA
 	select ARM_GIC
-	select ARM_GLOBAL_TIMER if !CPU_FREQ
+	select ARM_GLOBAL_TIMER
 	select CADENCE_TTC_TIMER
 	select HAVE_ARM_SCU if SMP
 	select HAVE_ARM_TWD if SMP
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 0f1c625275a0..421cd763bdb2 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -387,6 +387,7 @@ config ARM_GLOBAL_TIMER
 
 config ARM_GT_INITIAL_PRESCALER_VAL
 	int "ARM global timer initial prescaler value"
+	default 2 if ARCH_ZYNQ
 	default 1
 	depends on ARM_GLOBAL_TIMER
 	help
-- 
2.17.1
[PATCH 0/2] Fix missing entropy on Zynq arch due to get_cycles() not supported
A real-world problem has been seen with a Zynq-based board running
Debian 10, where the ssh daemon takes a really long time to come up at
boot. This is due to a lack of random numbers. Since commit
50ee7529ec450 ("random: try to actively add entropy rather than
passively wait for it") we try to generate entropy whenever we are
short of random numbers and someone needs them. This trick works only
when a CPU cycle counter is available. On ARM this means that
get_cycles() works and, in turn, read_current_timer() works.

Zynq HW includes two "cadence TTC" timers and the "ARM global timer".
All these pieces of HW are fed by the CPU clock, which dynamically
changes whenever CPU frequency scaling is enabled. In the
timer-cadence-ttc driver this scenario is handled by looking at parent
clock changes and adjusting things when required. This is the only
usable clocksource when CPU frequency scaling is in use; the
arm_global_timer driver is disabled in Kconfig when CPU_FREQ is
enabled for the Zynq arch.

Unfortunately the timer-cadence-ttc driver does not register itself
via register_current_timer_delay(), and that ultimately ends up in
get_cycles() always returning zero, causing the aforementioned lack of
entropy problem. I believe that the reason for this is that the
Cadence TTC counter on Zynq silicon is only 16-bit wide.

This patchset works around this by implementing in the ARM global
timer driver a mechanism to compensate for parent clock variations,
similarly to what is done in the Cadence TTC timer driver, so that it
can be used together with CPU frequency scaling on the Zynq arch. This
proved to finally fix the problem on my Zynq-based Z-turn board.
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann Andrea Merello (2): clocksource: arm_global_timer: implement rate compensation whenever source clock changes arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore arch/arm/mach-zynq/Kconfig | 2 +- drivers/clocksource/Kconfig| 14 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 3 files changed, 127 insertions(+), 11 deletions(-) -- 2.17.1
[PATCH 1/2] clocksource: arm_global_timer: implement rate compensation whenever source clock changes
This patch adds rate change notification support for the parent clock;
should that clock change, we try to adjust our prescaler in order to
compensate (i.e. we adjust to still get the same timer frequency).

This is loosely based on what is done in timer-cadence-ttc.
timer-sun5i, mips-gic-timer and smp_twd.c also seem to look at their
parent clock rate and to perform some kind of adjustment whenever
needed. In this particular case we have one single counter and
prescaler for all of clocksource, clockevent and timer_delay, and we
just update it for all (i.e. we don't let the rate change and then
call clockevents_update_freq() to notify the kernel that our rate has
changed).

Note that there is apparently no other way to fix things up, because
once we call register_current_timer_delay(), specifying the timer
rate, it seems that that rate is not supposed to ever change.

In order for this mechanism to work, we have to make assumptions about
how much the parent clock may eventually decrease with respect to the
initial rate, and set our initial prescaler to a value that we can
eventually decrease enough to compensate. We provide an option in
Kconfig for this. In case we end up in a situation in which we are not
able to compensate for the parent clock change, we fail, returning
NOTIFY_BAD.

This fixes a real-world problem with the Zynq arch not being able to
use this driver and CPU_FREQ at the same time (because the ARM global
timer is fed by the CPU clock, which may keep changing when CPU_FREQ
is enabled).
Signed-off-by: Andrea Merello Cc: Patrice Chotard Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: Michal Simek Cc: Sören Brinkmann --- drivers/clocksource/Kconfig| 13 +++ drivers/clocksource/arm_global_timer.c | 122 +++-- 2 files changed, 125 insertions(+), 10 deletions(-) diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 14c7c4712478..0f1c625275a0 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -385,6 +385,19 @@ config ARM_GLOBAL_TIMER help This option enables support for the ARM global timer unit. +config ARM_GT_INITIAL_PRESCALER_VAL + int "ARM global timer initial prescaler value" + default 1 + depends on ARM_GLOBAL_TIMER + help + When the ARM global timer initializes, its current rate is declared + to the kernel and maintained forever. Should it's parent clock + change, the driver tries to fix the timer's internal prescaler. + On some machs (i.e. Zynq) the initial prescaler value thus poses + bounds about how much the parent clock is allowed to decrease or + increase wrt the initial clock value. + This affects CPU_FREQ max delta from the initial frequency. 
+ config ARM_TIMER_SP804 bool "Support for Dual Timer SP804 module" if COMPILE_TEST depends on GENERIC_SCHED_CLOCK && CLKDEV_LOOKUP diff --git a/drivers/clocksource/arm_global_timer.c b/drivers/clocksource/arm_global_timer.c index 88b2d38a7a61..60a8047fd32e 100644 --- a/drivers/clocksource/arm_global_timer.c +++ b/drivers/clocksource/arm_global_timer.c @@ -31,6 +31,10 @@ #define GT_CONTROL_COMP_ENABLE BIT(1) /* banked */ #define GT_CONTROL_IRQ_ENABLE BIT(2) /* banked */ #define GT_CONTROL_AUTO_INCBIT(3) /* banked */ +#define GT_CONTROL_PRESCALER_SHIFT 8 +#define GT_CONTROL_PRESCALER_MAX0xF +#define GT_CONTROL_PRESCALER_MASK (GT_CONTROL_PRESCALER_MAX << \ +GT_CONTROL_PRESCALER_SHIFT) #define GT_INT_STATUS 0x0c #define GT_INT_STATUS_EVENT_FLAG BIT(0) @@ -39,6 +43,7 @@ #define GT_COMP1 0x14 #define GT_AUTO_INC0x18 +#define MAX_F_ERR 50 /* * We are expecting to be clocked by the ARM peripheral clock. * @@ -46,7 +51,8 @@ * the units for all operations. */ static void __iomem *gt_base; -static unsigned long gt_clk_rate; +struct notifier_block gt_clk_rate_change_nb; +static u32 gt_psv_new, gt_psv_bck, gt_target_rate; static int gt_ppi; static struct clock_event_device __percpu *gt_evt; @@ -96,7 +102,10 @@ static void gt_compare_set(unsigned long delta, int periodic) unsigned long ctrl; counter += delta; - ctrl = GT_CONTROL_TIMER_ENABLE; + ctrl = readl(gt_base + GT_CONTROL); + ctrl &= ~(GT_CONTROL_COMP_ENABLE | GT_CONTROL_IRQ_ENABLE | + GT_CONTROL_AUTO_INC | GT_CONTROL_AUTO_INC); + ctrl |= GT_CONTROL_TIMER_ENABLE; writel_relaxed(ctrl, gt_base + GT_CONTROL); writel_relaxed(lower_32_bits(counter), gt_base + GT_COMP0); writel_relaxed(upper_32_bits(counter), gt_base + GT_COMP1); @@ -123,7 +132,7 @@ static int gt_clockevent_shutdown(struct clock_event_device *evt) static int gt_clockevent_set_periodic(struct clock_event_device *evt) { - gt_compare_set
Regressions with VMBus/VSCs hardening changes
Hi all, I'm reporting two regressions following certain VMBus/VSCs hardening changes we've been discussing 'recently', unfortunately the first regression already touched/affects mainline while the second one is in hyperv-next: 1) [mainline] The first regression manifests with the following message (several): hv_vmbus: No request id available I could reliably reproduce such message/behavior by running the command: fio --name=seqwrite --rw=read --direct=1 --ioengine=libaio --bs=32k --numjobs=4 --size=2G --runtime=60 (the message is triggered when files are being created). I've bisected this regression to commit: 453de21c2b8281 ("scsi: storvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening") 2) [hyperv-next] The second regression manifests with various messages including: hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Unknown nvsp packet type received 51966 hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: unhandled packet type 0, tid 0 hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Incorrect transaction id hv_netvsc 9c5f5000-0499-4b18-b2eb-a8d5c57c8774 eth0: Invalid rndis_msg (buflen: 262, msg_len: 1728) The connection was then typically lost/reset by the peer. I could reproduce such behavior/messages by running the test: ntttcp -r -m 8,*, # receiver ntttcp -s -m 8,*, -ns -t 60 # sender I bisected this regression to commit: a8c3209998afb5 ("Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer") --- I am investigating but don't have fixes for these regressions now: given the 'timing' (-rc7 with the next merge window at the door...) 
I would propose to revert/drop the affected changes:

1) 453de21c2b8281 is part of the so-called 'vmbus_requestor' series
   that was applied during the merge window for 5.11:

     e8b7db38449ac5 ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening")
     453de21c2b8281 ("scsi: storvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")
     4d18fcc95f5095 ("hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")

   I could prepare/submit patches to revert such commits (asap but
   likely not before tomorrow/late Saturday - EU time).

2) IIUC a8c3209998afb5 could be dropped (after rebase) without further
   modifications to hyperv-next.

Other suggestions/thoughts?

Thanks,
  Andrea
[PATCH net-next v2] seg6: fool-proof the processing of SRv6 behavior attributes
The set of required attributes for a given SRv6 behavior is identified
using a bitmap stored in an unsigned long, since the initial design of
SRv6 networking in Linux. Recently the same approach has been used for
identifying the optional attributes.

However, the number of attributes supported by SRv6 behaviors depends
on the size of the unsigned long type, which changes with the
architecture. Indeed, on a 64-bit architecture, an SRv6 behavior can
support up to 64 attributes while on a 32-bit architecture it can
support at most 32 attributes.

To fool-proof the processing of SRv6 behaviors we verify, at compile
time, that the set of all supported SRv6 attributes can be encoded
into a bitmap stored in an unsigned long. Otherwise, the kernel build
fails, forcing developers to reconsider adding a new attribute or to
extend the total number of attributes supported by SRv6 behaviors.

Moreover, we replace all patterns (1 << i) with the macro
SEG6_F_ATTR(i) in order to address potential overflow issues caused by
32-bit signed arithmetic.

Thanks to Colin Ian King for catching the overflow problem, providing
a solution and inspiring this patch. Thanks to Jakub Kicinski for his
useful suggestions during the design of this patch.

v2:
 - remove SEG6_LOCAL_MAX_SUPP, which is not strictly needed: it can be
   derived from the unsigned long type. Thanks to David Ahern for
   pointing it out.
Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 67 +-- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index b07f7c1c82a4..c2a0c78e84d4 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -31,6 +31,8 @@ #include #include +#define SEG6_F_ATTR(i) BIT(i) + struct seg6_local_lwt; /* callbacks used for customizing the creation and destruction of a behavior */ @@ -660,8 +662,8 @@ seg6_end_dt_mode seg6_end_dt6_parse_mode(struct seg6_local_lwt *slwt) unsigned long parsed_optattrs = slwt->parsed_optattrs; bool legacy, vrfmode; - legacy = !!(parsed_optattrs & (1 << SEG6_LOCAL_TABLE)); - vrfmode = !!(parsed_optattrs & (1 << SEG6_LOCAL_VRFTABLE)); + legacy = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_TABLE)); + vrfmode = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE)); if (!(legacy ^ vrfmode)) /* both are absent or present: invalid DT6 mode */ @@ -883,32 +885,32 @@ static struct seg6_action_desc seg6_action_table[] = { }, { .action = SEG6_LOCAL_ACTION_END_X, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, - .attrs = (1 << SEG6_LOCAL_OIF), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, - .attrs = (1 << SEG6_LOCAL_NH4), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, - .attrs = (1 << SEG6_LOCAL_VRFTABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -920,30 +922,30 @@ static struct seg6_action_desc 
seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONFIG_NET_L3_MASTER_DEV .attrs = 0, - .optattrs = (1 << SEG6_LOCAL_TABLE) | - (1 << SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE) | + SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), .slwt_ops = { .build_state = seg6_end_dt6_build, }, #else - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SE
Re: [PATCH net-next] seg6: fool-proof the processing of SRv6 behavior attributes
Hi David, thanks for your time. On Wed, 3 Feb 2021 08:59:40 -0700 David Ahern wrote: > On 2/2/21 11:56 AM, Andrea Mayer wrote: > > diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c > > index b07f7c1c82a4..7cc50d506902 100644 > > --- a/net/ipv6/seg6_local.c > > +++ b/net/ipv6/seg6_local.c > > @@ -31,6 +31,9 @@ > > #include > > #include > > > > +#define SEG6_F_ATTR(i) BIT(i) > > +#define SEG6_LOCAL_MAX_SUPP32 > > + > > SEG6_LOCAL_MAX_SUPP should not be needed; it can be derived from the type: > Yes, we can avoid the SEG6_LOCAL_MAX_SUPP. Using the BITS_PER_TYPE macro seems better to me and it avoids nailing the value 32 as the limit. At compile time, the kernel build will fail if the total number of SRv6 attributes will exceed the number of bits available in the unsigned long bitmap (whatever is the reference architecture). Therefore, I'm going to follow your suggestion! > BUILD_BUG_ON(BITS_PER_TYPE(unsigned long) > SEG6_LOCAL_MAX) > I think there is an issue here because BITS_PER_TYPE(unsigned long) is greater than the SEG6_LOCAL_MAX (currently = 9). I think it should be like this: BUILD_BUG_ON(SEG6_LOCAL_MAX + 1 > BITS_PER_TYPE(unsigned long)) I will send a v2 with the changes discussed so far. Thank you, Andrea
[PATCH net-next 2/2] hv_netvsc: Load and store the proper (NBL_HASH_INFO) per-packet info
Fix the typo. Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Fixes: 0ba35fe91ce34f ("hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer") --- drivers/net/hyperv/rndis_filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 6c48a4d627368..0c2ebe7ac6554 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -465,7 +465,7 @@ void rsc_add_data(struct netvsc_channel *nvchan, } nvchan->rsc.pktlen = len; if (hash_info != NULL) { - nvchan->rsc.csum_info = *csum_info; + nvchan->rsc.hash_info = *hash_info; nvchan->rsc.ppi_flags |= NVSC_RSC_HASH_INFO; } else { nvchan->rsc.ppi_flags &= ~NVSC_RSC_HASH_INFO; -- 2.25.1
[PATCH net] hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive()
Commit 44144185951a0f ("hv_netvsc: Add validation for untrusted Hyper-V values") added validation to rndis_filter_receive_data() (and rndis_filter_receive()) which introduced NVSP_STAT_FAIL-scenarios where the count is not updated/reset. Fix this omission, and prevent similar scenarios from occurring in the future. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Fixes: 44144185951a0f ("hv_netvsc: Add validation for untrusted Hyper-V values") --- drivers/net/hyperv/netvsc.c | 5 - drivers/net/hyperv/rndis_filter.c | 2 -- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 2350342b961ff..13bd48a75db76 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1262,8 +1262,11 @@ static int netvsc_receive(struct net_device *ndev, ret = rndis_filter_receive(ndev, net_device, nvchan, data, buflen); - if (unlikely(ret != NVSP_STAT_SUCCESS)) + if (unlikely(ret != NVSP_STAT_SUCCESS)) { + /* Drop incomplete packet */ + nvchan->rsc.cnt = 0; status = NVSP_STAT_FAIL; + } } enq_receive_complete(ndev, net_device, q_idx, diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 598713c0d5a87..3aab2b867fc0d 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -509,8 +509,6 @@ static int rndis_filter_receive_data(struct net_device *ndev, return ret; drop: - /* Drop incomplete packet */ - nvchan->rsc.cnt = 0; return NVSP_STAT_FAIL; } -- 2.25.1
[PATCH net-next 1/2] hv_netvsc: Allocate the recv_buf buffers after NVSP_MSG1_TYPE_SEND_RECV_BUF
The recv_buf buffers are allocated in netvsc_device_add(). Later in
netvsc_init_buf() the response to NVSP_MSG1_TYPE_SEND_RECV_BUF allows
the host to set up a recv_section_size that could be bigger than the
(default) value used for that allocation. The host-controlled value
could be used by a malicious host to bypass the check on the packet's
length in netvsc_receive() and hence to overflow the recv_buf buffer.

Move the allocation of the recv_buf buffers into netvsc_init_buf().

Reported-by: Juan Vazquez
Signed-off-by: Andrea Parri (Microsoft)
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: net...@vger.kernel.org
Fixes: 0ba35fe91ce34f ("hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer")
---
 drivers/net/hyperv/netvsc.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0fba8257fc119..9db1ea3affbb3 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -311,7 +311,7 @@ static int netvsc_init_buf(struct hv_device *device,
 	struct nvsp_message *init_packet;
 	unsigned int buf_size;
 	size_t map_words;
-	int ret = 0;
+	int i, ret = 0;
 
 	/* Get receive buffer area. */
 	buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -405,6 +405,16 @@ static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 	}
 
+	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
+		struct netvsc_channel *nvchan = &net_device->chan_table[i];
+
+		nvchan->recv_buf = kzalloc(net_device->recv_section_size, GFP_KERNEL);
+		if (nvchan->recv_buf == NULL) {
+			ret = -ENOMEM;
+			goto cleanup;
+		}
+	}
+
 	/* Setup receive completion ring.
 	 * Add 1 to the recv_section_cnt because at least one entry in a
 	 * ring buffer has to be empty.
@@ -1549,12 +1559,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device,
 	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
 		struct netvsc_channel *nvchan = &net_device->chan_table[i];
 
-		nvchan->recv_buf = kzalloc(device_info->recv_section_size, GFP_KERNEL);
-		if (nvchan->recv_buf == NULL) {
-			ret = -ENOMEM;
-			goto cleanup2;
-		}
-
 		nvchan->channel = device->channel;
 		nvchan->net_device = net_device;
 		u64_stats_init(&nvchan->tx_stats.syncp);
-- 
2.25.1
[PATCH net-next 0/2] Amend "hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer"
Patch #2 also addresses the Smatch complaint reported here: https://lkml.kernel.org/r/YBp2oVIdMe+G%2FliJ@mwanda/ Thanks, Andrea Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (2): hv_netvsc: Allocate the recv_buf buffers after NVSP_MSG1_TYPE_SEND_RECV_BUF hv_netvsc: Load and store the proper (NBL_HASH_INFO) per-packet info drivers/net/hyperv/netvsc.c | 18 +++--- drivers/net/hyperv/rndis_filter.c | 2 +- 2 files changed, 12 insertions(+), 8 deletions(-) -- 2.25.1
Re: [PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
On Tue, Feb 02, 2021 at 11:45:49AM -0800, Jakub Kicinski wrote: > On Tue, 2 Feb 2021 09:18:43 +0100 Andrea Parri wrote: > > Hi net maintainers, > > > > > > On Sat, Jan 30, 2021 at 12:50:06AM +, > > patchwork-bot+netdev...@kernel.org wrote: > > > Hello: > > > > > > This patch was applied to netdev/net-next.git (refs/heads/master): > > > > > > On Tue, 26 Jan 2021 17:29:07 +0100 you wrote: > > > > Pointers to receive-buffer packets sent by Hyper-V are used within the > > > > guest VM. Hyper-V can send packets with erroneous values or modify > > > > packet fields after they are processed by the guest. To defend against > > > > these scenarios, copy (sections of) the incoming packet after validating > > > > their length and offset fields in netvsc_filter_receive(). In this way, > > > > the packet can no longer be modified by the host. > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [v2,net-next] hv_netvsc: Copy packets sent by Hyper-V out of the > > > receive buffer > > > https://git.kernel.org/netdev/net-next/c/0ba35fe91ce3 > > > > I'd have some fixes on top of this and I'm wondering about the process: > > would > > you consider fixes/patches on top of this commit now? > > Fixes for bugs present in Linus's tree? > > You need to target the net tree, and give us instructions on how to > resolve the conflict which will arise from merging net into net-next. > > > would you rather prefer me to squash these fixes into a v3? other? > > Networking trees are immutable, and v2 was already applied. We could > do a revert, apply fix, apply v3, but we prefer to just handle the > merge conflict. Thanks for the clarification, Jakub. And sorry for the confusion; let me just send out the 'fixes'/patches (I have one targeting the net tree and two targeting the net-next tree, with no conflict between them), so that they can be reviewed and we can agree /discuss any further steps. Thanks, Andrea
[PATCH net-next] seg6: fool-proof the processing of SRv6 behavior attributes
The set of required attributes for a given SRv6 behavior is identified using a bitmap stored in an unsigned long, since the initial design of SRv6 networking in Linux. Recently the same approach has been used for identifying the optional attributes. However, the number of attributes supported by SRv6 behaviors depends on the size of the unsigned long type which changes with the architecture. Indeed, on a 64-bit architecture, an SRv6 behavior can support up to 64 attributes while on a 32-bit architecture it can support at most 32 attributes. To fool-proof the processing of SRv6 behaviors we will support at most 32 attributes independently from the reference architecture. Consequently, this patch aims to: - verify, at compile time, that the total number of attributes does not exceed the fixed value of 32. Otherwise, kernel build fails forcing developers to reconsider adding a new attribute or extend the total number of supported attributes by the SRv6 behaviors. - replace all patterns (1 << i) with the macro SEG6_F_ATTR(i) in order to address potential overflow issues caused by 32-bit signed arithmetic. Thanks to Colin Ian King for catching the overflow problem, providing a solution and inspiring this patch. Thanks to Jakub Kicinski for his useful suggestions during the design of this patch. 
Signed-off-by: Andrea Mayer --- net/ipv6/seg6_local.c | 68 +-- 1 file changed, 40 insertions(+), 28 deletions(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index b07f7c1c82a4..7cc50d506902 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -31,6 +31,9 @@ #include #include +#define SEG6_F_ATTR(i) BIT(i) +#define SEG6_LOCAL_MAX_SUPP32 + struct seg6_local_lwt; /* callbacks used for customizing the creation and destruction of a behavior */ @@ -660,8 +663,8 @@ seg6_end_dt_mode seg6_end_dt6_parse_mode(struct seg6_local_lwt *slwt) unsigned long parsed_optattrs = slwt->parsed_optattrs; bool legacy, vrfmode; - legacy = !!(parsed_optattrs & (1 << SEG6_LOCAL_TABLE)); - vrfmode = !!(parsed_optattrs & (1 << SEG6_LOCAL_VRFTABLE)); + legacy = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_TABLE)); + vrfmode = !!(parsed_optattrs & SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE)); if (!(legacy ^ vrfmode)) /* both are absent or present: invalid DT6 mode */ @@ -883,32 +886,32 @@ static struct seg6_action_desc seg6_action_table[] = { }, { .action = SEG6_LOCAL_ACTION_END_X, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_x, }, { .action = SEG6_LOCAL_ACTION_END_T, - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), .input = input_action_end_t, }, { .action = SEG6_LOCAL_ACTION_END_DX2, - .attrs = (1 << SEG6_LOCAL_OIF), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_OIF), .input = input_action_end_dx2, }, { .action = SEG6_LOCAL_ACTION_END_DX6, - .attrs = (1 << SEG6_LOCAL_NH6), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH6), .input = input_action_end_dx6, }, { .action = SEG6_LOCAL_ACTION_END_DX4, - .attrs = (1 << SEG6_LOCAL_NH4), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_NH4), .input = input_action_end_dx4, }, { .action = SEG6_LOCAL_ACTION_END_DT4, - .attrs = (1 << SEG6_LOCAL_VRFTABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), #ifdef CONFIG_NET_L3_MASTER_DEV .input = input_action_end_dt4, .slwt_ops = { @@ -920,30 +923,30 @@ 
static struct seg6_action_desc seg6_action_table[] = { .action = SEG6_LOCAL_ACTION_END_DT6, #ifdef CONFIG_NET_L3_MASTER_DEV .attrs = 0, - .optattrs = (1 << SEG6_LOCAL_TABLE) | - (1 << SEG6_LOCAL_VRFTABLE), + .optattrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE) | + SEG6_F_ATTR(SEG6_LOCAL_VRFTABLE), .slwt_ops = { .build_state = seg6_end_dt6_build, }, #else - .attrs = (1 << SEG6_LOCAL_TABLE), + .attrs = SEG6_F_ATTR(SEG6_LOCAL_TABLE), #endif
Re: [PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Hi net maintainers, On Sat, Jan 30, 2021 at 12:50:06AM +, patchwork-bot+netdev...@kernel.org wrote: > Hello: > > This patch was applied to netdev/net-next.git (refs/heads/master): > > On Tue, 26 Jan 2021 17:29:07 +0100 you wrote: > > Pointers to receive-buffer packets sent by Hyper-V are used within the > > guest VM. Hyper-V can send packets with erroneous values or modify > > packet fields after they are processed by the guest. To defend against > > these scenarios, copy (sections of) the incoming packet after validating > > their length and offset fields in netvsc_filter_receive(). In this way, > > the packet can no longer be modified by the host. > > > > [...] > > Here is the summary with links: > - [v2,net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive > buffer > https://git.kernel.org/netdev/net-next/c/0ba35fe91ce3 I'd have some fixes on top of this and I'm wondering about the process: would you consider fixes/patches on top of this commit now? would you rather prefer me to squash these fixes into a v3? other? Thanks, Andrea
[PATCH v3 hyperv-next 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
Only the VSCs or ICs that have been hardened and that are critical for the successful adoption of Confidential VMs should be allowed if the guest is running isolated. This change reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel_mgmt.c | 38 ++ include/linux/hyperv.h| 1 + 2 files changed, 39 insertions(+) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 68950a1e4b638..f0ed730e2e4e4 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -31,101 +31,118 @@ const struct vmbus_device vmbus_devs[] = { { .dev_type = HV_IDE, HV_IDE_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* SCSI */ { .dev_type = HV_SCSI, HV_SCSI_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Fibre Channel */ { .dev_type = HV_FC, HV_SYNTHFC_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* Synthetic NIC */ { .dev_type = HV_NIC, HV_NIC_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Network Direct */ { .dev_type = HV_ND, HV_ND_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* PCIE */ { .dev_type = HV_PCIE, HV_PCIE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Frame Buffer */ { .dev_type = HV_FB, HV_SYNTHVID_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Keyboard */ { .dev_type = HV_KBD, HV_KBD_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic MOUSE */ { .dev_type = HV_MOUSE, HV_MOUSE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* KVP */ { .dev_type = HV_KVP, HV_KVP_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Time Synch */ { .dev_type = HV_TS, HV_TS_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* Heartbeat */ { .dev_type = HV_HB, HV_HEART_BEAT_GUID, .perf_device = false, + .allowed_in_isolated = true, }, 
 	/* Shutdown */
 	{ .dev_type = HV_SHUTDOWN,
 	  HV_SHUTDOWN_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = true,
 	},

 	/* File copy */
 	{ .dev_type = HV_FCOPY,
 	  HV_FCOPY_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Backup */
 	{ .dev_type = HV_BACKUP,
 	  HV_VSS_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Dynamic Memory */
 	{ .dev_type = HV_DM,
 	  HV_DM_GUID,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},

 	/* Unknown GUID */
 	{ .dev_type = HV_UNKNOWN,
 	  .perf_device = false,
+	  .allowed_in_isolated = false,
 	},
 };

@@ -903,6 +920,20 @@ find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer)
 	return channel;
 }

+static bool vmbus_is_valid_device(const guid_t *guid)
+{
+	u16 i;
+
+	if (!hv_is_isolation_supported())
+		return true;
+
+	for (i = 0; i < ARRAY_SIZE(vmbus_devs); i++) {
+		if (guid_equal(guid, &vmbus_devs[i].guid))
+			return vmbus_devs[i].allowed_in_isolated;
+	}
+	return false;
+}
+
 /*
  * vmbus_onoffer - Handler for channel offers from vmbus in parent partition.
  *
@@ -917,6 +948,13 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)

 	trace_vmbus_onoffer(offer);

+	if (!vmbus_is_valid_device(&offer->offer.if_type)) {
+		pr_err_ratelimited("Invalid offer %d from the host supporting isolation\n",
+				   offer->child_relid);
+		atomic_dec(&vmbus_connection.offer_in_progress);
+		return;
+	}
+
 	oldchannel = find_primary_channel_by_offer(offer);

 	if (oldchannel != NULL) {
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index f0d48a368f131..e3426f8c12db9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -789,6 +789,7 @@ struct vmbus_device {
 	u16  dev_type;
 	guid_t guid;
 	bool perf_device;
+	bool allowed_in_isolated;
 };

 #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
--
2.25.1
[PATCH v3 hyperv-next 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..ccdfc6868cfc8 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.features_b & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..ea7bd8dff171c 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.features_b = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.features_b & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- a/include/asm-generic/hyperv-tlfs.h +++ 
b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c57799684170c..dff58a3db5d5c 100644
[PATCH v3 hyperv-next 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host
to be 5.2 or greater if the guest is running isolated. This reduces the
footprint of the code that will be exercised by Confidential VMs and
hence the exposure to bugs and vulnerabilities.

Signed-off-by: Andrea Parri (Microsoft)
---
 drivers/hv/connection.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 11170d9a2e1a5..c83612cddb995 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -244,6 +244,13 @@ int vmbus_connect(void)
 		break;
 	}

+	if (hv_is_isolation_supported() && version < VERSION_WIN10_V5_2) {
+		pr_err("Invalid VMBus version %d.%d (expected >= %d.%d) from the host supporting isolation\n",
+		       version >> 16, version & 0xFFFF,
+		       VERSION_WIN10_V5_2 >> 16, VERSION_WIN10_V5_2 & 0xFFFF);
+		ret = -EINVAL;
+		goto cleanup;
+	}
+
 	vmbus_proto_version = version;
 	pr_info("Vmbus version:%d.%d\n",
 		version >> 16, version & 0xFFFF);
--
2.25.1
[PATCH v3 hyperv-next 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Acked-by: Jakub Kicinski Reviewed-by: Haiyang Zhang Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..51005f2d4a821 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,10 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (hv_is_isolation_supported()) + netdev_info(ndev, "SR-IOV not advertised by guests on the host supporting isolation\n"); + else + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -591,6 +595,13 @@ static int netvsc_connect_vsp(struct hv_device *device, goto cleanup; } + if (hv_is_isolation_supported() && net_device->nvsp_version < NVSP_PROTOCOL_VERSION_61) { + netdev_err(ndev, "Invalid NVSP version 0x%x (expected >= 0x%x) from the host supporting isolation\n", + net_device->nvsp_version, NVSP_PROTOCOL_VERSION_61); + ret = -EPROTO; + goto cleanup; + } + pr_debug("Negotiated NVSP version:%x\n", 
net_device->nvsp_version); /* Send the ndis version */ @@ -1357,7 +1368,10 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (hv_is_isolation_supported()) + netdev_err(ndev, "Ignore VF_ASSOCIATION msg from the host supporting isolation\n"); + else + netvsc_send_vf(ndev, nvmsg, msglen); break; } } -- 2.25.1
[PATCH v3 hyperv-next 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Changes since v2 [1]: - improve/add logging (Michael Kelley) - rename 'hypercalls_features' to 'features_b' (Michael Kelley) - move VMBus and NVSC version checks after 'for' loop (Michael Kelley) - remove/inline helper functions (Michael Kelley) - other minor changes (Michael Kelley) Changes since v1 [2]: - improve/add logging (Haiyang Zhang) - move NVSC version check after version negotiation (Haiyang Zhang) [1] https://lkml.kernel.org/r/20210126115641.2527-1-parri.and...@gmail.com [1] https://lkml.kernel.org/r/20210119175841.22248-1-parri.and...@gmail.com Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 arch/x86/include/asm/hyperv-tlfs.h | 15 arch/x86/kernel/cpu/mshyperv.c | 9 +++ drivers/hv/channel_mgmt.c | 38 ++ drivers/hv/connection.c| 7 ++ drivers/net/hyperv/netvsc.c| 18 -- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 include/linux/hyperv.h | 1 + 9 files changed, 107 insertions(+), 2 deletions(-) -- 2.25.1
[PATCH v2 net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Pointers to receive-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, copy (sections of) the incoming packet after validating their length and offset fields in netvsc_filter_receive(). In this way, the packet can no longer be modified by the host. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- Changes since v1 [1]: - copy certain PPIs into the RSC pkt [1] https://lkml.kernel.org/r/20210126113847.1676-1-parri.and...@gmail.com drivers/net/hyperv/hyperv_net.h | 93 +++-- drivers/net/hyperv/netvsc.c | 20 +++ drivers/net/hyperv/netvsc_drv.c | 24 drivers/net/hyperv/rndis_filter.c | 99 +-- 4 files changed, 150 insertions(+), 86 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 2a87cfa27ac02..e1a497d3c9ba4 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -105,9 +105,43 @@ struct ndis_recv_scale_param { /* NDIS_RECEIVE_SCALE_PARAMETERS */ u32 processor_masks_entry_size; }; -/* Fwd declaration */ -struct ndis_tcp_ip_checksum_info; -struct ndis_pkt_8021q_info; +struct ndis_tcp_ip_checksum_info { + union { + struct { + u32 is_ipv4:1; + u32 is_ipv6:1; + u32 tcp_checksum:1; + u32 udp_checksum:1; + u32 ip_header_checksum:1; + u32 reserved:11; + u32 tcp_header_offset:10; + } transmit; + struct { + u32 tcp_checksum_failed:1; + u32 udp_checksum_failed:1; + u32 ip_checksum_failed:1; + u32 tcp_checksum_succeeded:1; + u32 udp_checksum_succeeded:1; + u32 ip_checksum_succeeded:1; + u32 loopback:1; + u32 tcp_checksum_value_invalid:1; + u32 ip_checksum_value_invalid:1; + } receive; + u32 value; + }; +}; + +struct ndis_pkt_8021q_info { + union { + struct { + u32 pri:3; /* User Priority */ + u32 cfi:1; /* Canonical Format ID */ + u32 vlanid:12; /* VLAN ID */ + u32 
reserved:16; + }; + u32 value; + }; +}; /* * Represent netvsc packet which contains 1 RNDIS and 1 ethernet frame @@ -194,7 +228,8 @@ int netvsc_send(struct net_device *net, struct sk_buff *skb, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp); + struct rndis_message *resp, + void *data); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); @@ -884,9 +919,10 @@ struct multi_recv_comp { #define NVSP_RSC_MAX 562 /* Max #RSC frags in a vmbus xfer page pkt */ struct nvsc_rsc { - const struct ndis_pkt_8021q_info *vlan; - const struct ndis_tcp_ip_checksum_info *csum_info; - const u32 *hash_info; + struct ndis_pkt_8021q_info vlan; + struct ndis_tcp_ip_checksum_info csum_info; + u32 hash_info; + u8 ppi_flags; /* valid/present bits for the above PPIs */ u8 is_last; /* last RNDIS msg in a vmtransfer_page */ u32 cnt; /* #fragments in an RSC packet */ u32 pktlen; /* Full packet length */ @@ -894,6 +930,10 @@ struct nvsc_rsc { u32 len[NVSP_RSC_MAX]; }; +#define NVSC_RSC_VLAN BIT(0) /* valid/present bit for 'vlan' */ +#define NVSC_RSC_CSUM_INFO BIT(1) /* valid/present bit for 'csum_info' */ +#define NVSC_RSC_HASH_INFO BIT(2) /* valid/present bit for 'hash_info' */ + struct netvsc_stats { u64 packets; u64 bytes; @@ -1002,6 +1042,7 @@ struct net_device_context { struct netvsc_channel { struct vmbus_channel *channel; struct netvsc_device *net_device; + void *recv_buf; /* buffer to copy packets out from the receive buffer */ const struct vmpacket_descriptor *desc; struct napi_struct napi; struct multi_send_data msd; @@ -1234,18 +1275,6 @@ struct rndis_pktinfo_id { u16 pkt_id; }; -struct ndis_pkt_8021q_info { - union { - struct { - u32 pri:3; /* User Priority */ - u32 cfi:1; /* Canonical Format ID */ - u32 vlanid:12; /* VLAN ID */ -
Re: [PATCH net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
On Tue, Jan 26, 2021 at 12:38:47PM +0100, Andrea Parri (Microsoft) wrote:
> Pointers to receive-buffer packets sent by Hyper-V are used within the
> guest VM. Hyper-V can send packets with erroneous values or modify
> packet fields after they are processed by the guest. To defend against
> these scenarios, copy (sections of) the incoming packet after validating
> their length and offset fields in netvsc_filter_receive(). In this way,
> the packet can no longer be modified by the host.
>
> Reported-by: Juan Vazquez
> Signed-off-by: Andrea Parri (Microsoft)
> Cc: "David S. Miller"
> Cc: Jakub Kicinski
> Cc: net...@vger.kernel.org

Please ignore this submission, I'm sending a new version shortly...
Sorry for the hassle.

  Andrea
[PATCH v2 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..dc94e95c57b98 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.hypercalls_features & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..0d4aaf6694d01 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.hypercalls_features = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.hypercalls_features & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: GroupA 0x%x, GroupB 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- 
a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c5779968417
[PATCH v2 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Changes since v1 [1]: - improve/add logging (Haiyang Zhang) - move NVSC version check after version negotiation (Haiyang Zhang) [1] https://lkml.kernel.org/r/20210119175841.22248-1-parri.and...@gmail.com Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 + arch/x86/include/asm/hyperv-tlfs.h | 15 + arch/x86/kernel/cpu/mshyperv.c | 9 drivers/hv/channel_mgmt.c | 36 ++ drivers/hv/connection.c| 13 +++ drivers/net/hyperv/netvsc.c| 27 +++--- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + include/linux/hyperv.h | 1 + 9 files changed, 119 insertions(+), 3 deletions(-) -- 2.25.1
[PATCH v2 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Acked-by: Jakub Kicinski Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..afd92b4aa21fe 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,10 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (!hv_is_isolation_supported()) + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + else + netdev_info(ndev, "SR-IOV not advertised by guests on the host supporting isolation\n"); /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -563,6 +567,13 @@ static int negotiate_nvsp_ver(struct hv_device *device, return ret; } +static bool nvsp_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= NVSP_PROTOCOL_VERSION_61; + return true; +} + static int netvsc_connect_vsp(struct hv_device *device, struct netvsc_device *net_device, const struct netvsc_device_info *device_info) @@ -579,12 +590,19 @@ static int netvsc_connect_vsp(struct hv_device *device, init_packet = 
_device->channel_init_pkt; /* Negotiate the latest NVSP protocol supported */ - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { if (negotiate_nvsp_ver(device, net_device, init_packet, ver_list[i]) == 0) { + if (!nvsp_is_valid_version(ver_list[i])) { + netdev_err(ndev, "Invalid NVSP version 0x%x (expected >= 0x%x) from the host with isolation supported\n", + ver_list[i], NVSP_PROTOCOL_VERSION_61); + ret = -EPROTO; + goto cleanup; + } net_device->nvsp_version = ver_list[i]; break; } + } if (i < 0) { ret = -EPROTO; @@ -1357,7 +1375,10 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (!hv_is_isolation_supported()) + netvsc_send_vf(ndev, nvmsg, msglen); + else + netdev_err(ndev, "Ignore VF_ASSOCIATION msg from the host supporting isolation\n"); break; } } -- 2.25.1
[PATCH v2 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
[PATCH v2 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host to be 5.2 or greater if the guest is running isolated. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 11170d9a2e1a5..bcf4d7def6838 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -66,6 +66,13 @@ module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, "Maximal VMBus protocol version which can be negotiated"); +static bool vmbus_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= VERSION_WIN10_V5_2; + return true; +} + int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version) { int ret = 0; @@ -233,6 +240,12 @@ int vmbus_connect(void) goto cleanup; version = vmbus_versions[i]; + + if (!vmbus_is_valid_version(version)) { + ret = -EINVAL; + goto cleanup; + } + if (version > max_version) continue; -- 2.25.1
[PATCH net-next] hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Pointers to receive-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, copy (sections of) the incoming packet after validating their length and offset fields in netvsc_filter_receive(). In this way, the packet can no longer be modified by the host. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/hyperv_net.h | 4 +- drivers/net/hyperv/netvsc.c | 20 + drivers/net/hyperv/netvsc_drv.c | 9 ++-- drivers/net/hyperv/rndis_filter.c | 74 +++ 4 files changed, 75 insertions(+), 32 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 2a87cfa27ac02..2c0ab80d8ae4e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -194,7 +194,8 @@ int netvsc_send(struct net_device *net, struct sk_buff *skb, bool xdp_tx); void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp); + struct rndis_message *resp, + void *data); int netvsc_recv_callback(struct net_device *net, struct netvsc_device *nvdev, struct netvsc_channel *nvchan); @@ -1002,6 +1003,7 @@ struct net_device_context { struct netvsc_channel { struct vmbus_channel *channel; struct netvsc_device *net_device; + void *recv_buf; /* buffer to copy packets out from the receive buffer */ const struct vmpacket_descriptor *desc; struct napi_struct napi; struct multi_send_data msd; diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 6184e99c7f31f..0fba8257fc119 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -131,6 +131,7 @@ static void free_netvsc_device(struct rcu_head *head) for (i = 0; i < VRSS_CHANNEL_MAX; i++) { xdp_rxq_info_unreg(&nvdev->chan_table[i].xdp_rxq); + kfree(nvdev->chan_table[i].recv_buf);
vfree(nvdev->chan_table[i].mrc.slots); } @@ -1284,6 +1285,19 @@ static int netvsc_receive(struct net_device *ndev, continue; } + /* We're going to copy (sections of) the packet into nvchan->recv_buf; +* make sure that nvchan->recv_buf is large enough to hold the packet. +*/ + if (unlikely(buflen > net_device->recv_section_size)) { + nvchan->rsc.cnt = 0; + status = NVSP_STAT_FAIL; + netif_err(net_device_ctx, rx_err, ndev, + "Packet too big: buflen=%u recv_section_size=%u\n", + buflen, net_device->recv_section_size); + + continue; + } + data = recv_buf + offset; nvchan->rsc.is_last = (i == count - 1); @@ -1535,6 +1549,12 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, for (i = 0; i < VRSS_CHANNEL_MAX; i++) { struct netvsc_channel *nvchan = &net_device->chan_table[i]; + nvchan->recv_buf = kzalloc(device_info->recv_section_size, GFP_KERNEL); + if (nvchan->recv_buf == NULL) { + ret = -ENOMEM; + goto cleanup2; + } + nvchan->channel = device->channel; nvchan->net_device = net_device; u64_stats_init(&nvchan->tx_stats.syncp); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index ac20c432d4d8f..1ec27bbe267da 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -743,7 +743,8 @@ static netdev_tx_t netvsc_start_xmit(struct sk_buff *skb, * netvsc_linkstatus_callback - Link up/down notification */ void netvsc_linkstatus_callback(struct net_device *net, - struct rndis_message *resp) + struct rndis_message *resp, + void *data) { struct rndis_indicate_status *indicate = &resp->msg.indicate_status; struct net_device_context *ndev_ctx = netdev_priv(net); @@ -757,6 +758,9 @@ void netvsc_linkstatus_callback(struct net_device *net, return; } + /* Copy the RNDIS indicate status into nvchan->recv_buf */ + memcpy(indicate, data + RNDIS_HEADER_SIZE, sizeof(*indicate)); + /* Update the physical link speed when changing to another vSwitch */ if (indicate->status
[PATCH v2] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 arch/x86/um/Makefile | 3 ++- 4 files changed, 4 insertions(+), 8 deletions(-) ChangeLog (v1 -> v2): - do not break UML build diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95dbc911..83c98dae74a6 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index f1f96d4d8cd6..5997ec0b4b17 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git 
a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index ccd32877a3c4..c7cf79be7231 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -36,14 +36,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(.L_restore) popq %r11 popq %r10 @@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore) ret _ASM_NOKPROBE(.L_restore) SYM_CODE_END(.L_restore) -#endif diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index 77f70b969d14..3113800da63a 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -27,7 +27,8 @@ else obj-y += syscalls_64.o vdso/ -subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o ../entry/thunk_64.o +subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o +subarch-$(CONFIG_PREEMPTION) += ../entry/thunk_64.o endif -- 2.29.2
Re: [PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
> > > > @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device > > > > *device, > > > > init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = > > > > 1; > > > > > > > > if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { > > > > - > > > > init_packet->msg.v2_msg.send_ndis_config.capability.sriov = > > > > 1; > > > > + if (!hv_is_isolation_supported()) > > > > + init_packet- > > > > >msg.v2_msg.send_ndis_config.capability.sriov = 1; > > > > > > Please also add a log there stating we don't support sriov in this case. > > Otherwise, > > > customers will ask why vf not showing up. > > > > IIUC, you're suggesting that I append something like: > > > > + else > > + netdev_info(ndev, "SR-IOV not advertised: isolation > > supported\n"); > > > > I've added this locally; please let me know if you had something else > > /better in mind. > > This message explains the failure reason better: > "SR-IOV not advertised by guests on the host supporting isolation" Applied. > > > > @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device > > > > *device, > > > > init_packet = _device->channel_init_pkt; > > > > > > > > /* Negotiate the latest NVSP protocol supported */ > > > > - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) > > > > + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { > > > > + if (!nvsp_is_valid_version(ver_list[i])) { > > > > + ret = -EPROTO; > > > > + goto cleanup; > > > > + } > > > > > > This code can catch the invalid, but cannot get the current host nvsp > > version. > > > I'd suggest move this check after version negotiation is done. So we can > > > log > > what's > > > the current host nvsp version, and why we fail it (the expected nvsp ver). > > > > Mmh, invalid versions are not negotiated. How about I simply add the > > following logging right before the above 'ret = -EPROTO' say? 
> > > > + netdev_err(ndev, "Invalid NVSP version %x > > (expected >= %x): isolation supported\n", > > + ver_list[i], NVSP_PROTOCOL_VERSION_61); > > > > (or something along these lines) > > The negotiation process runs from the latest to oldest. If the host is 5, > your code > will fail before trying v6.0, and log: > "Invalid NVSP version 6 (expected >= 60001): isolation supported" > This will make user think the NVSP version is 6.0. > > Since you will let the NIC fail and cleanup, there is no harm to check the > version > after negotiation. And this case is unexpected from a "normal" host. So I > suggest > move the check after negotiation is done, then we know the actual host nvsp > version that causing this issue. And we can bring the accurate info to host > team > for better diagnosability. Fair enough, will do. > Please point out this invalid version is caused by the host side, like this: > "Invalid NVSP version 0x5 (expected >= 0x60001) from the host with > isolation support" > Also please use "0x%x" for hexadecimal numbers. Sure. > > > > @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct > > > > net_device *ndev, > > > > break; > > > > > > > > case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: > > > > - netvsc_send_vf(ndev, nvmsg, msglen); > > > > + if (!hv_is_isolation_supported()) > > > > + netvsc_send_vf(ndev, nvmsg, msglen); > > > > > > When the driver doesn't advertise SRIOV, this message is not expected. > > > Instead of ignore silently, we should log an error. > > > > I've appended: > > > > + else > > + netdev_err(ndev, "Unexpected VF message: > > isolation supported\n"); > > Please log the msg type: > "Ignore VF_ASSOCIATION msg from the host supporting isolation" Applied. Thanks, Andrea
Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
On Thu, Jan 21, 2021 at 09:52:01AM +0100, Andrea Righi wrote: > On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote: > > > > * tip-bot2 for Andrea Righi wrote: > > > > > The following commit has been merged into the x86/entry branch of tip: > > > > > > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > > Gitweb: > > > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > > Author:Andrea Righi > > > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 > > > Committer: Ingo Molnar > > > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 > > > > > > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y > > > > > > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an > > > empty object file. > > > > > > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU > > > assembler doesn't generate a symbol table for empty object files and > > > objtool fails with the following error when a valid symbol table cannot > > > be found: > > > > > > arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table > > > > > > To prevent this from happening, build thunk_$(BITS).o only if > > > CONFIG_PREEMPTION is enabled. 
> > > > > > BugLink: https://bugs.launchpad.net/bugs/1911359 > > > > > > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") > > > Signed-off-by: Andrea Righi > > > Signed-off-by: Ingo Molnar > > > Cc: Borislav Petkov > > > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 > > > > Hm, this fails to build on UML defconfig: > > > > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > > arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk': > > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined > > reference to `preempt_schedule' > > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > > arch/x86/um/../entry/thunk_64.o: in function > > `preempt_schedule_notrace_thunk': > > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined > > reference to `preempt_schedule_notrace' > > > > Thanks, > > > > Ingo > > I've been able to reproduce it, I'm looking at this right now. Thanks! I see, basically UML selects ARCH_NO_PREEMPT, but in arch/x86/um/Makefile it explicitly includes thunk_$(BITS).o regardless. Considering that thunk_$(BITS) only contains preemption code now, we can probably drop it from the Makefile, or, to be more consistent with the x86 change, we could include it only if CONFIG_PREEMPTION is enabled (even if it would never be, because UML has ARCH_NO_PREEMPT). If it's unlikely that preemption will be enabled in UML one day I'd probably go with the former, otherwise I'd go with the latter, because it looks more consistent. Opinions? Thanks, -Andrea
Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote: > > * tip-bot2 for Andrea Righi wrote: > > > The following commit has been merged into the x86/entry branch of tip: > > > > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > Gitweb: > > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 > > Author:Andrea Righi > > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 > > Committer: Ingo Molnar > > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 > > > > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y > > > > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an > > empty object file. > > > > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU > > assembler doesn't generate a symbol table for empty object files and > > objtool fails with the following error when a valid symbol table cannot > > be found: > > > > arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table > > > > To prevent this from happening, build thunk_$(BITS).o only if > > CONFIG_PREEMPTION is enabled. 
> > > > BugLink: https://bugs.launchpad.net/bugs/1911359 > > > > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") > > Signed-off-by: Andrea Righi > > Signed-off-by: Ingo Molnar > > Cc: Borislav Petkov > > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 > > Hm, this fails to build on UML defconfig: > > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk': > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined > reference to `preempt_schedule' > > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld: > arch/x86/um/../entry/thunk_64.o: in function > `preempt_schedule_notrace_thunk': > /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined > reference to `preempt_schedule_notrace' > > Thanks, > > Ingo I've been able to reproduce it, I'm looking at this right now. Thanks! -Andrea
[tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
The following commit has been merged into the x86/entry branch of tip: Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 Gitweb: https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89 Author:Andrea Righi AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00 Committer: Ingo Molnar CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00 x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi Signed-off-by: Ingo Molnar Cc: Borislav Petkov Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390 --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95d..83c98da 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index 
f1f96d4..5997ec0 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index 496b11e..9d543c4 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -31,14 +31,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore) popq %r11 popq %r10 @@ -53,4 +50,3 @@ SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore) ret _ASM_NOKPROBE(__thunk_restore) SYM_CODE_END(__thunk_restore) -#endif
Re: [PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
> > @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device > > *device, > > init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; > > > > if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { > > - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = > > 1; > > + if (!hv_is_isolation_supported()) > > + init_packet- > > >msg.v2_msg.send_ndis_config.capability.sriov = 1; > > Please also add a log there stating we don't support sriov in this case. > Otherwise, > customers will ask why vf not showing up. IIUC, you're suggesting that I append something like: + else + netdev_info(ndev, "SR-IOV not advertised: isolation supported\n"); I've added this locally; please let me know if you had something else /better in mind. > > @@ -563,6 +565,13 @@ static int negotiate_nvsp_ver(struct hv_device > > *device, > > return ret; > > } > > > > +static bool nvsp_is_valid_version(u32 version) > > +{ > > + if (hv_is_isolation_supported()) > > + return version >= NVSP_PROTOCOL_VERSION_61; > > + return true; > Hosts support isolation should run nvsp 6.1+. This error is not expected. > Instead of fail silently, we should log an error to explain why it's failed, > and the current version and expected version. Please see my next comment below. > > +} > > + > > static int netvsc_connect_vsp(struct hv_device *device, > > struct netvsc_device *net_device, > > const struct netvsc_device_info *device_info) > > @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device > > *device, > > init_packet = _device->channel_init_pkt; > > > > /* Negotiate the latest NVSP protocol supported */ > > - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) > > + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { > > + if (!nvsp_is_valid_version(ver_list[i])) { > > + ret = -EPROTO; > > + goto cleanup; > > + } > > This code can catch the invalid, but cannot get the current host nvsp version. > I'd suggest move this check after version negotiation is done. 
So we can log > what's > the current host nvsp version, and why we fail it (the expected nvsp ver). Mmh, invalid versions are not negotiated. How about I simply add the following logging right before the above 'ret = -EPROTO' say? + netdev_err(ndev, "Invalid NVSP version %x (expected >= %x): isolation supported\n", + ver_list[i], NVSP_PROTOCOL_VERSION_61); (or something along these lines) > > @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct > > net_device *ndev, > > break; > > > > case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: > > - netvsc_send_vf(ndev, nvmsg, msglen); > > + if (!hv_is_isolation_supported()) > > + netvsc_send_vf(ndev, nvmsg, msglen); > > When the driver doesn't advertise SRIOV, this message is not expected. > Instead of ignore silently, we should log an error. I've appended: + else + netdev_err(ndev, "Unexpected VF message: isolation supported\n"); Please let me know if I got this wrong. Thanks, Andrea
[PATCH 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
Hi all, To reduce the footprint of the code that will be exercised, and hence the exposure to bugs and vulnerabilities, restrict configurations and devices on 'isolated' VMs. Specs of the Isolation Configuration leaf (cf. patch #1) were derived from internal discussions with the Hyper-V team and, AFAICT, they are not publicly available yet. The series has some minor/naming conflict with on-going work aimed at enabling SNP VMs on Hyper-V[1]; such conflicts can be addressed later at the right time. Applies to hyperv-next. Thanks, Andrea [1] https://github.com/lantianyu/linux # cvm Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: "David S. Miller" Cc: Jakub Kicinski Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org Cc: net...@vger.kernel.org Andrea Parri (Microsoft) (4): x86/hyperv: Load/save the Isolation Configuration leaf Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests hv_netvsc: Restrict configurations on isolated guests arch/x86/hyperv/hv_init.c | 15 + arch/x86/include/asm/hyperv-tlfs.h | 15 + arch/x86/kernel/cpu/mshyperv.c | 9 drivers/hv/channel_mgmt.c | 36 ++ drivers/hv/connection.c| 13 +++ drivers/net/hyperv/netvsc.c| 21 ++--- include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + include/linux/hyperv.h | 1 + 9 files changed, 113 insertions(+), 3 deletions(-) -- 2.25.1
[PATCH 2/4] Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests
Only the VSCs or ICs that have been hardened and that are critical for the successful adoption of Confidential VMs should be allowed if the guest is running isolated. This change reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/channel_mgmt.c | 36 include/linux/hyperv.h| 1 + 2 files changed, 37 insertions(+) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 68950a1e4b638..774ee19e3e90d 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -31,101 +31,118 @@ const struct vmbus_device vmbus_devs[] = { { .dev_type = HV_IDE, HV_IDE_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* SCSI */ { .dev_type = HV_SCSI, HV_SCSI_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Fibre Channel */ { .dev_type = HV_FC, HV_SYNTHFC_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* Synthetic NIC */ { .dev_type = HV_NIC, HV_NIC_GUID, .perf_device = true, + .allowed_in_isolated = true, }, /* Network Direct */ { .dev_type = HV_ND, HV_ND_GUID, .perf_device = true, + .allowed_in_isolated = false, }, /* PCIE */ { .dev_type = HV_PCIE, HV_PCIE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Frame Buffer */ { .dev_type = HV_FB, HV_SYNTHVID_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic Keyboard */ { .dev_type = HV_KBD, HV_KBD_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Synthetic MOUSE */ { .dev_type = HV_MOUSE, HV_MOUSE_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* KVP */ { .dev_type = HV_KVP, HV_KVP_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Time Synch */ { .dev_type = HV_TS, HV_TS_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* Heartbeat */ { .dev_type = HV_HB, HV_HEART_BEAT_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* 
Shutdown */ { .dev_type = HV_SHUTDOWN, HV_SHUTDOWN_GUID, .perf_device = false, + .allowed_in_isolated = true, }, /* File copy */ { .dev_type = HV_FCOPY, HV_FCOPY_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Backup */ { .dev_type = HV_BACKUP, HV_VSS_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Dynamic Memory */ { .dev_type = HV_DM, HV_DM_GUID, .perf_device = false, + .allowed_in_isolated = false, }, /* Unknown GUID */ { .dev_type = HV_UNKNOWN, .perf_device = false, + .allowed_in_isolated = false, }, }; @@ -903,6 +920,20 @@ find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer) return channel; } +static bool vmbus_is_valid_device(const guid_t *guid) +{ + u16 i; + + if (!hv_is_isolation_supported()) + return true; + + for (i = 0; i < ARRAY_SIZE(vmbus_devs); i++) { + if (guid_equal(guid, &vmbus_devs[i].guid)) + return vmbus_devs[i].allowed_in_isolated; + } + return false; +} + /* * vmbus_onoffer - Handler for channel offers from vmbus in parent partition. * @@ -917,6 +948,11 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr) trace_vmbus_onoffer(offer); + if (!vmbus_is_valid_device(&offer->offer.if_type)) { + atomic_dec(&vmbus_connection.offer_in_progress); + return; + } + oldchannel = find_primary_channel_by_offer(offer); if (oldchannel != NULL) { diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index f0d48a368f131..e3426f8c12db9 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -789,6 +789,7 @@ struct vmbus_device { u16 dev_type; guid_t guid; bool perf_device; + bool allowed_in_isolated; }; #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096 -- 2.25.1
[PATCH 4/4] hv_netvsc: Restrict configurations on isolated guests
Restrict the NVSP protocol version(s) that will be negotiated with the host to be NVSP_PROTOCOL_VERSION_61 or greater if the guest is running isolated. Moreover, do not advertise the SR-IOV capability and ignore NVSP_MSG_4_TYPE_SEND_VF_ASSOCIATION messages in isolated guests, which are not supposed to support SR-IOV. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org --- drivers/net/hyperv/netvsc.c | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..8027d553cb67d 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -22,6 +22,7 @@ #include #include +#include #include "hyperv_net.h" #include "netvsc_trace.h" @@ -544,7 +545,8 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->msg.v2_msg.send_ndis_config.capability.ieee8021q = 1; if (nvsp_ver >= NVSP_PROTOCOL_VERSION_5) { - init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; + if (!hv_is_isolation_supported()) + init_packet->msg.v2_msg.send_ndis_config.capability.sriov = 1; /* Teaming bit is needed to receive link speed updates */ init_packet->msg.v2_msg.send_ndis_config.capability.teaming = 1; @@ -563,6 +565,13 @@ static int negotiate_nvsp_ver(struct hv_device *device, return ret; } +static bool nvsp_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= NVSP_PROTOCOL_VERSION_61; + return true; +} + static int netvsc_connect_vsp(struct hv_device *device, struct netvsc_device *net_device, const struct netvsc_device_info *device_info) @@ -579,12 +588,17 @@ static int netvsc_connect_vsp(struct hv_device *device, init_packet = &net_device->channel_init_pkt; /* Negotiate the latest NVSP protocol supported */ - for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; 
i--) + for (i = ARRAY_SIZE(ver_list) - 1; i >= 0; i--) { + if (!nvsp_is_valid_version(ver_list[i])) { + ret = -EPROTO; + goto cleanup; + } if (negotiate_nvsp_ver(device, net_device, init_packet, ver_list[i]) == 0) { net_device->nvsp_version = ver_list[i]; break; } + } if (i < 0) { ret = -EPROTO; @@ -1357,7 +1371,8 @@ static void netvsc_receive_inband(struct net_device *ndev, break; case NVSP_MSG4_TYPE_SEND_VF_ASSOCIATION: - netvsc_send_vf(ndev, nvmsg, msglen); + if (!hv_is_isolation_supported()) + netvsc_send_vf(ndev, nvmsg, msglen); break; } } -- 2.25.1
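The negotiation loop above walks ver_list from the newest protocol version downward and now bails out with -EPROTO as soon as an isolated guest would have to fall below NVSP 6.1. The gating can be sketched in self-contained userspace C (the version constants mirror the kernel's hyperv_net.h, but treat the whole model — in particular the "host accepts anything <= host_max" stand-in — as illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

#define NVSP_PROTOCOL_VERSION_2  0x30002
#define NVSP_PROTOCOL_VERSION_5  0x50000
#define NVSP_PROTOCOL_VERSION_6  0x60000
#define NVSP_PROTOCOL_VERSION_61 0x60001

static const unsigned int ver_list[] = {
	NVSP_PROTOCOL_VERSION_2, NVSP_PROTOCOL_VERSION_5,
	NVSP_PROTOCOL_VERSION_6, NVSP_PROTOCOL_VERSION_61,
};

/* Mirrors nvsp_is_valid_version() in the patch. */
static bool nvsp_is_valid_version(unsigned int version, bool isolated)
{
	if (isolated)
		return version >= NVSP_PROTOCOL_VERSION_61;
	return true;
}

/*
 * Walk the table from newest to oldest; return the first version the
 * host accepts (modeled here as "anything <= host_max"), or 0 standing
 * in for -EPROTO when the isolation floor is reached first.
 */
static unsigned int negotiate(unsigned int host_max, bool isolated)
{
	for (size_t i = sizeof(ver_list) / sizeof(ver_list[0]); i-- > 0; ) {
		if (!nvsp_is_valid_version(ver_list[i], isolated))
			return 0;	/* would fall below the floor */
		if (ver_list[i] <= host_max)
			return ver_list[i];
	}
	return 0;
}
```

With this ordering, an isolated guest talking to a host that tops out at NVSP 5 fails the connect instead of silently negotiating a pre-6.1 protocol.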
[PATCH 1/4] x86/hyperv: Load/save the Isolation Configuration leaf
If bit 22 of Group B Features is set, the guest has access to the Isolation Configuration CPUID leaf. On x86, the first four bits of EAX in this leaf provide the isolation type of the partition; we entail three isolation types: 'SNP' (hardware-based isolation), 'VBS' (software-based isolation), and 'NONE' (no isolation). Signed-off-by: Andrea Parri (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Arnd Bergmann Cc: x...@kernel.org Cc: linux-a...@vger.kernel.org --- arch/x86/hyperv/hv_init.c | 15 +++ arch/x86/include/asm/hyperv-tlfs.h | 15 +++ arch/x86/kernel/cpu/mshyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + include/asm-generic/mshyperv.h | 5 + 5 files changed, 45 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index e04d90af4c27c..dc94e95c57b98 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -528,3 +529,17 @@ bool hv_is_hibernation_supported(void) return acpi_sleep_state_supported(ACPI_STATE_S4); } EXPORT_SYMBOL_GPL(hv_is_hibernation_supported); + +enum hv_isolation_type hv_get_isolation_type(void) +{ + if (!(ms_hyperv.hypercalls_features & HV_ISOLATION)) + return HV_ISOLATION_TYPE_NONE; + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b); +} +EXPORT_SYMBOL_GPL(hv_get_isolation_type); + +bool hv_is_isolation_supported(void) +{ + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE; +} +EXPORT_SYMBOL_GPL(hv_is_isolation_supported); diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 6bf42aed387e3..6aed936e5e962 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -22,6 +22,7 @@ #define HYPERV_CPUID_ENLIGHTMENT_INFO 0x4004 #define HYPERV_CPUID_IMPLEMENT_LIMITS 0x4005 #define HYPERV_CPUID_NESTED_FEATURES 0x400A +#define HYPERV_CPUID_ISOLATION_CONFIG 0x400C #define 
HYPERV_CPUID_VIRT_STACK_INTERFACE 0x4081 #define HYPERV_VS_INTERFACE_EAX_SIGNATURE 0x31235356 /* "VS#1" */ @@ -122,6 +123,20 @@ #define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18) #define HV_X64_NESTED_MSR_BITMAP BIT(19) +/* HYPERV_CPUID_ISOLATION_CONFIG.EAX bits. */ +#define HV_PARAVISOR_PRESENT BIT(0) + +/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits. */ +#define HV_ISOLATION_TYPE GENMASK(3, 0) +#define HV_SHARED_GPA_BOUNDARY_ACTIVE BIT(5) +#define HV_SHARED_GPA_BOUNDARY_BITSGENMASK(11, 6) + +enum hv_isolation_type { + HV_ISOLATION_TYPE_NONE = 0, + HV_ISOLATION_TYPE_VBS = 1, + HV_ISOLATION_TYPE_SNP = 2 +}; + /* Hyper-V specific model specific registers (MSRs) */ /* MSR used to identify the guest OS. */ diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f628e3dc150f3..0d4aaf6694d01 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -225,6 +225,7 @@ static void __init ms_hyperv_init_platform(void) * Extract the features and hints */ ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES); + ms_hyperv.hypercalls_features = cpuid_ebx(HYPERV_CPUID_FEATURES); ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES); ms_hyperv.hints= cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO); @@ -259,6 +260,14 @@ static void __init ms_hyperv_init_platform(void) x86_platform.calibrate_cpu = hv_get_tsc_khz; } + if (ms_hyperv.hypercalls_features & HV_ISOLATION) { + ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG); + ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG); + + pr_info("Hyper-V: Isolation Config: GroupA 0x%x, GroupB 0x%x\n", + ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); + } + if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) { ms_hyperv.nested_features = cpuid_eax(HYPERV_CPUID_NESTED_FEATURES); diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index e73a11850055c..20d3cd9502043 100644 --- 
a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_POWER_MANAGEMENTBIT(12) +#define HV_ISOLATION BIT(22) /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index c5779968417
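The type extraction in hv_get_isolation_type() is a plain bitfield read of EBX from the Isolation Configuration leaf. A self-contained sketch, using simplified userspace stand-ins for the kernel's GENMASK()/FIELD_GET() helpers (the mask values follow the hunk above; the helper definitions themselves are reimplementations for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's GENMASK()/FIELD_GET(). */
#define GENMASK(h, l)		((~0u << (l)) & (~0u >> (31 - (h))))
#define FIELD_GET(mask, reg)	(((reg) & (mask)) / ((mask) & -(mask)))

/* HYPERV_CPUID_ISOLATION_CONFIG.EBX bits, as in the patch. */
#define HV_ISOLATION_TYPE		GENMASK(3, 0)
#define HV_SHARED_GPA_BOUNDARY_ACTIVE	(1u << 5)
#define HV_SHARED_GPA_BOUNDARY_BITS	GENMASK(11, 6)

enum hv_isolation_type {
	HV_ISOLATION_TYPE_NONE = 0,
	HV_ISOLATION_TYPE_VBS  = 1,
	HV_ISOLATION_TYPE_SNP  = 2,
};

/*
 * hv_leaf_present models the HV_ISOLATION feature bit (bit 22 of the
 * Group B features); without it the leaf must not be consulted at all.
 */
static enum hv_isolation_type isolation_type(uint32_t config_b,
					     bool hv_leaf_present)
{
	if (!hv_leaf_present)
		return HV_ISOLATION_TYPE_NONE;
	return (enum hv_isolation_type)FIELD_GET(HV_ISOLATION_TYPE, config_b);
}
```

Note that only bits 3:0 are consumed; the shared-GPA-boundary fields in the same register are extracted the same way with their own masks.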
[PATCH 3/4] Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests
Restrict the protocol version(s) that will be negotiated with the host to be 5.2 or greater if the guest is running isolated. This reduces the footprint of the code that will be exercised by Confidential VMs and hence the exposure to bugs and vulnerabilities. Signed-off-by: Andrea Parri (Microsoft) --- drivers/hv/connection.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 11170d9a2e1a5..bcf4d7def6838 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -66,6 +66,13 @@ module_param(max_version, uint, S_IRUGO); MODULE_PARM_DESC(max_version, "Maximal VMBus protocol version which can be negotiated"); +static bool vmbus_is_valid_version(u32 version) +{ + if (hv_is_isolation_supported()) + return version >= VERSION_WIN10_V5_2; + return true; +} + int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version) { int ret = 0; @@ -233,6 +240,12 @@ int vmbus_connect(void) goto cleanup; version = vmbus_versions[i]; + + if (!vmbus_is_valid_version(version)) { + ret = -EINVAL; + goto cleanup; + } + if (version > max_version) continue; -- 2.25.1
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sun, Jan 17, 2021 at 03:10:32PM +, Wei Liu wrote: > On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > > For additional robustness in the face of Hyper-V errors or malicious > > > > behavior, validate all values that originate from packets that Hyper-V > > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > > off the end of an array, or subvert an existing validation via integer > > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > > memory that has not been zeroed out. > > > > > > > > Reported-by: Juan Vazquez > > > > Signed-off-by: Andrea Parri (Microsoft) > > > > Cc: "David S. Miller" > > > > Cc: Jakub Kicinski > > > > Cc: Alexei Starovoitov > > > > Cc: Daniel Borkmann > > > > Cc: Andrii Nakryiko > > > > Cc: Martin KaFai Lau > > > > Cc: Song Liu > > > > Cc: Yonghong Song > > > > Cc: John Fastabend > > > > Cc: KP Singh > > > > Cc: net...@vger.kernel.org > > > > Cc: b...@vger.kernel.org > > > > --- > > > > Applies to 5.11-rc3 (and hyperv-next). > > > > > > So this is for hyperv-next or should we take it via netdev trees? > > > > No preference, either way is good for me. > > To be clear: There is no dependency on any patch in hyperv-next, right? > > That's my understanding, but I would like to confirm it. Well, I wrote that this *applies* to hyperv-next... but that's indeed the only 'dependency' I can think of. Hope this helps. Thanks, Andrea
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > For additional robustness in the face of Hyper-V errors or malicious > > behavior, validate all values that originate from packets that Hyper-V > > has sent to the guest. Ensure that invalid values cannot cause indexing > > off the end of an array, or subvert an existing validation via integer > > overflow. Ensure that outgoing packets do not have any leftover guest > > memory that has not been zeroed out. > > > > Reported-by: Juan Vazquez > > Signed-off-by: Andrea Parri (Microsoft) > > Cc: "David S. Miller" > > Cc: Jakub Kicinski > > Cc: Alexei Starovoitov > > Cc: Daniel Borkmann > > Cc: Andrii Nakryiko > > Cc: Martin KaFai Lau > > Cc: Song Liu > > Cc: Yonghong Song > > Cc: John Fastabend > > Cc: KP Singh > > Cc: net...@vger.kernel.org > > Cc: b...@vger.kernel.org > > --- > > Applies to 5.11-rc3 (and hyperv-next). > > So this is for hyperv-next or should we take it via netdev trees? No preference, either way is good for me. Thanks, Andrea
[PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
For additional robustness in the face of Hyper-V errors or malicious behavior, validate all values that originate from packets that Hyper-V has sent to the guest. Ensure that invalid values cannot cause indexing off the end of an array, or subvert an existing validation via integer overflow. Ensure that outgoing packets do not have any leftover guest memory that has not been zeroed out. Reported-by: Juan Vazquez Signed-off-by: Andrea Parri (Microsoft) Cc: "David S. Miller" Cc: Jakub Kicinski Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Andrii Nakryiko Cc: Martin KaFai Lau Cc: Song Liu Cc: Yonghong Song Cc: John Fastabend Cc: KP Singh Cc: net...@vger.kernel.org Cc: b...@vger.kernel.org --- Applies to 5.11-rc3 (and hyperv-next). Changes since v1 (Juan Vazquez): - Improve validation in rndis_set_link_state() and rndis_get_ppi() - Remove memory/skb leak in netvsc_alloc_recv_skb() drivers/net/hyperv/netvsc.c | 3 +- drivers/net/hyperv/netvsc_bpf.c | 6 ++ drivers/net/hyperv/netvsc_drv.c | 18 +++- drivers/net/hyperv/rndis_filter.c | 171 +++--- 4 files changed, 136 insertions(+), 62 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 1510a236aa341..d9324961e0d64 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -887,6 +887,7 @@ static inline int netvsc_send_pkt( int ret; u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound); + memset(, 0, sizeof(struct nvsp_message)); nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; if (skb) rpkt->channel_type = 0; /* 0 is RMC_DATA */ @@ -1306,7 +1307,7 @@ static void netvsc_send_table(struct net_device *ndev, sizeof(union nvsp_6_message_uber); /* Boundary check for all versions */ - if (offset > msglen - count * sizeof(u32)) { + if (msglen < count * sizeof(u32) || offset > msglen - count * sizeof(u32)) { netdev_err(ndev, "Received send-table offset too big:%u\n", offset); return; diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c 
index 440486d9c999e..11f0588a88843 100644 --- a/drivers/net/hyperv/netvsc_bpf.c +++ b/drivers/net/hyperv/netvsc_bpf.c @@ -37,6 +37,12 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan, if (!prog) goto out; + /* Ensure that the below memcpy() won't overflow the page buffer. */ + if (len > ndev->mtu + ETH_HLEN) { + act = XDP_DROP; + goto out; + } + /* allocate page buffer for data */ page = alloc_page(GFP_ATOMIC); if (!page) { diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index f32f28311d573..e5501c1a0cbd4 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -760,6 +760,16 @@ void netvsc_linkstatus_callback(struct net_device *net, if (indicate->status == RNDIS_STATUS_LINK_SPEED_CHANGE) { u32 speed; + /* Validate status_buf_offset */ + if (indicate->status_buflen < sizeof(speed) || + indicate->status_buf_offset < sizeof(*indicate) || + resp->msg_len - RNDIS_HEADER_SIZE < indicate->status_buf_offset || + resp->msg_len - RNDIS_HEADER_SIZE - indicate->status_buf_offset + < indicate->status_buflen) { + netdev_err(net, "invalid rndis_indicate_status packet\n"); + return; + } + speed = *(u32 *)((void *)indicate + indicate->status_buf_offset) / 1; ndev_ctx->speed = speed; @@ -865,8 +875,14 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net, */ if (csum_info && csum_info->receive.ip_checksum_value_invalid && csum_info->receive.ip_checksum_succeeded && - skb->protocol == htons(ETH_P_IP)) + skb->protocol == htons(ETH_P_IP)) { + /* Check that there is enough space to hold the IP header. */ + if (skb_headlen(skb) < sizeof(struct iphdr)) { + kfree_skb(skb); + return NULL; + } netvsc_comp_ipcsum(skb); + } /* Do L4 checksum offload if enabled and present. 
*/ if (csum_info && (net->features & NETIF_F_RXCSUM)) { diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 7e6dee2f02a43..68091a9a5070d 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -131,6
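The netvsc_linkstatus_callback() hunk above orders its comparisons so that no u32 expression can wrap: a naive `status_buf_offset + status_buflen > msg_len` test is defeated by a host that sends an offset near UINT32_MAX. A hedged userspace sketch of the same pattern (the RNDIS_HEADER_SIZE value and the struct-size parameter are illustrative placeholders, not the real layout):

```c
#include <stdbool.h>
#include <stdint.h>

#define RNDIS_HEADER_SIZE 8u	/* illustrative value for the sketch */

/*
 * Check that [buf_offset, buf_offset + buflen) lies inside the payload
 * that follows the RNDIS header of a msg_len-byte message, using only
 * subtractions whose operands were already compared, so nothing wraps.
 */
static bool status_buf_in_bounds(uint32_t msg_len, uint32_t struct_size,
				 uint32_t buf_offset, uint32_t buflen)
{
	if (msg_len < RNDIS_HEADER_SIZE)
		return false;			/* runt message */
	if (buf_offset < struct_size)
		return false;			/* buffer overlaps the struct */
	if (msg_len - RNDIS_HEADER_SIZE < buf_offset)
		return false;			/* offset beyond the payload */
	if (msg_len - RNDIS_HEADER_SIZE - buf_offset < buflen)
		return false;			/* length spills past the end */
	return true;
}

/* The naive check, shown only to demonstrate the wraparound. */
static bool naive_in_bounds(uint32_t msg_len, uint32_t buf_offset,
			    uint32_t buflen)
{
	return buf_offset + buflen <= msg_len;	/* wraps for huge offsets */
}
```

The subtraction-ordered variant rejects a hostile `buf_offset = 0xffffffff`, while the naive sum wraps around to a tiny value and accepts it.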
[PATCH] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an empty object file. With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU assembler doesn't generate a symbol table for empty object files and objtool fails with the following error when a valid symbol table cannot be found: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table To prevent this from happening, build thunk_$(BITS).o only if CONFIG_PREEMPTION is enabled. BugLink: https://bugs.launchpad.net/bugs/1911359 Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Signed-off-by: Andrea Righi --- arch/x86/entry/Makefile | 3 ++- arch/x86/entry/thunk_32.S | 2 -- arch/x86/entry/thunk_64.S | 4 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 08bf95dbc911..83c98dae74a6 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,) CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,) CFLAGS_syscall_x32.o += $(call cc-option,-Wno-override-init,) -obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o +obj-y := entry_$(BITS).o syscall_$(BITS).o obj-y += common.o obj-y += vdso/ obj-y += vsyscall/ +obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index f1f96d4d8cd6..5997ec0b4b17 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name) SYM_CODE_END(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index 
ccd32877a3c4..c7cf79be7231 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -36,14 +36,11 @@ SYM_FUNC_END(\name) _ASM_NOKPROBE(\name) .endm -#ifdef CONFIG_PREEMPTION THUNK preempt_schedule_thunk, preempt_schedule THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace EXPORT_SYMBOL(preempt_schedule_thunk) EXPORT_SYMBOL(preempt_schedule_notrace_thunk) -#endif -#ifdef CONFIG_PREEMPTION SYM_CODE_START_LOCAL_NOALIGN(.L_restore) popq %r11 popq %r10 @@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore) ret _ASM_NOKPROBE(.L_restore) SYM_CODE_END(.L_restore) -#endif -- 2.29.2
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
> wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
>
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect. mprotect in parallel with a read or recvmsg may be undefined, so I didn't bring it up, but it was pretty clear. The moment the write bit is cleared (no matter why and by whom) and the PG lock released, if there's any GUP pin, GUP currently loses synchrony.

In any case I intended to help exercise the new page_count logic with the testcase, possibly to make it behave better somehow, no matter how. I admit I'm also wondering myself about the exact semantics of O_DIRECT on clear_refs or uffd-wp tracking, but the point is that losing reads and getting unexpected data in the page still doesn't look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that has become impossible with page_count isn't even clear_refs nor uffd-wp. The useful case in which I can see zero fundamental flaws is an RDMA or some other device computing in pure readonly DMA on the data while a program runs normally and produces it. It could even be a framebuffer that doesn't care about coherency. You may want to occasionally wrprotect the memory under a readonly long term GUP pin for consistency, even against bugs of the program itself. Why should wrprotecting make the device lose synchrony? And what kind of performance do we gain for the normal, useful cases by breaking the special case? Is there a benchmark showing it?
> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree; even then it still looks more difficult than using page_mapcount in do_wp_page in my view, so I also don't see the simplification. And overall the amount of kernel code saw a net increase as a result.

Thanks,
Andrea
Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse
Hello,

On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described

I updated the above partly quoted sentence since in the previous version it didn't have full accuracy:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a

"However since a single transient GUP pin on a tail page, would elevate
 the page_count for all other tail pages (unlike the mapcount which is
 subpage granular), the COW page reuse inaccuracy would then cross
 different vmas and the effect would happen at a distance in vma of
 different processes. A single GUP pin taken on a subpage mapped in a
 different process could trigger 511 false positive COWs copies in the
 local process, after a fork()."

This is a best effort to try to document all side effects, but it'd be great to hear from Kirill too on the above detail to have confirmation.

Thanks and have a great weekend everyone,
Andrea
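The 511-false-COWs figure in the quoted commit text follows directly from the granularity mismatch: a GUP pin is taken on one 4k subpage, but the elevated count is visible through every subpage of the 2M compound page, while mapcounts stay per-subpage. A toy model (deliberately not kernel code) that reproduces the arithmetic:

```c
#define HPAGE_NR 512	/* 4k subpages in a 2M THP */

/* Toy THP: one shared pin count, subpage-granular mapcounts. */
struct toy_thp {
	int compound_pins;	/* elevated by a pin on ANY subpage */
	int pinned[HPAGE_NR];	/* which subpage actually holds the pin */
	int mapcount[HPAGE_NR];
};

static void toy_thp_init(struct toy_thp *t)
{
	t->compound_pins = 0;
	for (int i = 0; i < HPAGE_NR; i++) {
		t->pinned[i] = 0;
		t->mapcount[i] = 1;	/* each subpage mapped exactly once */
	}
}

static void pin_subpage(struct toy_thp *t, int idx)
{
	t->pinned[idx]++;
	t->compound_pins++;	/* the count all 512 subpages will see */
}

/*
 * Subpages that a compound-count-based reuse test would copy on write
 * even though nothing pins them and their own mapcount is 1.
 */
static int false_positive_cows(const struct toy_thp *t)
{
	int n = 0;
	for (int i = 0; i < HPAGE_NR; i++)
		if (t->mapcount[i] == 1 && !t->pinned[i] && t->compound_pins > 0)
			n++;
	return n;
}
```

One transient pin on subpage 0 makes the other 511 subpages look pinned to any check that consults the shared compound count.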
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) ==
> VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long
> addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking page_mapcount above as an optimization, to me it looks much simpler to check it in a single place, in do_wp_page, which, in addition to optimizing away the superfluous copy, would optimize away the above complexity as well.

And I won't comment on whether it's actually safe to skip random pages or not. All I know is that for mprotect and uffd-wp, the above approach definitely wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned  450 include/linux/mm_types.h  * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c             atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c                  atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c                  atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned  819 mm/memory.c               if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits of an atomic_t in mm_struct just to call atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED that can already be set atomically under mmap_read_lock too? There are bits left free there; we haven't run out yet, so that doesn't justify wasting another 31 bits. I hope I'm overlooking something.
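The MMF_HAS_PINNED suggestion above boils down to a set_bit()/test_bit() pair on the existing mm->flags word instead of a dedicated 32-bit atomic_t that only ever holds 0 or 1. A userspace sketch with C11 atomics (the bit number and the struct are hypothetical, picked only for the illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MMF_HAS_PINNED_BIT 28	/* hypothetical free bit in mm->flags */
#define MMF_HAS_PINNED (1ul << MMF_HAS_PINNED_BIT)

struct toy_mm {
	atomic_ulong flags;	/* stand-in for mm_struct.flags */
};

/* set_bit()-style: idempotent, safe to call concurrently under a read lock. */
static void mm_set_has_pinned(struct toy_mm *mm)
{
	atomic_fetch_or_explicit(&mm->flags, MMF_HAS_PINNED,
				 memory_order_relaxed);
}

static bool mm_has_pinned(struct toy_mm *mm)
{
	return atomic_load_explicit(&mm->flags, memory_order_relaxed)
		& MMF_HAS_PINNED;
}
```

The atomic OR leaves the other flag bits untouched, which is exactly why a flags word can absorb the boolean without a new field.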
The existence of FOLL_LONGTERM is good and makes a difference at times for writeback if it's on a MAP_SHARED, or it makes a difference during GUP to do a page_migrate before taking the pin, but for the whole rest of the VM it's irrelevant whether it's long or short term, so I'm also concerned by what Jason mentioned about long term pins being treated differently within the VM. That to me looks fundamentally as flawed as introducing inaccuracies in do_wp_page, which, even ignoring the performance implications caused by the inaccuracy, simply prevents doing some useful things.

I obviously agree all common workloads with no GUP pins and no clear_refs and no uffd are way more important to be optimal, but I haven't seen a single benchmark showing the benefit of not taking the PG_lock before a page copy, or any other runtime benefit coming from page_count in do_wp_page. To the contrary, now I see additional branches in fork and in various other paths.

The only thing, again, where I see page_count providing a tangible benefit is the vmsplice attack from child to parent, but that has not been fully fixed, and if page_count is added to fix it in all COW faults, it'll introduce extra inefficiency to the very common, important workloads, not only to the special GUP/clear_refs/uffd-wp workloads, as your patch above shows.

Thanks,
Andrea
[PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Andrew and everyone,

Once we agree that COW page reuse requires full accuracy, the next step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to return to going in that direction.

Who is going to orthogonally secure vmsplice, Andy, Jason, Jens? Once vmsplice is secured from taking unconstrained unprivileged long term GUP pins, the attack from child to parent can still happen in theory, but statistically speaking, once that huge window is closed, it won't be a practical concern, so it'll give us time to perfect the full solution by closing all windows in the VM core. vmsplice has to be orthogonally fixed anyway, even if all windows were closed in the VM core first.

Unfortunately it's still not clear exactly what failed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f, but the whole point is that we need to discuss that together.

Thanks,
Andrea

// SPDX-License-Identifier: GPL-3.0-or-later
/*
 * reproducer for v5.11 O_DIRECT mm corruption with page_count
 * instead of mapcount in do_wp_page.
 *
 * Copyright (C) 2021 Red Hat, Inc.
 *
 * gcc -O2 -o page_count_do_wp_page page_count_do_wp_page.c -lpthread
 * page_count_do_wp_page ./whateverfile
 *
 * NOTE: CONFIG_SOFT_DIRTY=y is required in the kernel config.
 */

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include

#define PAGE_SIZE (1UL<<12)
#define HARDBLKSIZE 512

static void* writer(void *_mem)
{
	char *mem = (char *)_mem;
	for(;;) {
		usleep(random() % 1000);
		mem[PAGE_SIZE-1] = 0;
	}
}

static void* background_soft_dirty(void *data)
{
	long fd = (long) data;
	for (;;)
		if (write(fd, "4", 1) != 1)
			perror("write soft dirty"), exit(1);
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		printf("%s ", argv[0]), exit(1);
	char path[PAGE_SIZE];
	strcpy(path, "/proc/");
	sprintf(path + strlen(path), "%d", getpid());
	strcat(path, "/clear_refs");
	long soft_dirty_fd = open(path, O_WRONLY);
	if (soft_dirty_fd < 0)
		perror("open clear_refs"), exit(1);
	char *mem;
	if (posix_memalign((void **)&mem, PAGE_SIZE, PAGE_SIZE*3))
		perror("posix_memalign"), exit(1);
	/* THP is not using page_count so it would not corrupt memory */
	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE))
		perror("madvise"), exit(1);
	bzero(mem, PAGE_SIZE * 3);
	memset(mem + PAGE_SIZE * 2, 0xff, HARDBLKSIZE);
	/*
	 * This is not specific to O_DIRECT. Even if O_DIRECT was
	 * forced to use PAGE_SIZE minimum granularity for reads, a
	 * recvmsg would create the same issue since it also uses
	 * iov_iter_get_pages internally to create transient GUP pins
	 * on anon memory.
	 */
	int fd = open(argv[1], O_DIRECT|O_CREAT|O_RDWR|O_TRUNC, 0600);
	if (fd < 0)
		perror("open"), exit(1);
	if (write(fd, mem, PAGE_SIZE) != PAGE_SIZE)
		perror("write"), exit(1);
	pthread_t soft_dirty;
	if (pthread_create(&soft_dirty, NULL, background_soft_dirty,
			   (void *)soft_dirty_fd))
		perror("soft_dirty"), exit(1);
	pthread_t thread;
	if (pthread_create(&thread, NULL, writer, mem))
		perror("pthread_create"), exit(1);
	bool skip_memset = true;
	while (1) {
		if (pread(fd, mem, HARDBLKSIZE, 0) != HARDBLKSIZE)
			perror("read"), exit(1);
		if (memcmp(mem, mem+PAGE_SIZE, HARDBLKSIZE)) {
			if (memcmp(mem, mem+PAGE_SIZE*2, PAGE_SIZE)) {
				if (skip_memset)
					printf("unexpected memory "
					       "corruption detected\n");
				else
					printf("memory corruption detected, "
					       "dumping page\n");
				int end = PAGE_SIZE;
				if (!memcmp(mem+HARDBLKSIZE, mem+PAGE_SIZE,
					    PAGE_SIZE-HARDBLKSIZE))
					end = HARDBLKSIZE;
				for (int i = 0; i < end; i++)
					printf("%x", mem[i]);
				printf("\n");
			} else
				printf("memory corruption detected\n");
		}
		skip_memset = !skip_memset;
		if (!skip_memset)
			memset(me
[PATCH 1/1] mm: restore full accuracy in COW page reuse
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.

The example below uses O_DIRECT, on a process under clear_refs. There's no FOLL_LONGTERM involvement. All GUP pins are transient. fork is never called and, even when clear_refs is cleared by an external program, fork() would not be involved.

  thread0                      thread1              other process (or thread 3)
  =======                      =======              ============================
  read syscall
  O_DIRECT read DMA
  to vaddr+0 len = 512 bytes
  GUP(FOLL_WRITE)
  DMA writing to RAM started
                                                    clear_refs
                                                    pte_wrprotect
                               write vaddrA+512
                               page_count == 2
                               wp_copy_page
  read syscall returns
  read lost

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE granularity is an ABI break by itself, this is of course not just about O_DIRECT. recvmsg, kernel TLS and plenty of other GUP FOLL_WRITE iov_iter_get_pages users would write to the memory with sub-PAGE_SIZE granularity, depending on the msghdr iov tiny buffers. Those recvmsg reads are already silently lost too, just as the above O_DIRECT reads already get lost.

The fact COW must not happen too much is documented in commit 6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

Quote from the Link tagged:

==
In other words, the page_count check in do_wp_page, extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
==

In addition FOLL_LONGTERM breaks for readonly DMA, despite FOLL_LONGTERM pages now being treated specially and copied in fork(). FOLL_LONGTERM is in no way special with regard to the VM core and it can't be.
From the VM standpoint all GUP pins are equal, and breaking one means breaking them all the same. It is not possible to resolve this by looking only at FOLL_LONGTERM: that only hides the problem, but it doesn't fully resolve it, as shown above.

In addition this avoids storms of extra false positive COWs if the THP are shared by different processes, which cause extra overhead, but the several performance considerations are only a secondary concern.

After this commit, the copy in fork() for FOLL_LONGTERM also must be extended to all GUP pins including transient pins, or we shouldn't even copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any differently from a transient GUP pin is a sign that something is fundamentally flawed. FOLL_LONGTERM can be treated differently in writeback and during GUP runtime itself, but not in the VM core when deciding when to COW or not to COW an anonymous page.

This revert causes no theoretical security regression because THP is not yet using page_count in do_huge_pmd_wp_page. The alternative to this patch would be to replace the mapcount with the page_count in do_huge_pmd_wp_page too, in order to really close the vmsplice attack from child to parent that way. However, since a single transient GUP pin on a tail page would elevate the page_count for all other tail pages (unlike the mapcount, which is subpage granular), the above "fork-less" race would span over a HPAGE_PMD_SIZE range; it would even cross different vmas and the effect would happen at a distance in vmas of different processes, allowing a child to corrupt memory in the parent. That's a problem that could happen not-maliciously too. So the scenario described above, if extended to the new THP refcounting, would be more concerning than the vmsplice issue, which can be tackled first in vmsplice itself, and only at a second stage in the COW code.
Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com Cc: sta...@kernel.org Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Signed-off-by: Andrea Arcangeli --- include/linux/ksm.h | 7 ++ mm/ksm.c| 25 +++ mm/memory.c | 59 - 3 files changed, 74 insertions(+), 17 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 161e8164abcf..e48b1e453ff5 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page, void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); void ksm_migrate_page(struct page *newpage, struct page *oldpage);
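The disagreement in the revert text reduces to which counter do_wp_page consults before reusing an anonymous page. A toy contrast of the two policies (deliberately simplified; the real reuse path also takes the page lock, handles swap-cache references, and so on):

```c
#include <stdbool.h>

struct toy_page {
	int refcount;	/* page_count(): mappings + GUP pins + other refs */
	int mapcount;	/* page_mapcount(): page-table mappings only */
};

/* Policy after 09854ba94c6a: reuse only when we hold the sole reference. */
static bool reuse_by_refcount(const struct toy_page *p)
{
	return p->refcount == 1;
}

/* Pre-09854ba94c6a flavor: reuse when this is the only mapping. */
static bool reuse_by_mapcount(const struct toy_page *p)
{
	return p->mapcount == 1;
}
```

With a transient O_DIRECT read pin in flight (refcount 2, mapcount 1), the refcount policy copies the page and the DMA completes into the abandoned copy — the lost-read scenario in the commit message above — while the mapcount policy reuses the page in place.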
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
and mmu notifier, is if the hardware is capable of handling it. There is no real difference other than that.

> mmu notifier users who are using hmm_range_fault() do not ever take any
> page references when doing their work, that seems like the right
> approach, for a shadow mmu?

They can all do it like HMM, or you can take the FOLL_GET as long as you remember put_page. Jerome also intended to optimize the KVM fault like that, but as said above, we were NAKed on that attempt. Whether there is the pin or not makes zero semantic difference: it's purely an optimization when there is no pin, and it's a bugcheck safety feature if there is the pin. By the time it can make a runtime difference whether there is the pin or not, put_page has been called already.

> > Nothing at all can go wrong, unless wp_copy_page suddenly makes the
> > secondary MMU go out of sync the moment you wrprotect the page with
> > clear_refs.
>
> To be honest, I've read most of this discussion, and the prior one,
> between you and Linus carefully, but I still don't understand what
> clear_refs is about or how KVM's use of mmu notifiers got broken. This
> is probably because I'm only a little familiar with those areas :\

KVM's use of mmu notifier is not related to this. clear_refs can simply wrprotect the page, of any process: echo .. >/proc/self/clear_refs. Then you check in /proc/self/pagemap looking for soft dirty (or something like that). The point is that if you do echo ... >/proc/self/clear_refs on your pid, and that pid has any FOLL_LONGTERM pin on its mm, it'll just cause your device driver to go out of sync with the mm. It'll see the old pages, before the spurious COWs. The CPU will use new pages (the spurious COWs).

> Is it actually broken or just inefficient? If wp_copy_page is going
> more often than it should the secondary mmu should still fully track
> that?

It's about the DMA going out of sync and losing the view of the mm. In addition the TLB flush broke with the mmu_read_lock, but that can be fixed somehow.
The TLB flush, still only because of the spurious COWs, has now to cope with the fact that there can be a spurious wp_page_copy right after wrprotecting a read-write page. Before, that couldn't happen: fork() couldn't run since it takes mmap_write_lock, so if the pte was writable and transitioned to non-writable it'd mean it was an exclusive page and it was guaranteed to be re-used, so the stale TLB would keep writing in place.

The stale TLB is the exact equivalent of your FOLL_LONGTERM, except it's the window the CPU has on the old page, while FOLL_LONGTERM is the window the PCI device has on the old page. The spurious COW is what makes the TLB and the PCI device go out of sync, reading and writing the old page while the CPU moved on to a new page. The issue is really similar.

> To be clear, here I am only talking about pin_user_pages. We now have
> logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not,
> and that is what is driving the copy on fork logic.

fork() wrprotects, and like every other wrprotect it was just falling into the above scenario.

> secondary-mmu drivers using mmu notifier should not trigger this logic
> and should not restrict write protect.

That's a great point. I didn't think of the fact that the mmu notifier will invalidate the secondary MMU and ultimately issue a GUP after the wp_copy_page to keep it in sync. The funny thing that doesn't make sense is that wp_copy_page will only be invoked because the PIN was left by KVM on the page for that extra safety I was talking about earlier. Are we forced to drop all the page pins to be able to wrprotect the memory without being flooded by immediate COWs?

So the ultimate breakpoint is the FOLL_LONGTERM with no mmu notifier, which goes out of sync on a wrprotect, and that can happen if the device is doing a readonly access long term.
I quote you earlier:

"A long term page pin must use pin_user_pages(), and either
FOLL_LONGTERM|FOLL_WRITE (write mode)
FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)"

You clearly contemplate the existence of a read mode, long term. That is also completely compatible with wrprotection. Why should we pick a model that forbids this from working? What do we get back from it? I only see unnecessary risk and inefficiencies coming from it.

> > Ultimately, what do we really gain from all this breakage?
>
> Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very
> positive for DMA drivers working in that area.

I was referring to page_count in do_wp_page, not pin_user_pages, sorry for the confusion.

Thanks,
Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
>
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and uffd-wp, so I didn't actually try to fix the hugepage case by adding the page_count checks there too. I could try to do that; at least it'd be consistent, but I'd still try to find an alternate solution later.

> > Irrelevant special case as in: long term GUP pin on the memory?
>
> Irrelevant special case in that
>
> (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
>
> and
>
> (b) it's not a performance issue either unless you can find a real
> load that does it.
>
> Hmm?

For (b), I don't have a hard time imagining `ps` hanging for seconds if clear_refs is touched on a 4T mm, but (b) is not the main concern. Having to rely on (a) is the main concern, and it's not about TLB flushes but the long term GUP pins.

Thanks,
Andrea
Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support
On Fri, Jan 08, 2021 at 10:55:05AM -0800, Andy Lutomirski wrote:
> The implementation was rather buggy. It unconditionally marked PTEs
> read-only, and it released the mmap lock before flushing the TLB, which
> could allow a racing CoW operation to falsely believe that the
> underlying memory was not writable.
>
> I can't find any users at all of this mechanism, so just remove it.

Reviewed-by: Andrea Arcangeli
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote:
> Can we just remove vmsplice() support? We could make it do a normal

The single case I've seen vmsplice used for so far that was really cool is localhost live migration of qemu. However, despite being really cool, it wasn't merged in the end, and I don't recall exactly why.

There are even more efficient (but slightly more complex) ways to do that than vmsplice: use MAP_SHARED gigapages, or MAP_SHARED tmpfs with THP opted-in on the tmpfs mount, as guest physical memory instead of anon memory, and find a way not to have it cleared by kexec, so you can also upgrade the host kernel and not just qemu. That is a far more optimal approach than pinning and moving all pages through the pipe while still having to pay a superfluous copy at the destination.

My guess why it's not popular, and I may be completely wrong on this since I basically never used vmsplice (other than to proof-of-concept DoS my phone to verify the long term GUP pin exploit works), is that vmsplice is a more efficient, but not the most efficient, option. Exactly like in the live migration case, it's always more efficient to share a tmpfs THP backed region and have true zero copy than to go through a pipe that still does one copy at the receiving end. It may also be simpler, and it's not dependent on obscure F_SETPIPE_SIZE tunings. So in the end it's still too slow for apps that require maximum performance, and not worth the extra work for those that don't. I love vmsplice conceptually, I'd just prefer that an luser cannot run it.

> copy, thereby getting rid of a fair amount of nastiness and potential
> attacks. Even ignoring issues relating to the length of time that the
> vmsplice reference is alive, we also have whatever problems could be
> caused by a malicious or misguided user vmsplice()ing some memory and
> then modifying it.

Sorry to ask, but I'm curious: what else goes wrong if the user modifies memory under a GUP pin from vmsplice? That's not obvious to see.
Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > The majority cannot be converted to notifiers because they are DMA
> > > based. Every one of those is an ABI for something, and does not expect
> > > extra privilege to function. It would be a major breaking change to
> > > have pin_user_pages require some cap.
> >
> > ... what makes them safe is to be transient GUP pin and not long
> > term.
> >
> > Please note the "long term" in the underlined line.
>
> Many of them are long term, though only 50 or so have been marked
> specifically with FOLL_LONGTERM. I don't see how we can make such a
> major ABI break.

io_uring is one of those indeed and I already flagged it. This isn't a black and white issue: kernel memory is also pinned, but it's not in movable pageblocks... How do you tell the VM in GUP to migrate memory to a non-movable pageblock before pinning it? Because that's what it should do to create less breakage.

For example the iommu obviously needs to be privileged; if your argument is that it's enough to use the right API to take long term pins unconstrained, that's not the case. Pins are pins and prevent moving or freeing the memory; their effect is the same and again worse than mlock on many levels.

(then I know on preempt-rt it should behave like a pin, and that's fine, you disable all features for such purpose there)

io_uring is fine in comparison to vmsplice, but still not perfect, because it does the RLIMIT_MEMLOCK accounting, but that is unfortunately tangibly unreliable since a pin can cost 2M or 1G (now 1G is basically privileged, so it doesn't hurt to get the accounting wrong in such a case, but it's still technically counting apples as oranges).

Maybe io_uring could keep not doing mmu notifier; I'd need more investigation to be sure, but what's the point of keeping it VM-breaking when it doesn't need to be? Is io_uring required to set up the ring at high frequency?
> Looking at it, vmsplice() is simply wrong. A long term page pin must
> use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode)
> FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)
>
> ie it must COW and it must reject cases that are not longterm safe,
> like DAX and CMA and so on.
>
> These are the well established rules, vmsplice does not get a pass

Where are the established rules written down? pin_user_pages.rst doesn't make any mention of FOLL_FORCE or FOLL_WRITE at all; mm.h, same thing.

In any case, the extra flags required with FOLL_LONGTERM should be implied by FOLL_LONGTERM itself once it enters the gup code, because it's not cool having to set FOLL_WRITE in all drivers for a GUP(write=0), let alone having to specify FOLL_FORCE for just a read. But this is going offtopic.

> simply because it is using the CPU to memory copy as its "DMA".

vmsplice can't find all the put_pages that may release the pages when the pipe is read, or it'd at least be able to do the unreliable RLIMIT_MEMLOCK accounting. I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is open, so if you don't mind, I'll link your review as confirmation.

> > speaking in practice. io_uring has similar concern but it can use mmu
> > notifier, so it can totally fix it and be 100% safe from this.
>
> IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

Right, it's one of those 50. FOLL_WRITE won't magically allow the memory to be swapped or migrated.

To make another example: a single unprivileged pin on the movable zone can break memhotunplug unless you use the mmu notifier. Every other advanced feature falls apart.

So again, if an unprivileged syscall allows only a very limited number of pages, and maybe it also checks whether it got a THP or a gigapage from the pin and sets its own limit, maybe again it's not a big concern.
vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344 13548 132186040  4 172 2052 9997  5  2 93  0  0
    -> vmsplice reproducer started here
 1  0 1074432 8538184 13548 132582000  0   0 1973 8838  4  3 93  0  0
 1  0 1074432 8538436 13548 132552400  0   0 1730 8168  4  2 94  0  0
 1  0 1074432 8539096 13556 132188000  0  72 1811 8640  3  2 95  0  0
 0  0 1074432 8539348 13564 132202800  0  36 1936 8684  4  2 95  0  0
    -> vmsplice killed here
 1  0 1074432 9586720 13564 132224800  0   0 1893 8514  4  2 94  0  0

That's ~1G that goes away for each task, and I didn't even check if it's all THP pages getting in there; the RSS is 3MB despite 1G being taken down in GUP pins with zero privilege:

 1512 pts/25   S    0:00   0   0 133147  3044  0.0 ./vmsplice

Again, memcg is robust so it's not a concern for the host; the attack remains contained in the per-memcg OOM kill.
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
>
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not? Because in current upstream it's not fixed when using the enterprise default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
>
> Instead, what matters is the *NORMAL* VM flow.
>
> Things like COW.
>
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
>
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
>
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
>
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory? Or irrelevant special case as in: causing a secondary MMU to hit silent data loss if a pte is ever wrprotected (arch code, vm86, whatever, all under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
>
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
>
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the code VM work well.
New special cases? Which new cases? There's nothing new here besides the zygote, which wasn't fully fixed with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, and it is actually the only new case I can imagine where page_count actually isn't a regression. All the old cases that you seem to refer to as irrelevant are in production in v4.18; I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with COWs happening in different processes would forever hit wp_copy_page since the page count is always > 1 despite mapcount being 1 for all subpages. A simple app doing fork/exec would forever copy all memory in the parent even after the exec is finished.

Thanks,
Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote:
> > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> > >
> > > > vmsplice syscall API is insecure allowing long term GUP PINs without
                                                  ^
> > > > privilege.
> > >
> > > Lots of places are relying on pin_user_pages long term pins of memory,
> > > and cannot be converted to notifiers.
> > >
> > > I don't think it is reasonable to just declare that insecure and
> > > requires privileges, it is a huge ABI break.
> >
> > Where's that ABI? Are there specs or a code example in kernel besides
> > vmsplice itself?
>
> If I understand you right, you are trying to say that the 193
> pin_user_pages() callers cannot exist as unpriv any more?

193, 1k, 1m, or their number in general, won't just make them safe...

> The majority cannot be converted to notifiers because they are DMA
> based. Every one of those is an ABI for something, and does not expect
> extra privilege to function. It would be a major breaking change to
> have pin_user_pages require some cap.

... what makes them safe is to be a transient GUP pin and not long term.

Please note the "long term" in the underlined line.

O_DIRECT is perfectly ok to be unprivileged, obviously. The VM can wait; eventually it goes away. Even a swapout is not an instant event and can be held off by any number of other things besides a transient GUP pin. It can be held off by PG_lock, just to make an example.

mlock however is long term, persistent; vmsplice takes persistent pins and can pin way too much memory for each mm, and that doesn't feel safe. The more places doing stuff like that, the more likely one causes a safety issue, not the other way around in fact.

> > The whole zygote issue wouldn't even register if the child had the
> > exact same credentials of the parent.
> > Problem is the child dropped
> > privileges and went with a luser id, that clearly cannot ptrace the
> > parent, and so if long term unprivileged GUP pins are gone from the
> > equation, what remains that the child can do is purely theoretical
> > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.
>
> Sorry, I'm not sure I've found a good explanation how ptrace and GUP
> are interacting to become a security problem.

ptrace is not involved. What I meant by mentioning ptrace is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such a case.

With O_DIRECT or any transient pin you will never munmap while the O_DIRECT is in flight; if you munmap, it's undefined what happens in such a case anyway. It is a theoretical security issue made practical by the vmsplice API, which allows enlarging the window to years of time (not guaranteed milliseconds) while waiting for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue in theory remains, but with vmsplice removed it becomes statistically irrelevant in practice. io_uring has a similar concern, but it can use the mmu notifier, so it can totally fix it and be 100% safe from this.

The scheduler disclosure date was 2020-08-25, so I can freely explain the case that motivated all these changes.

case A)

	if (!fork()) {
		// in child
		mmap one page
		vmsplice takes gup pin long term on such page
		munmap one page
		// mapcount == 1 (parent mm)
		// page_count == 2 (gup in child, and parent mm)
	} else {
		parent writes to the page
		// mapcount == 1, wp_page_reuse
	}

The parent did a COW with mapcount == 1, so the parent will take over a page that is still GUP pinned in the child. That's the security issue, because in this case the GUP pin is malicious.
Now imagine this case B)

	mmap one page
	RDMA or any secondary MMU takes a long term GUP pin
	munmap one page
	// mapcount == 1 (parent mm)
	// page_count == 2 (gup in RDMA, and parent mm)

How can the VM tell between the two different cases? It can't.

The current page_count check in do_wp_page treats both cases the same, and because page_count is 2 in both cases, it calls wp_page_copy in both, breaking COW in both. However, you know full well that in the second case it is a feature and not a bug that wp_page_reuse is called instead; in fact it has to be called, or it's a bug (and that's the bug the page_count check in do_wp_page introduces).

So page_count in do_wp_page is breaking all valid users to take care of a purely theoretical security issue that isn't a practical concern if only vmsplice is secured at least as well as mlock. page_count in do_wp_page is fundamentally flawed for all long term GUP pins done by secondary MMUs attach