On 11/27/23 14:08, Peter Zijlstra wrote:
> On Mon, Nov 27, 2023 at 10:48:29AM -0600, Madhavan T. Venkataraman wrote:
>> Apologies for the late reply. I was on vacation. Please see my response 
>> below:
>>
>> On 11/13/23 02:19, Peter Zijlstra wrote:
>>> On Sun, Nov 12, 2023 at 09:23:24PM -0500, Mickaël Salaün wrote:
>>>> From: Madhavan T. Venkataraman <madve...@linux.microsoft.com>
>>>>
>>>> X86 uses a function called __text_poke() to modify executable code. This
>>>> patching function is used by many features such as KProbes and FTrace.
>>>>
>>>> Update the permissions counters for the text page so that write
>>>> permissions can be temporarily established in the EPT to modify the
>>>> instructions in that page.
>>>>
>>>> Cc: Borislav Petkov <b...@alien8.de>
>>>> Cc: Dave Hansen <dave.han...@linux.intel.com>
>>>> Cc: H. Peter Anvin <h...@zytor.com>
>>>> Cc: Ingo Molnar <mi...@redhat.com>
>>>> Cc: Kees Cook <keesc...@chromium.org>
>>>> Cc: Madhavan T. Venkataraman <madve...@linux.microsoft.com>
>>>> Cc: Mickaël Salaün <m...@digikod.net>
>>>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>>>> Cc: Sean Christopherson <sea...@google.com>
>>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>>> Cc: Vitaly Kuznetsov <vkuzn...@redhat.com>
>>>> Cc: Wanpeng Li <wanpen...@tencent.com>
>>>> Signed-off-by: Madhavan T. Venkataraman <madve...@linux.microsoft.com>
>>>> ---
>>>>
>>>> Changes since v1:
>>>> * New patch
>>>> ---
>>>>  arch/x86/kernel/alternative.c |  5 ++++
>>>>  arch/x86/mm/heki.c            | 49 +++++++++++++++++++++++++++++++++++
>>>>  include/linux/heki.h          | 14 ++++++++++
>>>>  3 files changed, 68 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>>>> index 517ee01503be..64fd8757ba5c 100644
>>>> --- a/arch/x86/kernel/alternative.c
>>>> +++ b/arch/x86/kernel/alternative.c
>>>> @@ -18,6 +18,7 @@
>>>>  #include <linux/mmu_context.h>
>>>>  #include <linux/bsearch.h>
>>>>  #include <linux/sync_core.h>
>>>> +#include <linux/heki.h>
>>>>  #include <asm/text-patching.h>
>>>>  #include <asm/alternative.h>
>>>>  #include <asm/sections.h>
>>>> @@ -1801,6 +1802,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
>>>>     */
>>>>    pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
>>>>  
>>>> +  heki_text_poke_start(pages, cross_page_boundary ? 2 : 1, pgprot);
>>>>    /*
>>>>     * The lock is not really needed, but this allows to avoid open-coding.
>>>>     */
>>>> @@ -1865,7 +1867,10 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
>>>>    }
>>>>  
>>>>    local_irq_restore(flags);
>>>> +
>>>>    pte_unmap_unlock(ptep, ptl);
>>>> +  heki_text_poke_end(pages, cross_page_boundary ? 2 : 1, pgprot);
>>>> +
>>>>    return addr;
>>>>  }
>>>
>>> This makes no sense, we already use a custom CR3 with userspace alias
>>> for the actual pages to write to, why are you then frobbing permissions
>>> on that *again* ?
>>
>> Today, the permissions for a guest page in the extended page table
>> (EPT) are RWX (unless permissions are restricted for some specific
>> reason like for shadow page table pages). In this Heki feature, we
>> don't allow RWX by default in the EPT. We only allow those permissions
>> in the EPT that the guest page actually needs.  E.g., for a text page,
>> it is R_X in both the guest page table and the EPT.
> 
> To what end? If you always mirror what the guest does, you've not
> actually gained anything.
> 
>> For text patching, the above code establishes an alternate mapping in
>> the guest page table that is RW_ so that the text can be patched. That
>> needs to be reflected in the EPT so that the EPT permissions will
>> change from R_X to RWX. In other words, RWX is allowed only as
>> necessary. At the end of patching, the EPT permissions are restored to
>> R_X.
>>
>> Does that address your comment?
> 
> No, if you want to mirror the native PTEs why don't you hook into the
> paravirt page-table muck and get all that for free?
> 
> Also, this is the user range, are you saying you're also playing these
> daft games with user maps?

I think that we should have done a better job of communicating the threat
model in Heki and how we are trying to address it. I will correct that here.
I think this will help answer many questions. Some of these questions also
came up at LPC when we presented this. Apologies for the slightly long answer.
It is for everyone's benefit. Bear with me.

Threat Model
------------

In the threat model in Heki, the attacker is a user space attacker who exploits
a kernel vulnerability to gain more privileges or bypass the kernel's access
control and self-protection mechanisms. 

In the context of the guest page table, one of the things that the threat
model translates to is a hacker gaining access to a guest page with RWX
permissions. E.g., by adding execute permissions to a writable page or by
adding write permissions to an executable page.

Today, the permissions for a guest page in the extended page table are RWX by
default. So, if a hacker manages to establish RWX for a page in the guest page
table, then that is all he needs to do some damage.

How to defeat the threat
------------------------

To defeat this, we need to establish the correct permissions for a guest page
in the extended page table as well. That is, R_X for a text page, R__ for a
read-only page and RW_ for a writable page. The only exception is a guest page
that is mapped via multiple mappings with different permissions in each
mapping. In that case, the collective permissions across all mappings need
to be established in the extended page table so that all mappings can work.
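
As a rough illustration of the "collective permissions" rule (a sketch only,
not code from the patch series; the PERM_* bits and the function name are
hypothetical), the extended page table permissions for a page can be viewed as
the bitwise OR of the permissions of every guest mapping of that page:

```c
#include <stddef.h>

/* Hypothetical permission bits; not the names used in the Heki patches. */
enum { PERM_R = 1u << 0, PERM_W = 1u << 1, PERM_X = 1u << 2 };

/*
 * Return the union of the permissions of all guest mappings of one page.
 * This is the minimum the extended page table must allow so that every
 * mapping keeps working.
 */
static unsigned int ept_perms_for_page(const unsigned int *mapping_perms,
				       size_t n)
{
	unsigned int perms = 0;
	size_t i;

	for (i = 0; i < n; i++)
		perms |= mapping_perms[i];

	return perms;
}
```

With one R_X mapping and one RW_ alias of the same page, this yields RWX. A
page with a single R__ mapping stays R__.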

Mechanism
---------

To achieve all this, Heki finds all the kernel mappings at the end of kernel
boot and reflects their permissions in the extended page table before
kicking off the init process.

During runtime, permissions on a guest page can change because of genuine
kernel operations:
        - vmap/vunmap
        - text patching for FTrace, Kprobes, etc
        - set_memory_*()

In each of these cases as well, the permissions need to be reflected in the
extended page table. In summary, the extended page table permissions mirror
the guest page table ones.

Authentication
--------------

The above approach addresses the case where a hacker tries to directly
modify a guest page table entry. Such a modification does not help the
attacker, since the extended page table permissions are not changed and
still restrict access to the page.

Now, the question is - what if a hacker manages to use the Heki primitives
and establish the permissions he wants in the extended page table? All of
this work is for nothing!!

The answer is - authentication. If an entity outside the guest can validate
or authenticate each guest request to change extended page table permissions,
then it can tell a genuine request from an attack. We are planning to make the
VMM that entity. In the case of a genuine request, the VMM will call the
hypervisor and establish the correct permissions in the extended page table.
If the VMM thinks that it is an attack, it will have the hypervisor send an
exception to the guest.

In the current version (RFC V2), we don't have authentication in place. The
VMM is not in the picture yet. This is WIP. We are hoping to implement some
authentication in V3 and improve it as we go forward. So, in V2, we only have
the mechanisms in place.

So, I agree that it is kinda hard to see the value of Heki without
authentication.

Kernel Lockdown
---------------

But, we must provide at least some security in V2. Otherwise, it is useless.

So, we have implemented what we call a kernel lockdown. At the end of kernel
boot, Heki establishes permissions in the extended page table as mentioned
before. Also, it adds an immutable attribute for kernel text and kernel RO data.
Beyond that point, guest requests that attempt to modify permissions on any of
the immutable pages will be denied.
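
A minimal sketch of the lockdown check, under stated assumptions: the struct,
the function name and the choice of -EPERM as the "denied" result are all
hypothetical and only model the behavior described above. They are not taken
from the patches.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical permission bits; not the names used in the Heki patches. */
enum { PERM_R = 1u << 0, PERM_W = 1u << 1, PERM_X = 1u << 2 };

/* Hypothetical per-page state tracked for the extended page table. */
struct ept_page {
	unsigned int perms;	/* permissions currently allowed in the EPT */
	bool immutable;		/* set for kernel text and RO data at lockdown */
};

/*
 * Handle a guest request to change EPT permissions on a page.
 * A request that would change an immutable page is denied and the
 * existing permissions are left untouched.
 */
static int ept_set_perms(struct ept_page *page, unsigned int perms)
{
	if (page->immutable && perms != page->perms)
		return -EPERM;

	page->perms = perms;
	return 0;
}
```

In this model, a request to make a locked-down text page RWX fails, which is
why text patching cannot work on kernel text until authentication exists.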

This means that features like FTrace and KProbes will not work on kernel text
in V2. This is a temporary limitation. Once authentication is in place, the
limitation will go away.

Additional information
----------------------

The following primitives call Heki functions to reflect guest page table
permissions in the extended page table:

heki_arch_late_init()
        This calls heki_protect().
        
        This is called at the end of kernel boot to protect all guest kernel
        mappings at that point.

vmap_range_noflush()
vmap_small_pages_range_noflush()
        These functions call heki_map().

        These are the lowest level functions called from different places
        to map something in the kernel address space.

set_memory_nx()
set_memory_rw()
        These functions call heki_change_page_attr_set().

set_memory_x()
set_memory_ro()
set_memory_rox()
        These functions call heki_change_page_attr_clear().

__text_poke()
        This function is called by various features to patch text.
        This calls heki_text_poke_start() and heki_text_poke_end().

        heki_text_poke_start() is called to add write permissions to the
        extended page table so that text can be patched. heki_text_poke_end()
        is called to revert write permissions in the extended page table.
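
The patch description mentions "permissions counters" for the text page. The
data structure itself is not shown in this mail, so the following is only a
sketch of how such counters could behave, with hypothetical names throughout:
each mapping adds to the counters for the permissions it needs, and the
extended page table allows a permission while its counter is non-zero.

```c
/* Hypothetical permission bits; not the names used in the Heki patches. */
enum { PERM_R = 1u << 0, PERM_W = 1u << 1, PERM_X = 1u << 2 };

/* Hypothetical per-page counters: how many mappings need each permission. */
struct perm_counters {
	unsigned int r, w, x;
};

/* Account for a new mapping of the page with the given permissions. */
static void counters_add(struct perm_counters *c, unsigned int perms)
{
	if (perms & PERM_R)
		c->r++;
	if (perms & PERM_W)
		c->w++;
	if (perms & PERM_X)
		c->x++;
}

/* Account for the removal of a mapping with the given permissions. */
static void counters_sub(struct perm_counters *c, unsigned int perms)
{
	if ((perms & PERM_R) && c->r)
		c->r--;
	if ((perms & PERM_W) && c->w)
		c->w--;
	if ((perms & PERM_X) && c->x)
		c->x--;
}

/* Permissions the extended page table should currently allow. */
static unsigned int counters_to_perms(const struct perm_counters *c)
{
	unsigned int perms = 0;

	if (c->r)
		perms |= PERM_R;
	if (c->w)
		perms |= PERM_W;
	if (c->x)
		perms |= PERM_X;

	return perms;
}
```

In this model, a text page starts as R_X. The temporary RW_ alias set up for
patching raises it to RWX, and dropping the alias returns it to R_X.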

User mappings
-------------

All of this work is only for protecting guest kernel mappings. So, the idea
is that the VMM and the hypervisor protect the integrity of the kernel, and
the kernel protects the integrity of user land. So, user pages continue to
have RWX in the extended page table. The MBEC (Mode-Based Execute Control)
feature makes it possible to have separate execute permission bits in the
extended page table entry for user and kernel.

One final thing
---------------

Peter mentioned the following:

"if you want to mirror the native PTEs why don't you hook into the
paravirt page-table muck and get all that for free?"

We did consider using a shadow page table kind of approach so that guest page
table modifications can be intercepted and reflected in the page table entry.
We did not do this for two reasons:

- there are bits in the page table entry that are not permission bits. We
  would like the guest kernel to be able to modify them directly.

- we cannot tell a genuine request from an attack.

That said, we are still considering making the guest page table read only for
added security. This is WIP. I do not have details yet.

If I have not addressed any comment, please let me know.

As always, thanks for your comments.

Madhavan T Venkataraman




