[RFC PATCH v1 2/2] virt: memctl: add Yuanchu and Pasha as maintainers

2024-05-13 Thread Yuanchu Xie
The memctl driver lives under drivers/virt/memctl. We specify
maintainers for the driver.

Signed-off-by: Yuanchu Xie 
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 61117c3afa80..5793df2b2f5e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14057,6 +14057,13 @@ F: mm/memblock.c
 F: mm/mm_init.c
 F: tools/testing/memblock/
 
+MEMCTL GUEST DRIVER
+M: Yuanchu Xie 
+M: Pasha Tatashin 
+L: linux-kernel@vger.kernel.org
+S: Supported
+F: drivers/virt/memctl/
+
 MEMORY CONTROLLER DRIVERS
 M: Krzysztof Kozlowski 
 L: linux-kernel@vger.kernel.org
-- 
2.45.0.rc1.225.g2a3ae87e7f-goog




[RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-13 Thread Yuanchu Xie
Memctl provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
left unbacked, or that sensitive data must not be swapped out, etc.

Memctl allows a guest to manipulate its gPTE entries in the SLAT, as
well as other properties of the host memory that backs its memory map.
This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
capability is available, the changes in the backing of the memory region
on the host are automatically reflected into the guest. For example, an
mmap() or madvise() that affects the region will be made visible
immediately.

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM
device assigns a unique command for each per-cpu buffer. The guest
writes its memctl request into the per-cpu buffer, then writes the
corresponding command into the command register, which calls into the
VMM device to perform the memctl request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.

We provide both kernel and userspace APIs.

Kernel API:
long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
 struct memctl_buf *buf);

Kernel drivers can take advantage of the memctl calls to provide
paravirtualization of kernel stacks or page zeroing.

User API:
From userland, the memctl guest driver is controlled via the ioctl(2)
call. It requires CAP_SYS_ADMIN.

ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm);

Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.

Supported function codes and their use cases:
MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT: the guest can reduce the
struct page and page table lookup overhead by using hugepages backed by
smaller pages on the host. These memctl commands allow partial freeing
of private guest hugepages to save memory. They also allow kernel
memory, such as kernel stacks and task_structs, to be paravirtualized.

MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to
share its backing pages.
The same goes for MADV_DONTDUMP, so that sensitive pages are not
included in a dump.
MLOCK/UNLOCK can advise the host that sensitive information should not
be swapped out on the host.

MEMCTL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages, stack
guard pages can be handled on the host and memory can be saved within
the hugepage.

MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how
guest memory is being mapped on the host.

Sample program making use of MEMCTL_SET_VMA_ANON_NAME and
MEMCTL_DONTNEED:
https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main
https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed

The VMM implementation is being proposed for Cloud Hypervisor:
https://github.com/Dummyc0m/cloud-hypervisor/

Cloud Hypervisor issue:
https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318

Signed-off-by: Yuanchu Xie 
---
 .../userspace-api/ioctl/ioctl-number.rst  |   2 +
 drivers/virt/Kconfig  |   2 +
 drivers/virt/Makefile |   1 +
 drivers/virt/memctl/Kconfig   |  10 +
 drivers/virt/memctl/Makefile  |   2 +
 drivers/virt/memctl/memctl.c  | 425 ++
 include/linux/memctl.h|  27 ++
 include/uapi/linux/memctl.h   |  81 
 8 files changed, 550 insertions(+)
 create mode 100644 drivers/virt/memctl/Kconfig
 create mode 100644 drivers/virt/memctl/Makefile
 create mode 100644 drivers/virt/memctl/memctl.c
 create mode 100644 include/linux/memctl.h
 create mode 100644 include/uapi/linux/memctl.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 457e16f06e04..789d1251c0be 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -368,6 +368,8 @@ Code  Seq#    Include File                   Comments
 0xCD  01     linux/reiserfs_fs.h
 0xCE  01-02  uapi/linux/cxl_mem.h           Compute Express Link Memory Devices
 0xCF  02     fs/smb/client/cifs_ioctl.h
+0xDA  00     linux/memctl.h                 Memctl Device
+
 0xDB  00-0F  drivers/char/mwave/mwavepub.h
 0xDD  00-3F                                 ZFCP device driver see drivers/s390/scsi/

di

Re: [PATCH] modules: Drop the .export_symbol section from the final modules

2024-05-13 Thread Masahiro Yamada
On Sun, May 12, 2024 at 7:42 AM Luis Chamberlain  wrote:
>
> On Wed, Apr 17, 2024 at 01:35:30PM +0800, wang...@lemote.com wrote:
> > From: Wang Yao 
> >
> > Commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
> > forgot to drop the .export_symbol section from the final modules.
> >
> > Signed-off-by: Wang Yao 
>
> Masahiro, commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by
> modpost") was your change, wanna address / take it through your
> tree? It makes sense to me though.


Yes, applied now.

Thanks for the reminder.





>
>   Luis
>
> > ---
> >  scripts/module.lds.S | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> > index bf5bcf2836d8..89ff01a22634 100644
> > --- a/scripts/module.lds.S
> > +++ b/scripts/module.lds.S
> > @@ -13,6 +13,7 @@ SECTIONS {
> >   /DISCARD/ : {
> >   *(.discard)
> >   *(.discard.*)
> > + *(.export_symbol)
> >   }
> >
> >   __ksymtab   0 : { *(SORT(___ksymtab+*)) }
> > --
> > 2.27.0
> >



-- 
Best Regards
Masahiro Yamada



Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Thomas Gleixner
On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> Any interrupt which is affine to an outgoing CPU is migrated and
>> eventually pending moves are enforced:
>> 
>> cpu_down()
>>   ...
>>   cpu_disable_common()
>> fixup_irqs()
>>   irq_migrate_all_off_this_cpu()
>> migrate_one_irq()
>>   irq_force_complete_move()
>> free_moved_vector();
>> 
>> No?
>
> I noticed this and finally abandoned the solution to fix at migrate_one_irq(),
> because:
>
> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
> be cleanup before irq_do_set_affinity().
>
> 2. The irq_needs_fixup() may return false so that irq_force_complete_move()
> does not get the chance to trigger.
>
> 3. Even if irq_force_complete_move() is triggered, it exits early if
> apicd->prev_vector==0.

But that's not the case, really.

> The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
> cpu_disable_common() releases the vector_lock after CPU is flagged offline.

Nothing can schedule vector cleanup at that point because _all_ other
CPUs spin in stop_machine() with interrupts disabled and therefore
cannot handle interrupts which might invoke it.

So it does not matter whether the vector lock is dropped or not in
cpu_disable_common().

> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct
> apic_chip_data *apicd)
> cl->timer.expires = jiffies + 1;
> add_timer_on(&cl->timer, cpu);
> }
> -   } else {
> -   apicd->prev_vector = 0; // or print a warning

This really wants to be a warning.

>> In fact irq_complete_move() should never see apicd->move_in_progress
>> with apicd->prev_cpu pointing to an offline CPU.
>
> I think it is possible. The fact that a CPU is offline doesn't indicate
> fixup_irqs() has already been triggered. The vector_lock is released after CPU
> is flagged offline.

No.

stop_machine()
_ALL_ CPUs rendevouz and spin with interrupts disabled

Outgoing CPU invokes cpu_disable_common()

So it does not matter at all whether vector lock is dropped before
fixup_irqs() is invoked. The new target CPU _cannot_ handle the
interrupt at that point and invoke irq_complete_move().

>> If you can trigger that case, then there is something fundamentally
>> wrong with the CPU hotplug interrupt migration code and that needs to be
>> investigated and fixed.
>> 
>
> I can easily reproduce the issue.

Good.

> I will fix it in the interrupt migration code.

You need a proper explanation for the problem first otherwise you can't
fix it.

I understand the failure mode by now. What happens is:

1) Interrupt is affine to CPU11

2) Affinity is set to CPU10

3) Interrupt is raised and handled on CPU11

   irq_move_masked_irq()
 irq_do_set_affinity()
   apicd->prev_cpu = 11;
   apicd->move_in_progress = true;

4) CPU11 goes offline

   irq_needs_fixup() returns false because effective affinity points
   already to CPU 10, so irq_force_complete_move() is not invoked.

5) Interrupt is raised and handled on CPU10

   irq_complete_move()
__vector_schedule_cleanup()
   if (cpu_online(apicd->prev_cpu))<- observes offline

See? So this has nothing to do with vector lock being dropped.

> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
> index 1ed2b1739363..5ecd072a34fe 100644
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
> return false;
> }
>
> +   /*
> +* Complete an eventually pending irq move cleanup. If this
> +* interrupt was moved in hard irq context, then the vectors need
> +* to be cleaned up. It can't wait until this interrupt actually
> +* happens and this CPU was involved.
> +*/
> +   irq_force_complete_move(desc);

You cannot do that here because it is only valid when the interrupt is
affine to the outgoing CPU.

In the problem case the interrupt was affine to the outgoing CPU, but
the core code does not know that it has not been cleaned up yet. It does
not even know that the interrupt was affine to the outgoing CPU before.

So in principle we could just go and do:

} else {
-   apicd->prev_vector = 0;
+   free_moved_vector(apicd);
}
raw_spin_unlock(&vector_lock);

but that would not give enough information and would depend on the
interrupt to actually arrive.

irq_force_complete_move() already describes this case, but obviously it
does not work because the core code does not invoke it in that
situation.

So yes, moving the invocation of irq_force_complete_move() before the
irq_needs_fixup() call makes sense, but it wants this to actually work
correctly:

Re: [PATCH v1 1/1] Input: gpio-keys - expose wakeup keys in sysfs

2024-05-13 Thread Dmitry Torokhov
Hi Guido,

On Thu, May 09, 2024 at 02:00:28PM +0200, Guido Günther wrote:
> This helps user space to figure out which keys should be used to unidle a
> device. E.g on phones the volume rocker should usually not unblank the
> screen.

How exactly this is supposed to be used? We have "disabled" keys and
switches attribute because this function can be controlled at runtime
from userspace while wakeup control is a static device setting.

Kernel also does not really know if the screen should be unblanked or
not, if a button or switch is configured for wake up the kernel will go
through wakeup process all the same and then userspace can decide if it
should stay woken up or not.

Thanks.

-- 
Dmitry



Re: [PATCHv5 bpf-next 7/8] selftests/x86: Add return uprobe shadow stack test

2024-05-13 Thread Jiri Olsa
On Mon, May 13, 2024 at 06:45:07PM +0900, Masami Hiramatsu wrote:
> On Tue,  7 May 2024 12:53:20 +0200
> Jiri Olsa  wrote:
> 
> > Adding return uprobe test for shadow stack and making sure it's
> > working properly. Borrowed some of the code from bpf selftests.
> 
> Hi Jiri,
> 
> I can not find "SKIP" result in this change. If CONFIG_UPROBES=n,
> this should skip uprobe test.

ah it should be detected by parse_uint_from_file returning ENOENT
or something like that.. will add that

thanks,
jirka

> 
> Thank you,
> 
> > 
> > Signed-off-by: Jiri Olsa 
> > ---
> >  .../testing/selftests/x86/test_shadow_stack.c | 142 ++
> >  1 file changed, 142 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/x86/test_shadow_stack.c 
> > b/tools/testing/selftests/x86/test_shadow_stack.c
> > index 757e6527f67e..1b919baa999b 100644
> > --- a/tools/testing/selftests/x86/test_shadow_stack.c
> > +++ b/tools/testing/selftests/x86/test_shadow_stack.c
> > @@ -34,6 +34,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /*
> >   * Define the ABI defines if needed, so people can run the tests
> > @@ -681,6 +682,141 @@ int test_32bit(void)
> > return !segv_triggered;
> >  }
> >  
> > +static int parse_uint_from_file(const char *file, const char *fmt)
> > +{
> > +   int err, ret;
> > +   FILE *f;
> > +
> > +   f = fopen(file, "re");
> > +   if (!f) {
> > +   err = -errno;
> > +   printf("failed to open '%s': %d\n", file, err);
> > +   return err;
> > +   }
> > +   err = fscanf(f, fmt, &ret);
> > +   if (err != 1) {
> > +   err = err == EOF ? -EIO : -errno;
> > +   printf("failed to parse '%s': %d\n", file, err);
> > +   fclose(f);
> > +   return err;
> > +   }
> > +   fclose(f);
> > +   return ret;
> > +}
> > +
> > +static int determine_uprobe_perf_type(void)
> > +{
> > +   const char *file = "/sys/bus/event_source/devices/uprobe/type";
> > +
> > +   return parse_uint_from_file(file, "%d\n");
> > +}
> > +
> > +static int determine_uprobe_retprobe_bit(void)
> > +{
> > +   const char *file = 
> > "/sys/bus/event_source/devices/uprobe/format/retprobe";
> > +
> > +   return parse_uint_from_file(file, "config:%d\n");
> > +}
> > +
> > +static ssize_t get_uprobe_offset(const void *addr)
> > +{
> > +   size_t start, end, base;
> > +   char buf[256];
> > +   bool found = false;
> > +   FILE *f;
> > +
> > +   f = fopen("/proc/self/maps", "r");
> > +   if (!f)
> > +   return -errno;
> > +
> > +   while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) 
> > == 4) {
> > +   if (buf[2] == 'x' && (uintptr_t)addr >= start && 
> > (uintptr_t)addr < end) {
> > +   found = true;
> > +   break;
> > +   }
> > +   }
> > +
> > +   fclose(f);
> > +
> > +   if (!found)
> > +   return -ESRCH;
> > +
> > +   return (uintptr_t)addr - start + base;
> > +}
> > +
> > +static __attribute__((noinline)) void uretprobe_trigger(void)
> > +{
> > +   asm volatile ("");
> > +}
> > +
> > +/*
> > + * This test sets up a return uprobe, which is sensitive to shadow stack
> > + * (crashes without the extra fix). After executing the uretprobe we fail
> > + * the test if we receive SIGSEGV; no crash means we're good.
> > + *
> > + * Helper functions above borrowed from bpf selftests.
> > + */
> > +static int test_uretprobe(void)
> > +{
> > +   const size_t attr_sz = sizeof(struct perf_event_attr);
> > +   const char *file = "/proc/self/exe";
> > +   int bit, fd = 0, type, err = 1;
> > +   struct perf_event_attr attr;
> > +   struct sigaction sa = {};
> > +   ssize_t offset;
> > +
> > +   type = determine_uprobe_perf_type();
> > +   if (type < 0)
> > +   return 1;
> > +
> > +   offset = get_uprobe_offset(uretprobe_trigger);
> > +   if (offset < 0)
> > +   return 1;
> > +
> > +   bit = determine_uprobe_retprobe_bit();
> > +   if (bit < 0)
> > +   return 1;
> > +
> > +   sa.sa_sigaction = segv_gp_handler;
> > +   sa.sa_flags = SA_SIGINFO;
> > +   if (sigaction(SIGSEGV, &sa, NULL))
> > +   return 1;
> > +
> > +   /* Setup return uprobe through perf event interface. */
> > +   memset(&attr, 0, attr_sz);
> > +   attr.size = attr_sz;
> > +   attr.type = type;
> > +   attr.config = 1 << bit;
> > +   attr.config1 = (__u64) (unsigned long) file;
> > +   attr.config2 = offset;
> > +
> > +   fd = syscall(__NR_perf_event_open, &attr, 0 /* pid */, -1 /* cpu */,
> > +-1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
> > +   if (fd < 0)
> > +   goto out;
> > +
> > +   if (sigsetjmp(jmp_buffer, 1))
> > +   goto out;
> > +
> > +   ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
> > +
> > +   /*
> > +* This either segfaults and goes through sigsetjmp above
> > +* or succeeds and we're good.
> > +*/
> > +   uretprobe_trigger();
> > +
> > +   printf("[OK]\tUretprobe test\n");
> > +   err = 0;
> > +
> > +out:
> > +   ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_S

Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-13 Thread Jiri Olsa
On Mon, May 13, 2024 at 05:12:31PM +, Edgecombe, Rick P wrote:
> On Mon, 2024-05-13 at 18:50 +0900, Masami Hiramatsu wrote:
> > > I guess it's doable, we'd need to keep both trampolines around, because
> > > shadow stack is enabled by app dynamically and use one based on the
> > > state of shadow stack when uretprobe is installed
> > > 
> > > so you're worried the optimized syscall path could be somehow exploited
> > > to add data on shadow stack?
> 
> Shadow stack allows for modification to the shadow stack only through a few
> limited ways (call, ret, etc). The kernel has the ability to write through
> shadow stack protections (for example when pushing and popping signal frames),
> but the ways in which it does this are limited in order to try to prevent
> providing extra capabilities to attackers wanting to craft their own shadow
> stacks.
> 
> But the HW features have optional abilities to allow extra patterns of shadow
> stack modification for userspace as well. This can facilitate unusual patterns
> of stack modification (like in this series). For, x86 there is the ability to
> allow an instruction (called WRSS) such that userspace can also write
> arbitrary data to the shadow stack. Arm has something like that, plus an
> instruction to push to the shadow stack.
> 
> There was some debate about whether to use these features, as glibc could not
> perfectly match compatibility for features that play with the stack like
> longjmp(). As in, without using those extra HW capabilities, some apps would
> require modifications to work with shadow stack.
> 
> There has been a lot of design tension between security, performance and
> compatibility in figuring out how to fit this feature into existing software. 
> In
> the end the consensus was to not use these extra HW capabilities, and lean
> towards security in the implementation. To try to summarize the debate, this 
> was
> because we could get pretty close to compatibility without enabling these 
> extra
> features.
> 
> So since this solution does something like enabling these extra capabilities 
> in
> software that were purposely disabled in HW, it raises eyebrows. Glibc has 
> some
> operations that now have extra steps because of shadow stack. So if we could 
> do
> something that was still functional, but slower and more secure, then it seems
> roughly in line with the tradeoffs we have gone with so far.

so at the moment the patch 6 changes shadow stack for

1) current uretprobes, which are not working at the moment; we change
   the top value of the shadow stack with shstk_push_frame
2) optimized uretprobes, which need to push a new frame on the shadow
   stack with shstk_update_last_frame

I think we should do 1) and have current uretprobe working with shadow
stack, which is broken at the moment

I'm ok with not using optimized uretprobe when shadow stack is detected
as enabled and we go with current uretprobe in that case

would this work for you?

thanks,
jirka

> 
> But shadow stack is not in widespread use yet, so whether we have the final
> tradeoffs settled is still open I think. For example, other libcs have 
> expressed
> interest in using WRSS.
> 
> I'm also not clear on the typical use of uretprobes (debugging vs production).
> And whether shadow stack + debugging + production will happen seems pretty
> unknown.
> 
> > 
> > Good point. For the security concerning (e.g. leaking sensitive information
> > from secure process which uses shadow stack), we need another limitation
> > which prohibits probing such process even for debugging. But I think that
> > needs another series of patches. We also need to discuss when it should be
> > prohibited and how (e.g. audit interface? SELinux?).
> > But I think this series is just optimizing currently available uprobes with
> > a new syscall. I don't think it changes such security concerning.
> 
> Patch 6 adds support for shadow stack for uretprobes. Currently there is no
> support.
> 
> Peterz had asked that the new solution consider shadow stack support, so I 
> think
> that is how this series grew kind of two goals: new faster uretprobes and
> initial shadow stack support.
> 
> 



Re: [PATCH RT 0/1] Linux v4.19.312-rt134-rc3

2024-05-13 Thread Daniel Wagner
On Mon, May 13, 2024 at 08:59:35AM GMT, Sebastian Andrzej Siewior wrote:
> On 2024-05-07 17:16:47 [+0200], Daniel Wagner wrote:
> > Dear RT Folks,
> > 
> > This is the RT stable review cycle of patch 4.19.312-rt134-rc3.
> > 
> > Please scream at me if I messed something up. Please test the patches
> > too.
> >
> > The -rc release is also available on kernel.org
> 
> I do have to complain a bit. The whole diff contains only a diff against
> localversion while the content of interest not here. But but I guess it
> is what we talked about so... but for everyone else it looks like
> version increment.

Sure, the merge diff should also be included. This feature is missing
in the tooling. I'll add it, so next time the release should also
contain the merge diff.



Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Dongli Zhang
Hi Thomas,

On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but if the old vector on the original CPU remains online, it is not
>> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
> 
> I doubt that.
> 
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
> 
> cpu_down()
>   ...
>   cpu_disable_common()
> fixup_irqs()
>   irq_migrate_all_off_this_cpu()
> migrate_one_irq()
>   irq_force_complete_move()
> free_moved_vector();
> 
> No?

I noticed this and finally abandoned the solution to fix at migrate_one_irq(),
because:

1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
be cleanup before irq_do_set_affinity().

2. The irq_needs_fixup() may return false so that irq_force_complete_move() does
not get the chance to trigger.

3. Even if irq_force_complete_move() is triggered, it exits early if
apicd->prev_vector==0.

The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
cpu_disable_common() releases the vector_lock after CPU is flagged offline.

void cpu_disable_common(void)
{
int cpu = smp_processor_id();

remove_siblinginfo(cpu);

/* It's now safe to remove this processor from the online map */
lock_vector_lock();
remove_cpu_from_maps(cpu); ---> CPU is flagged offline
unlock_vector_lock();  ---> release the vector_lock here!
fixup_irqs();
lapic_offline();
}


Therefore, the bugfix may become something like (just to demo the idea):

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct
apic_chip_data *apicd)
cl->timer.expires = jiffies + 1;
add_timer_on(&cl->timer, cpu);
}
-   } else {
-   apicd->prev_vector = 0; // or print a warning
}
raw_spin_unlock(&vector_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
return false;
}

+   /*
+* Complete an eventually pending irq move cleanup. If this
+* interrupt was moved in hard irq context, then the vectors need
+* to be cleaned up. It can't wait until this interrupt actually
+* happens and this CPU was involved.
+*/
+   irq_force_complete_move(desc);
+
/*
 * No move required, if:
 * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
return false;
}

-   /*
-* Complete an eventually pending irq move cleanup. If this
-* interrupt was moved in hard irq context, then the vectors need
-* to be cleaned up. It can't wait until this interrupt actually
-* happens and this CPU was involved.
-*/
-   irq_force_complete_move(desc);
-
/*
 * If there is a setaffinity pending, then try to reuse the pending
 * mask, so the last change of the affinity does not get lost. If



That's why I modified only __vector_schedule_cleanup(), as it looked simple.

I will fix it in the CPU hotplug interrupt migration code.

> 
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.

I think it is possible. The fact that a CPU is offline doesn't indicate
fixup_irqs() has already been triggered. The vector_lock is released after CPU
is flagged offline.

> 
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warni

Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-05-13 Thread Michal Koutný
On Tue, Apr 09, 2024 at 11:01:26AM GMT, Steven Rostedt  
wrote:
> > -   tpid = pid & (PID_MAX_DEFAULT - 1);
> > +   tpid = pid % PID_MAP_SIZE;
> 
> Does that compile to the same? This is a fast path.

I didn't check.
If fast is the intention, I would change it to something
like BUILD_BUG_ON(!(PID_MAP_SIZE % 2)) and keep the bit operation
without reliance on compiler optimizations.

Thanks for the response (I may not follow up on this single commit
though).

Michal




Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-05-13 Thread Michal Koutný
On Mon, Apr 08, 2024 at 04:58:18PM GMT, Michal Koutný  wrote:
> The kernel provides mechanisms, while it should not imply policies --
> default pid_max seems to be an example of the policy that does not fit
> all. At the same time pid_max must have some value assigned, so use the
> end of the allowed range -- pid_max_max.
> 
> This change thus increases initial pid_max from 32k to 4M (x86_64
> defconfig).

Out of curiosity I dug out the commit
acdc721fe26d ("[PATCH] pid-max-2.5.33-A0") v2.5.34~5
that introduced the 32k default. The commit message doesn't say why such
a sudden change though.
Previously, the limit was 1G of pids (i.e. effectively no default limit,
which is what this series intends).

Honestly, I expected more enthusiasm or reasons against removing the
default value of pid_max. Is this really not of interest to anyone?

(Thanks, Andrew, for your responses. I don't plan to pursue this further
should there be no more interest in having fewer default limit values in
the kernel.)

Regards,
Michal




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-13 Thread Edgecombe, Rick P
On Mon, 2024-05-13 at 18:50 +0900, Masami Hiramatsu wrote:
> > I guess it's doable, we'd need to keep both trampolines around, because
> > shadow stack is enabled by app dynamically and use one based on the
> > state of shadow stack when uretprobe is installed
> > 
> > so you're worried the optimized syscall path could be somehow exploited
> > to add data on shadow stack?

Shadow stack allows for modification to the shadow stack only through a few
limited ways (call, ret, etc). The kernel has the ability to write through
shadow stack protections (for example when pushing and popping signal frames),
but the ways in which it does this are limited in order to try to prevent
providing extra capabilities to attackers wanting to craft their own shadow
stacks.

But the HW features have optional abilities to allow extra patterns of shadow
stack modification for userspace as well. This can facilitate unusual patterns
of stack modification (like in this series). For, x86 there is the ability to
allow an instruction (called WRSS) such that userspace can also write arbitrary
data to the shadow stack. Arm has something like that, plus an instruction to
push to the shadow stack.

There was some debate about whether to use these features, as glibc could not
perfectly match compatibility for features that play with the stack like
longjmp(). As in, without using those extra HW capabilities, some apps would
require modifications to work with shadow stack.

There has been a lot of design tension between security, performance and
compatibility in figuring out how to fit this feature into existing software. In
the end the consensus was to not use these extra HW capabilities, and lean
towards security in the implementation. To try to summarize the debate, this was
because we could get pretty close to compatibility without enabling these extra
features.

So since this solution does something like enabling these extra capabilities in
software that were purposely disabled in HW, it raises eyebrows. Glibc has some
operations that now have extra steps because of shadow stack. So if we could do
something that was still functional, but slower and more secure, then it seems
roughly in line with the tradeoffs we have gone with so far.

But shadow stack is not in widespread use yet, so whether we have the final
tradeoffs settled is still open I think. For example, other libcs have expressed
interest in using WRSS.

I'm also not clear on the typical use of uretprobes (debugging vs production).
And whether shadow stack + debugging + production will happen seems pretty
unknown.

> 
> Good point. For the security concerning (e.g. leaking sensitive information
> from secure process which uses shadow stack), we need another limitation
> which prohibits probing such process even for debugging. But I think that
> needs another series of patches. We also need to discuss when it should be
> prohibited and how (e.g. audit interface? SELinux?).
> But I think this series is just optimizing currently available uprobes with
> a new syscall. I don't think it changes such security concerning.

Patch 6 adds support for shadow stack for uretprobes. Currently there is no
support.

Peterz had asked that the new solution consider shadow stack support, so I think
that is how this series grew kind of two goals: new faster uretprobes and
initial shadow stack support.




Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Thomas Gleixner
On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
> interrupt affinity reconfiguration via procfs. Instead, the change is
> deferred until the next instance of the interrupt being triggered on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity is
> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
> but if the old vector on the original CPU remains online, it is not
> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
> reclaiming process is delayed until the next trigger of the interrupt on
> the new CPU.
>
> Upon the subsequent triggering of the interrupt on the new CPU,
> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
> remains online. Subsequently, the timer on the old CPU iterates over its
> vector_cleanup list, reclaiming vectors.
>
> However, if the old CPU is offline before the interrupt triggers again on
> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
> in vector_matrix, resulting in a CPU vector leak.

I doubt that.

Any interrupt which is affine to an outgoing CPU is migrated and
eventually pending moves are enforced:

cpu_down()
  ...
  cpu_disable_common()
fixup_irqs()
  irq_migrate_all_off_this_cpu()
migrate_one_irq()
  irq_force_complete_move()
free_moved_vector();

No?

In fact irq_complete_move() should never see apicd->move_in_progress
with apicd->prev_cpu pointing to an offline CPU.

The CPU offline case in __vector_schedule_cleanup() should not even
exist or at least just emit a warning.

If you can trigger that case, then there is something fundamentally
wrong with the CPU hotplug interrupt migration code and that needs to be
investigated and fixed.

Thanks,

tglx




Re: [PATCH v1 1/1] Input: gpio-keys - expose wakeup keys in sysfs

2024-05-13 Thread Geert Uytterhoeven
Hi Guido,

On Thu, May 9, 2024 at 2:00 PM Guido Günther  wrote:
> This helps user space to figure out which keys should be used to unidle a
> device. E.g on phones the volume rocker should usually not unblank the
> screen.
>
> Signed-off-by: Guido Günther 

Thanks for your patch!
This is indeed a useful feature.

Reviewed-by: Geert Uytterhoeven 
Tested-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



Re: [RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory

2024-05-13 Thread Honggyu Kim
Hi SeongJae,

Thanks very much for your work!  My reply got delayed due to priority
changes at my workplace, where I am building another heterogeneous memory
allocator.
https://github.com/skhynix/hmsdk/wiki/hmalloc

On Sun, 12 May 2024 10:54:42 -0700 SeongJae Park  wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says there is no implementation of the demote/promote DAMOS action
> are made.  This RFC is about its implementation for physical address
> space.
> 
> Changes from RFC v3
> (https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):

This link cannot be opened.  I will share the link again here.
https://lore.kernel.org/all/20240405060858.2818-1-honggyu@sk.com

>   0. updated from v3 and posted by SJ on behalf of Hunggyu under his
>  approval.
>   1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
>   2. Drop vmstat change

I haven't checked whether I can collect useful information without
vmstat, but the changes look good in general except for that.

>   3. Drop unnecessary page reference check

I will compare this patch series with my previous v3 patchset and get
back to you later maybe next week.  Sorry, I will have another break
this week.

Thanks,
Honggyu



Re: [PATCH v2 6/6] module: add install target for modules.builtin.ranges

2024-05-13 Thread Masahiro Yamada
On Mon, May 13, 2024 at 2:22 PM Masahiro Yamada  wrote:
>
> On Sun, May 12, 2024 at 7:59 AM Kris Van Hees  wrote:
> >
> > When CONFIG_BUILTIN_MODULE_RANGES is enabled, the modules.builtin.ranges
> > file should be installed in the module install location.
> >
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: Nick Alcock 
> > ---
> > Changes since v1:
> >  - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
> > ---
> >  scripts/Makefile.modinst | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
> > index b45586aa1de49..e185dacae892a 100644
> > --- a/scripts/Makefile.modinst
> > +++ b/scripts/Makefile.modinst
> > @@ -36,6 +36,11 @@ install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo mod
> >  $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.objs): $(MODLIB)/%: % FORCE
> > $(call cmd,install)
> >
> > +install-$(CONFIG_BUILTIN_MODULE_RANGES) += $(MODLIB)/modules.builtin.ranges
> > +
> > +$(MODLIB)/modules.builtin.ranges: modules.builtin.ranges FORCE
> > +   $(call cmd,install)
> > +
>
> Why add this to $(addprefix ...) part?


I meant, why don't you add this to $(addprefix ...)



Like this:


$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.ranges): $(MODLIB)/%: % FORCE
	$(call cmd,install)






-- 
Best Regards
Masahiro Yamada



Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-13 Thread Google
On Sat, 11 May 2024 15:09:48 -0600
Jiri Olsa  wrote:

> On Thu, May 09, 2024 at 04:24:37PM +, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-09 at 10:30 +0200, Jiri Olsa wrote:
> > > > Per the earlier discussion, this cannot be reached unless uretprobes are in
> > > > use, which cannot happen without something with privileges taking an action.
> > > > But are uretprobes ever used for monitoring applications where security is
> > > > important? Or is it strictly a debug-time thing?
> > > 
> > > sorry, I don't have that level of detail, but we do have customers
> > > that use uprobes in general or want to use it and complain about
> > > the speed
> > > 
> > > there are several tools in bcc [1] that use uretprobes in scripts,
> > > like:
> > >   memleak, sslsniff, trace, bashreadline, gethostlatency, argdist,
> > >   funclatency
> > 
> > Is it possible to have shadow stack only use the non-syscall solution? It seems
> > it exposes a more limited compatibility in that it only allows writing the
> > specific trampoline address. (IIRC) Then shadow stack users could still use
> > uretprobes, but just not the new optimized solution. There are already
> > operations that are slower with shadow stack, like longjmp(), so this could be
> > ok maybe.
> 
> I guess it's doable, we'd need to keep both trampolines around, because
> shadow stack is enabled by app dynamically and use one based on the
> state of shadow stack when uretprobe is installed
> 
> so you're worried the optimized syscall path could be somehow exploited
> to add data on shadow stack?

Good point. Regarding the security concern (e.g. leaking sensitive information
from a secure process which uses shadow stack), we need another limitation
which prohibits probing such a process even for debugging. But I think that
needs another series of patches. We also need to discuss when it should be
prohibited and how (e.g. audit interface? SELinux?).
But I think this series is just optimizing currently available uprobes with
a new syscall. I don't think it changes that security concern.

Thank you,

> 
> jirka


-- 
Masami Hiramatsu (Google) 



Re: [PATCHv5 bpf-next 7/8] selftests/x86: Add return uprobe shadow stack test

2024-05-13 Thread Google
On Tue,  7 May 2024 12:53:20 +0200
Jiri Olsa  wrote:

> Adding return uprobe test for shadow stack and making sure it's
> working properly. Borrowed some of the code from bpf selftests.

Hi Jiri,

I cannot find a "SKIP" result in this change. If CONFIG_UPROBES=n,
this should skip the uprobe test.

Thank you,

> 
> Signed-off-by: Jiri Olsa 
> ---
>  .../testing/selftests/x86/test_shadow_stack.c | 142 ++
>  1 file changed, 142 insertions(+)
> 
> diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
> index 757e6527f67e..1b919baa999b 100644
> --- a/tools/testing/selftests/x86/test_shadow_stack.c
> +++ b/tools/testing/selftests/x86/test_shadow_stack.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Define the ABI defines if needed, so people can run the tests
> @@ -681,6 +682,141 @@ int test_32bit(void)
>   return !segv_triggered;
>  }
>  
> +static int parse_uint_from_file(const char *file, const char *fmt)
> +{
> + int err, ret;
> + FILE *f;
> +
> + f = fopen(file, "re");
> + if (!f) {
> + err = -errno;
> + printf("failed to open '%s': %d\n", file, err);
> + return err;
> + }
> + err = fscanf(f, fmt, &ret);
> + if (err != 1) {
> + err = err == EOF ? -EIO : -errno;
> + printf("failed to parse '%s': %d\n", file, err);
> + fclose(f);
> + return err;
> + }
> + fclose(f);
> + return ret;
> +}
> +
> +static int determine_uprobe_perf_type(void)
> +{
> + const char *file = "/sys/bus/event_source/devices/uprobe/type";
> +
> + return parse_uint_from_file(file, "%d\n");
> +}
> +
> +static int determine_uprobe_retprobe_bit(void)
> +{
> + const char *file = "/sys/bus/event_source/devices/uprobe/format/retprobe";
> +
> + return parse_uint_from_file(file, "config:%d\n");
> +}
> +
> +static ssize_t get_uprobe_offset(const void *addr)
> +{
> + size_t start, end, base;
> + char buf[256];
> + bool found = false;
> + FILE *f;
> +
> + f = fopen("/proc/self/maps", "r");
> + if (!f)
> + return -errno;
> +
> + while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
> + if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
> + found = true;
> + break;
> + }
> + }
> +
> + fclose(f);
> +
> + if (!found)
> + return -ESRCH;
> +
> + return (uintptr_t)addr - start + base;
> +}
> +
> +static __attribute__((noinline)) void uretprobe_trigger(void)
> +{
> + asm volatile ("");
> +}
> +
> +/*
> + * This test sets up a return uprobe, which is sensitive to shadow stack
> + * (crashes without the extra fix). After executing the uretprobe we fail
> + * the test if we receive SIGSEGV; no crash means we're good.
> + *
> + * Helper functions above borrowed from bpf selftests.
> + */
> +static int test_uretprobe(void)
> +{
> + const size_t attr_sz = sizeof(struct perf_event_attr);
> + const char *file = "/proc/self/exe";
> + int bit, fd = 0, type, err = 1;
> + struct perf_event_attr attr;
> + struct sigaction sa = {};
> + ssize_t offset;
> +
> + type = determine_uprobe_perf_type();
> + if (type < 0)
> + return 1;
> +
> + offset = get_uprobe_offset(uretprobe_trigger);
> + if (offset < 0)
> + return 1;
> +
> + bit = determine_uprobe_retprobe_bit();
> + if (bit < 0)
> + return 1;
> +
> + sa.sa_sigaction = segv_gp_handler;
> + sa.sa_flags = SA_SIGINFO;
> + if (sigaction(SIGSEGV, &sa, NULL))
> + return 1;
> +
> + /* Setup return uprobe through perf event interface. */
> + memset(&attr, 0, attr_sz);
> + attr.size = attr_sz;
> + attr.type = type;
> + attr.config = 1 << bit;
> + attr.config1 = (__u64) (unsigned long) file;
> + attr.config2 = offset;
> +
> + fd = syscall(__NR_perf_event_open, &attr, 0 /* pid */, -1 /* cpu */,
> +  -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
> + if (fd < 0)
> + goto out;
> +
> + if (sigsetjmp(jmp_buffer, 1))
> + goto out;
> +
> + ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
> +
> + /*
> +  * This either segfaults and goes through sigsetjmp above
> +  * or succeeds and we're good.
> +  */
> + uretprobe_trigger();
> +
> + printf("[OK]\tUretprobe test\n");
> + err = 0;
> +
> +out:
> + ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
> + signal(SIGSEGV, SIG_DFL);
> + if (fd)
> + close(fd);
> + return err;
> +}
> +
>  void segv_handler_ptrace(int signum, siginfo_t *si, void *uc)
>  {
>   /* The SSP adjustment caused a segfault. */
> @@ -867,6 +1003,12 @@ int main(int argc, char *argv[])
>   goto out;
>   }
>  
> + if (test_uretprobe())