> On Jul 22, 2018, at 11:27 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> 
>> On Sun, Jul 22, 2018 at 10:45 AM Andy Lutomirski <l...@kernel.org> wrote:
>> 
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
> 
> Me likey.
> 
> However:
> 
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.
> 
> Afaik, it does now potentially expose through meltdown the per-thread
> entry stack info, which is new.

It’s always been exposed through the RO alias. The only new exposure is the 
*address* of the RW alias, I think.

> 
> But I don't think that's a show-stopper.
> 
>> static void __init pti_clone_user_shared(void)
>> {
>> +       for_each_possible_cpu(cpu) {
> 
> But this code is pretty disgusting and seems wrong.
> 
> Do you really want to do all the _possible_ cpu's, not just the
> online ones? I'd rather expose less (think MAXCPU) and then have the
> CPU hotplug code expose the page as the CPU comes up?

We already have exactly the same issue for cpu_entry_area. If we change it, I 
think we should do cpu_entry_area at the same time.  But that’s awkward because 
cpu_entry_area is mapped one PMD at a time right now.

It’s also awkward to expose a percpu page dynamically, because (I think) percpu 
data isn’t guaranteed to all be in the same PGD-sized area. A vmalloc fault in 
the early SYSCALL64 path is fatal.
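
To make the PGD concern concrete, here is a toy user-space sketch, not kernel code. It assumes x86-64 4-level paging, where each top-level entry covers 2^39 bytes. The toy_* names are hypothetical stand-ins for illustration only. It just checks whether two virtual addresses fall under the same top-level entry, which is the property a dynamically exposed percpu page could not rely on:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86-64 4-level paging, 512 top-level entries of 512 GiB each. */
#define TOY_PGDIR_SHIFT  39
#define TOY_PTRS_PER_PGD 512

/* Index of the top-level (PGD) entry that covers virtual address va. */
static unsigned toy_pgd_index(uint64_t va)
{
    return (unsigned)((va >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1));
}

/* True if both addresses are reached through the same PGD entry. */
static bool toy_same_pgd_entry(uint64_t a, uint64_t b)
{
    return toy_pgd_index(a) == toy_pgd_index(b);
}
```

If two percpu chunks ended up under different top-level entries, each entry would have to be present in the user page tables before the first SYSCALL64 on that CPU.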

> 
>> +               unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
>> +               phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>> +               pte_t *target_pte;
>> +
>> +               target_pte = pti_user_pagetable_walk_pte(va);
> 
> This function only exists if CONFIG_X86_VSYSCALL_EMULATION, so it
> won't even compile under (very unusual) configurations.

Oops.

> 
> The "disgusting" part is that I think it could/should share more code
> with the vsyscall case, and the whole target-pte checking and setting
> should be shared too.

I tried that. It was uglier. The percpu code wants to make up a new PTE because 
the real kernel mapping uses large pages. The vsyscall code wants to copy a PTE 
because it’s really a PTE and it has unusual permissions.
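
Roughly, the two cases differ like this. This is a toy user-space model, not the patch itself. The toy_* names and flag macros are hypothetical stand-ins, and NX and the other high bits are ignored:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: 4k pages and a few low permission bits. */
#define TOY_PAGE_SHIFT  12
#define TOY_PTE_PRESENT (1ULL << 0)
#define TOY_PTE_RW      (1ULL << 1)
#define TOY_PTE_USER    (1ULL << 2)

typedef uint64_t toy_pte_t;

/*
 * percpu case: the kernel maps the data with a large page, so there is
 * no 4k PTE to copy. Build a fresh one from the physical address.
 */
static toy_pte_t toy_make_pte(uint64_t pa, uint64_t prot)
{
    uint64_t pfn = pa >> TOY_PAGE_SHIFT;

    /* In this toy, permission bits must fit below the frame number. */
    assert((prot & ~((1ULL << TOY_PAGE_SHIFT) - 1)) == 0);
    return (pfn << TOY_PAGE_SHIFT) | prot;
}

/*
 * vsyscall case: the kernel mapping already is a 4k PTE with unusual
 * permissions, so copy it verbatim to preserve them.
 */
static toy_pte_t toy_copy_pte(const toy_pte_t *src)
{
    return *src;
}
```

Sharing a helper means it has to handle both "invent the permissions" and "preserve whatever is there", which is where it got ugly.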

> 
> Because it's not shared, I react to this:
> 
>> +               set_pte(target_pte, pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL));
> 
> Hmm. The vsyscall code just does
> 
>        *target_pte = ..
> 
> without any set_pte() stuff. Do we want/need the PVOP cases, and if
> so, why doesn't the vsyscall case need it?

It doesn’t need it. I could use plain assignment.

> 
> Anyway, I love the approach, and how this gets rid of the nasty
> trampoline, so no real complaints, just "this needs some fixups".
> 
> 

I’ll do the fixups. I think that, if we want to unmap the pages for CPUs that 
aren’t present, that should be a separate patch. I’m also not convinced it adds 
much value.

In general, PTI is fairly crappy, and it leaks all kinds of information. I 
suspect the worst leak is the NMI stack for local and remote CPUs. Fixing 
*that* is going to be fugly, but may actually be important, because I can 
easily imagine malicious user code that causes arbitrary kernel memory to get 
read and spilled on the NMI stack.

What we *should* do IMO is defer allocation of percpu space for not-present 
CPUs to save a bunch of memory.  But that’s a major change and will probably 
break things.
