On 20/12/2022 1:51 pm, Jan Beulich wrote: > On 16.12.2022 21:17, Andrew Cooper wrote: >> Partly for clarity because there is a lot of subtle magic at work here. >> Expand the commentary of what's going on. >> >> Also, because there is no need to double copy the stack (32kB) to retrieve >> local values spilled to the stack under the old alias, when all of the >> aforementioned local variables are going out of scope anyway. >> >> There is also a logic change when walking l2_xenmap[]. The old logic would >> skip recursing into 4k mappings; this would corrupt Xen if it were used. >> There are no 4k mappings in l2_xenmap[] this early on boot, so assert the >> property instead of leaving a latent breakage. >> >> Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com> >> --- >> CC: Jan Beulich <jbeul...@suse.com> >> CC: Roger Pau Monné <roger....@citrix.com> >> CC: Wei Liu <w...@xen.org> >> >> We probably want to drop PGE from XEN_MINIMAL_CR4. It will simplify several >> boot paths which don't need to care about global pages, and PGE is >> conditional >> anyway now with the PCID support. > Perhaps, especially if - as you say - this allows for simplification. > >> --- a/xen/arch/x86/setup.c >> +++ b/xen/arch/x86/setup.c >> @@ -467,6 +467,113 @@ static void __init move_memory( >> } >> } >> >> +static void __init noinline move_xen(void) >> +{ >> + l4_pgentry_t *pl4e; >> + l3_pgentry_t *pl3e; >> + l2_pgentry_t *pl2e; >> + unsigned long tmp; >> + int i, j, k; > Can these become "unsigned int" please at this occasion?
Fixed. > (Perhaps as > another reason to make the change, mention that the old instances of i and > j did shadow outer scope variables in the same function?) Oh, so it does. I'm slightly surprised that we haven't seen a compiler objection from that. > >> + /* >> + * The caller has selected xen_phys_start, ensuring that the old and new >> + * locations do not overlap. Make it so. > As a non-native speaker I'm struggling with what "Make it so" is supposed > to tell me here. "Actually do it" I'll drop it. It's not overly important to the understanding here. > >> + * Prevent the compiler from reordering anything across this point. >> Such >> + * things will end badly. >> + */ >> + barrier(); >> + >> + /* >> + * Copy out of the current alias, into the directmap at the new >> location. >> + * This includes a snapshot of the current stack. >> + */ >> + memcpy(__va(__pa(_start)), _start, _end - _start); >> + >> + /* >> + * We are now in a critical region. Any write (modifying a global, >> + * spilling a local to the stack, etc) via the current alias will get >> lost >> + * when we switch to the new pagetables. This even means no printk()'s >> + * debugging purposes. > Nit: Missing "for"? Fixed. > >> + * Walk the soon-to-be-used pagetables in the new location, relocating >> all >> + * intermediate (non-leaf) entries to point to their new-location >> + * equivalents. All writes are via the directmap alias. >> + */ >> + pl4e = __va(__pa(idle_pg_table)); >> + for ( i = 0 ; i < L4_PAGETABLE_ENTRIES; i++, pl4e++ ) >> + { >> + if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) ) >> + continue; >> + >> + *pl4e = l4e_from_intpte(l4e_get_intpte(*pl4e) + xen_phys_start); >> + pl3e = __va(l4e_get_paddr(*pl4e)); >> + for ( j = 0; j < L3_PAGETABLE_ENTRIES; j++, pl3e++ ) >> + { >> + if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) || >> + (l3e_get_flags(*pl3e) & _PAGE_PSE) ) >> + continue; >> + >> + *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) + xen_phys_start); >> + pl2e = __va(l3e_get_paddr(*pl3e)); >> + for ( k = 0; k < L2_PAGETABLE_ENTRIES; k++, pl2e++ ) >> + { >> + if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) || >> + (l2e_get_flags(*pl2e) & _PAGE_PSE) ) >> + continue; >> + >> + *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + >> xen_phys_start); >> + } >> + } >> + } >> + >> + /* >> + * Walk the soon-to-be-used l2_xenmap[], relocating all the leaf >> mappings >> + * so text/data/bss etc refer to the new location in memory. >> + */ >> + pl2e = __va(__pa(l2_xenmap)); >> + for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ ) >> + { >> + if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ) >> + continue; >> + >> + /* >> + * We don't have 4k mappings in l2_xenmap[] at this point in boot, >> nor >> + * is this anticipated to change. Simply assert the fact, rather >> than >> + * introducing dead logic to decend into l1 tables. > Nit: "descend"? Fixed. > >> + */ >> + ASSERT(l2e_get_flags(*pl2e) & _PAGE_PSE); > The warning about the use of printk() around here could make one think > that using ASSERT() (or BUG()) is similarly bad. However, aiui using > printk() isn't by itself a problem, just that the log message would be > lost as far as the circular buffer goes. The message would be observable > on the serial con sole though, at least with "sync_console". It is on > this basis that using ASSERT() here makes sense. Perhaps the printk() > related comment could be slightly adjusted to express this? I did try to describe it, and it gets complicated, so I opted for the simple approach. Before the pagetable switch, it will operate "properly" and emit text onto the screen, serial, etc. After the pagetable switch, all Xen data will be lost, which includes the main console buffer, but also things like the local state for Xen's UART driver. I gave up trying to reason what would happen at that point. The ASSERT() is very much a best-effort attempt. Anyone liable to trip the assert is already working on the boot pagetable and ought to be aware. > >> + *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start); >> + } >> + >> + /* >> + * Switch to relocated pagetables, shooting down global mappings as we >> go. >> + */ >> + asm volatile ( >> + "mov %%cr4, %[cr4]\n\t" >> + "andb $~%c[pge], %b[cr4]\n\t" >> + "mov %[cr4], %%cr4\n\t" /* CR4.PGE = 0 */ >> + "mov %[cr3], %%cr3\n\t" /* CR3 = new pagetables */ >> + "orb $%c[pge], %b[cr4]\n\t" > I understand you need %c to apply the ~ in the operand of ANDB above. > But could I talk you into keeping things as simple as possible here > by using plain %[pge]? The $ is still useful in the second case, for consistency if nothing else. Both of these instances disappear if PGE gets cleaned up on boot. > >> + "mov %[cr4], %%cr4\n\t" /* CR4.PGE = 1 */ >> + : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */ >> + : [cr3] "r" (__pa(idle_pg_table)), >> + [pge] "i" (X86_CR4_PGE) >> + : "memory" ); > The removed stack copying worries me, to be honest. Yes, for local > variables of ours it doesn't matter because they are about to go out > of scope. But what if the compiler instantiates any for its own use, > e.g. ... > >> + /* >> + * End of the critical region. Updates to locals and globals now work >> as >> + * expected. >> + * >> + * Updates to local variables which have been spilled to the stack since >> + * the memcpy() have been lost. But we don't care, because they're all >> + * going out of scope imminently. >> + */ >> + >> + printk("New Xen image base address: %#lx\n", xen_phys_start); > ... the result of the address calculation for the string literal > here? Such auxiliary calculations can happen at any point in the > function, and won't necessarily (hence the example chosen) become > impossible for the compiler to do with the memory clobber in the > asm(). And while the string literal's address is likely cheap > enough to calculate right in the function invocation sequence (and > an option would also be to retain the printk() in the caller), > other instrumentation options could be undermined by this as well. Right now, the compiler is free to calculate the address of the string literal in a register, and move it the other side of the TLB flush. This will work just fine. However, the compiler cannot now, or ever in the future, spill such a calculation to the stack. Whether it's spelt "memory", or something else in the future, OSes really do genuinely need a way of telling the compiler "you literally cannot move anything the other side of this asm()", because otherwise malfunctions will occur. It's not the locals which worry me - we really do lose things like coverage data in here. Short of writing it fully in asm (which would be irritating to maintain), I don't see any other option. ~Andrew