"Rafael J. Wysocki" <[EMAIL PROTECTED]> writes: > On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote: >> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes: >> >> > Yes, in kernel/power/disk.c:power_down() . >> > >> > Please comment out the disable_nonboot_cpus() in there and retest (but > please >> > test the latest Linus' tree). >> >> <rant> >> >> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown >> on i386 and x86_64 should take care of that. Further the code that computes >> the boot cpu is bogus (not all architectures require cpu == 0 to be >> the boot cpu), and disabling non boot cpus appears to be a strong >> x86ism, in the first place. > > Yes. > >> If the only reason for disable_nonboot_cpus there is to avoid the >> WARN_ON in init_low_mappings() we should seriously consider killing >> it. > > We have considered it, but no one was sure that it was a good idea.
The problem with the current init_low_mappings is that it hacks the
current page table.  If we can instead use a different page table the
code becomes SMP safe.  I have extracted the patch that addresses this
from the relocatable patchset and appended it to spark ideas.  It goes
a little farther than we need to solve this issue, but the basics are
there.

>> If we can wait for 2.6.22 the relocatable x86_64 patchset that
>> Andi has queued has changes that kill the init_low_mapping() hack.
>
> I think we should kill the WARN_ON() right now, perhaps replacing it
> with a FIXME comment.

Reasonable.

>> I'm not very comfortable with calling cpu_down in a common code path
>> right now either.  I'm fairly certain we still don't have that
>> correct.  So if we confine the mess that is cpu_down to #if
>> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
>> If we start using it everywhere I'm very nervous.
>> Migration when bringing a cpu down is strongly racy, and I don't think
>> we actually put cpus to sleep properly either.
>
> I'm interested in all of the details, please.  I seriously consider
> dropping cpu_up()/cpu_down() from the suspend code paths.

I'm not certain that in a multiple-cpu context we can avoid all of the
issues with cpu hotplug, but there is a reasonable chance, so I will
explain as best I can.

Yanking the appropriate code out of linuxbios, the way a processor
should stop itself is to send an INIT IPI to itself.  This puts a cpu
into an optimized wait-for-startup-IPI state in which it is otherwise
disabled.  This is the state any sane BIOS will put the cpus into
before control is handed off to the kernel.
> static inline void stop_this_cpu(void)
> {
>         unsigned apicid;
>         apicid = lapicid();
>
>         /* Send an APIC INIT to myself */
>         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
>         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT |
>                     LAPIC_DM_INIT);
>         /* Wait for the ipi send to finish */
>         lapic_wait_icr_idle();
>
>         /* Deassert the APIC INIT */
>         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
>         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
>         /* Wait for the ipi send to finish */
>         lapic_wait_icr_idle();
>
>         /* If I haven't halted spin forever */
>         for (;;) {
>                 hlt();
>         }
> }

I'm not certain what to do about the interrupt races, but I will see
if I can explain what I know.

<braindump>
- Most ioapics are buggy.

- Most ioapics do not follow pci-ordering rules with respect to
  interrupt message delivery, so ensuring all in-flight irqs have
  arrived somewhere is very hard.

- To avoid bugs we always limit ourselves to reprogramming the ioapics
  in the interrupt handler, and we do not consider an interrupt
  successfully reprogrammed until we have received an irq in the new
  location.

- On x86 we have two basic interrupt handling modes:
  o logical addressing with lowest priority delivery.
  o physical addressing with delivery to a single cpu.

- With logical addressing, as long as a cpu is not available for
  having an interrupt delivered to it, the interrupt will never be
  delivered to that particular cpu.  Ideally we also update the mask
  in the ioapic to not target that cpu.

- With physical addressing targeting a single cpu, we need to
  reprogram the ioapics not to target that specific cpu.  This needs
  to happen in the interrupt handler, and we need to wait for the next
  interrupt before we tear down our data structures for handling the
  interrupt.  The current cpu hotplug code attempts to reprogram the
  ioapics from process context, which is just wrong.
Now, as part of suspend/resume, I think we should be programming the
hardware not to generate interrupts in the first place, at the actual
hardware devices, so we can likely avoid all of the code that
reprograms interrupts while they are active.  If we can use things
like pci ordering rules to ensure a device will never fire its
interrupt until resumed, we should be able to disable interrupts
synchronously, something we cannot safely do in the current cpu
hotplug scenario.  That should make the problem of doing the work
race-free much easier.

I don't know if other architectures need to disable cpus before doing
a suspend to ram or a suspend to disk.  I also don't know if we have
any code that brings up the other cpus after we have been suspended,
in either the cpu hotplug paths or the architecture specific paths.

The bottom line, after the partial review of the irq handling during
cpu shutdown a little while ago, is that I consider the current cpu
hotplug code to be something that works when you are lucky.  There are
too many hardware bugs to support the general case it tries to
implement.  I have not reviewed the entire cpu hotplug path, nor have
I even tried suspend/resume, but I have been up and down the boot and
shutdown code paths, so I understand the general problem.

The other part I know is that cpu numbers are assigned at the
architecture's discretion.  So unless the architecture code or the
early boot code sets a variable remembering which was the boot cpu,
there is no reason to believe we can deduce it.  In addition, I don't
know if any other architectures have anything resembling the x86
requirement to disable non-boot cpus.  I do know machine_shutdown on
i386 and x86_64 performs that action (if not perfectly), so at least
currently we should be able to do this in architecture specific code.
</braindump>

Eric


From: Vivek Goyal <[EMAIL PROTECTED]>
Subject: [PATCH 12/20] x86_64: 64bit ACPI wakeup trampoline

o Moved wakeup_level4_pgt into the wakeup routine so we can run the
  kernel above 4G.

o Now we first go to 64bit mode and continue to run from the trampoline,
  and only then start accessing kernel symbols and restoring processor
  context.  This enables us to resume even in a relocatable kernel
  context, when the kernel might not be loaded at the physical addr it
  has been compiled for.

o Removed the need for modifying any existing kernel page table.

o Increased the size of the wakeup routine to 8K.  This is required as
  the wakeup page tables are on the trampoline itself and they have to
  be at a 4K boundary, hence one page is not sufficient.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
Signed-off-by: Vivek Goyal <[EMAIL PROTECTED]>
---

 arch/x86_64/kernel/acpi/sleep.c  |   22 ++------------
 arch/x86_64/kernel/acpi/wakeup.S |   59 ++++++++++++++++++++++++---------------
 arch/x86_64/kernel/head.S        |    9 -----
 3 files changed, 41 insertions(+), 49 deletions(-)

diff -puN arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/acpi/sleep.c
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/acpi/sleep.c	2007-03-07 01:28:11.000000000 +0530
@@ -60,17 +60,6 @@
 extern char wakeup_start, wakeup_end;
 
 extern unsigned long acpi_copy_wakeup_routine(unsigned long);
 
-static pgd_t low_ptr;
-
-static void init_low_mapping(void)
-{
-	pgd_t *slot0 = pgd_offset(current->mm, 0UL);
-	low_ptr = *slot0;
-	set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
-	WARN_ON(num_online_cpus() != 1);
-	local_flush_tlb();
-}
-
 /**
  * acpi_save_state_mem - save kernel state
  *
@@ -79,8 +68,6 @@ static void init_low_mapping(void)
  */
 int acpi_save_state_mem(void)
 {
-	init_low_mapping();
-
 	memcpy((void *)acpi_wakeup_address, &wakeup_start,
 	       &wakeup_end - &wakeup_start);
 	acpi_copy_wakeup_routine(acpi_wakeup_address);
 
@@ -93,8 +80,6 @@ int acpi_save_state_mem(void)
  */
 void acpi_restore_state_mem(void)
 {
-	set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
-	local_flush_tlb();
 }
 
 /**
@@ -107,10 +92,11 @@ void acpi_restore_state_mem(void)
  */
 void __init acpi_reserve_bootmem(void)
 {
-	acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE);
-	if ((&wakeup_end - &wakeup_start) > PAGE_SIZE)
+	acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2);
+	if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2))
 		printk(KERN_CRIT
-		       "ACPI: Wakeup code way too big, will crash on attempt to suspend\n");
+		       "ACPI: Wakeup code way too big, will crash on attempt"
+		       " to suspend\n");
 }
 
 static int __init acpi_sleep_setup(char *str)
diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/acpi/wakeup.S	2007-03-07 01:28:11.000000000 +0530
@@ -1,6 +1,7 @@
 .text
 #include <linux/linkage.h>
 #include <asm/segment.h>
+#include <asm/pgtable.h>
 #include <asm/page.h>
 #include <asm/msr.h>
 
@@ -62,12 +63,15 @@ wakeup_code:
 	movb	$0xa2, %al	;  outb %al, $0x80
 
-	lidt	%ds:idt_48a - wakeup_code
-	xorl	%eax, %eax
-	movw	%ds, %ax			# (Convert %ds:gdt to a linear ptr)
-	shll	$4, %eax
-	addl	$(gdta - wakeup_code), %eax
-	movl	%eax, gdt_48a +2 - wakeup_code
+	mov	%ds, %ax			# Find 32bit wakeup_code addr
+	movzx	%ax, %esi			# (Convert %ds:gdt to a linear ptr)
+	shll	$4, %esi
+	# Fix up the vectors
+	addl	%esi, wakeup_32_vector - wakeup_code
+	addl	%esi, wakeup_long64_vector - wakeup_code
+	addl	%esi, gdt_48a + 2 - wakeup_code	# Fixup the gdt pointer
+
+	lidtl	%ds:idt_48a - wakeup_code
 	lgdtl	%ds:gdt_48a - wakeup_code	# load gdt with whatever is
 						# appropriate
 
@@ -80,7 +84,7 @@ wakeup_code:
 
 	.balign 4
 wakeup_32_vector:
-	.long	wakeup_32 - __START_KERNEL_map
+	.long	wakeup_32 - wakeup_code
 	.word	__KERNEL32_CS, 0
 
 .code32
@@ -103,10 +107,6 @@ wakeup_32:
 	movl	$__KERNEL_DS, %eax
 	movl	%eax, %ds
 
-	movl	saved_magic - __START_KERNEL_map, %eax
-	cmpl	$0x9abcdef0, %eax
-	jne	bogus_32_magic
-
 	movw	$0x0e00 + 'i', %ds:(0xb8012)
 	movb	$0xa8, %al	;  outb %al, $0x80;
 
@@ -120,7 +120,7 @@ wakeup_32:
 	movl	%eax, %cr4
 
 	/* Setup early boot stage 4 level pagetables */
-	movl	$(wakeup_level4_pgt - __START_KERNEL_map), %eax
+	leal	(wakeup_level4_pgt - wakeup_code)(%esi), %eax
 	movl	%eax, %cr3
 
 	/* Enable Long Mode */
@@ -159,11 +159,11 @@ wakeup_32:
 	 */
 
 	/* Finally jump in 64bit mode */
-	ljmp	*(wakeup_long64_vector - __START_KERNEL_map)
+	ljmp	*(wakeup_long64_vector - wakeup_code)(%esi)
 
 	.balign 4
 wakeup_long64_vector:
-	.long	wakeup_long64 - __START_KERNEL_map
+	.long	wakeup_long64 - wakeup_code
 	.word	__KERNEL_CS, 0
 
 .code64
@@ -178,11 +178,16 @@ wakeup_long64:
 	 * addresses where we're currently running on. We have to do that here
 	 * because in 32bit we couldn't load a 64bit linear address.
	 */
-	lgdt	cpu_gdt_descr - __START_KERNEL_map
+	lgdt	cpu_gdt_descr
 
 	movw	$0x0e00 + 'n', %ds:(0xb8014)
 	movb	$0xa9, %al	;  outb %al, $0x80
 
+	movq	saved_magic, %rax
+	movq	$0x123456789abcdef0, %rdx
+	cmpq	%rdx, %rax
+	jne	bogus_64_magic
+
 	movw	$0x0e00 + 'u', %ds:(0xb8016)
 	nop
 
@@ -223,20 +228,21 @@ idt_48a:
 gdt_48a:
 	.word	0x800				# gdt limit=2048,
 						#  256 GDT entries
-	.word	0, 0				# gdt base (filled in later)
-
+	.long	gdta - wakeup_code		# gdt base (relocated in later)
 real_magic:	.quad 0
 video_mode:	.quad 0
 video_flags:	.quad 0
 
+.code16
 bogus_real_magic:
 	movb	$0xba,%al	;  outb %al,$0x80
 	jmp bogus_real_magic
 
-bogus_32_magic:
+.code64
+bogus_64_magic:
 	movb	$0xb3,%al	;  outb %al,$0x80
-	jmp bogus_32_magic
+	jmp bogus_64_magic
 
 bogus_cpu:
 	movb	$0xbc,%al	;  outb %al,$0x80
@@ -263,6 +269,7 @@ bogus_cpu:
 #define VIDEO_FIRST_V7 0x0900
 
 # Setting of user mode (AX=mode ID) => CF=success
+.code16
 mode_seta:
 	movw	%ax, %bx
 #if 0
@@ -313,6 +320,13 @@ wakeup_stack_begin:	# Stack grows down
 .org	0xff0
 wakeup_stack:		# Just below end of page
 
+.org	0x1000
+ENTRY(wakeup_level4_pgt)
+	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+	.fill	510,8,0
+	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+
 ENTRY(wakeup_end)
 
 ##
@@ -338,9 +352,10 @@ ENTRY(acpi_copy_wakeup_routine)
 	movq	$0x123456789abcdef0, %rdx
 	movq	%rdx, saved_magic
 
-	movl	saved_magic - __START_KERNEL_map, %eax
-	cmpl	$0x9abcdef0, %eax
-	jne	bogus_32_magic
+	movq	saved_magic, %rax
+	movq	$0x123456789abcdef0, %rdx
+	cmpq	%rdx, %rax
+	jne	bogus_64_magic
 
 	# restore the regs we used
 	popq	%rdx
diff -puN arch/x86_64/kernel/head.S~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/head.S
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/head.S~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/head.S	2007-03-07 01:28:11.000000000 +0530
@@ -308,15 +308,6 @@ NEXT_PAGE(level2_kernel_pgt)
 
 	.data
 
-#ifdef CONFIG_ACPI_SLEEP
-	.align PAGE_SIZE
-ENTRY(wakeup_level4_pgt)
-	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.fill	510,8,0
-	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
-	.quad	level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-#endif
-
 #ifndef CONFIG_HOTPLUG_CPU
 	__INITDATA
 #endif
_
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/