On Sun, Jun 16, 2019 at 3:24 PM Waldek Kozaczuk <[email protected]> wrote:
> > > On Sunday, June 16, 2019 at 8:22:03 AM UTC-4, Waldek Kozaczuk wrote:
>>
>> On Sunday, June 16, 2019 at 5:40:46 AM UTC-4, Nadav Har'El wrote:
>>>
>>> On Sun, Jun 16, 2019 at 8:05 AM Waldemar Kozaczuk <[email protected]> wrote:
>>>
>>>> This patch provides all necessary changes to move the OSv kernel 1 GiB higher
>>>> in the virtual memory space, to start at 0x40200000. Most changes involve adding
>>>> or subtracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
>>>> note that the kernel is still loaded at 2 MiB in physical memory.
>>>
>>> Hi, overall I think this is a very good direction, because there's indeed
>>> no reason why the kernel's physical and virtual addresses need to be the
>>> same, and the assumption that they were was hidden throughout our code.
>>> I do have a couple of questions inline below, though.
>>>
>>> By the way, I personally feel that it would be more convenient to just
>>> control the virtual address of the kernel directly - not as a "shift" from
>>> its physical address - but I guess it's just a matter of taste, and some of
>>> the code below is indeed more natural when written with the "shift"
>>> available directly. Perhaps we could have the best of both worlds -
>>> define the kernel *position* as the primary setting, and then define
>>> OSV_KERNEL_VM_SHIFT as simply the subtraction of the kernel's virtual
>>> position and its physical position.
>>
>> Either way there should be only a single variable, so either:
>> 1) the shift one is a function of the virtual address (shift = virtual - physical), or
>> 2) the virtual address is a function of the shift (virtual = shift + physical).
>> We might first apply the other patch I sent - "Make rebuilding loader.elf
>> automatic and more efficient when changing kernel_base" - which could make it
>> handy to define it in a single header rather than in the Makefile. In either
>> case this could be changed later.
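The two options above are equivalent; a tiny illustrative sketch (not OSv code - the names simply mirror the Makefile variables discussed in the thread) shows the invariant with the concrete values from this patch:

```python
# Illustrative sketch of the single-variable invariant discussed above,
# using the concrete values from this patch.
OSV_KERNEL_BASE = 0x200000        # physical address the kernel is loaded at
OSV_KERNEL_VM_SHIFT = 0x40000000  # 1 GiB offset between virtual and physical

# Option 2: derive the virtual base from the shift
OSV_KERNEL_VM_BASE = OSV_KERNEL_BASE + OSV_KERNEL_VM_SHIFT
assert OSV_KERNEL_VM_BASE == 0x40200000

# Option 1 is just the inverse: derive the shift from the virtual base
assert OSV_KERNEL_VM_BASE - OSV_KERNEL_BASE == OSV_KERNEL_VM_SHIFT
print(hex(OSV_KERNEL_VM_BASE))
```

Whichever of the two is chosen as the primary setting, the other is a one-line derivation, so only one constant ever needs to be edited.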
>>
> Also possible the OSV_KERNEL_VM_BASE is the better name instead of
> OSV_KERNEL_VM_SHIFT.
> OSV_KERNEL_VM_SHIFT is a good name for the 0x40000000 and OSV_KERNEL_VM_BASE
> for the 0x40200000. As you said, we just need one of them and the second can
> be calculated by adding/subtracting the two constants.
>
>>> Also, I see your patch makes 0x40000000 the default. I'm not sure I'm
>>> following all the details here - is there any *downside* to this change
>>> from the previous default? I can't think of any, but wondering if there's
>>> something I am missing.
>>
>> I am not sure what change you are talking about. Just to clarify (per this patch):
>> OSV_KERNEL_VM_SHIFT = 0x40000000
>> phys_elf_start = 0x200000
>> elf_start = phys_elf_start + OSV_KERNEL_VM_SHIFT = 0x40200000
>> So in the comments I meant that "really" the kernel starts at 0x40200000 in
>> the VMA.
>>
>>>> The motivation for this patch is to make as much space as possible (or
>>>> just enough) in virtual memory to allow running unmodified Linux non-PIE
>>>> executables (issue #190). Even though, due to the advancement of ASLR,
>>>> more and more applications are PIEs (Position Independent Executables),
>>>> which are pretty well supported by OSv, there are still many non-PIEs
>>>> (Position Dependent Executables) out there. The most prominent one is
>>>> actually the JVM, most of whose distributions come with a tiny (~20K)
>>>> bootstrap java non-PIE executable. There are many other examples where a
>>>> small non-PIE executable loads other shared libraries.
>>>>
>>>> As issue #1043 explains, there are at least 3 possible solutions, and
>>>> this patch implements the 3rd (last) one described there.
Please note >>>> that in future >>>> with little effort we could provide slightly beter scheme for >>>> OSV_KERNEL_VM_SHIFT >>>> that would allow us to place the kernel even higher at the end of the >>>> 2GiB limit (small memory model) >>>> and thus support virtually any non-PIE built using small memory model. >>>> >>>> Fixes #1043 >>>> >>>> Signed-off-by: Waldemar Kozaczuk <[email protected]> >>>> --- >>>> Makefile | 6 +++-- >>>> arch/x64/arch-setup.cc | 39 +++++++++++++++++----------- >>>> arch/x64/boot.S | 38 +++++++++++++++++++++------- >>>> arch/x64/entry-xen.S | 3 ++- >>>> arch/x64/loader.ld | 53 ++++++++++++++++++++++----------------- >>>> arch/x64/vmlinux-boot64.S | 6 +++-- >>>> core/elf.cc | 2 +- >>>> core/mmu.cc | 8 +++--- >>>> loader.cc | 3 ++- >>>> 9 files changed, 100 insertions(+), 58 deletions(-) >>>> >>>> diff --git a/Makefile b/Makefile >>>> index 314e74f4..fcb0121a 100644 >>>> --- a/Makefile >>>> +++ b/Makefile >>>> @@ -312,7 +312,7 @@ gcc-sysroot = $(if $(CROSS_PREFIX), --sysroot >>>> external/$(arch)/gcc.bin) \ >>>> # To add something that will *not* be part of the main kernel, you can >>>> do: >>>> # >>>> # mydir/*.o EXTRA_FLAGS = <MY_STUFF> >>>> -EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) >>>> -DOSV_LZKERNEL_BASE=$(lzkernel_base) >>>> +EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) >>>> -DOSV_LZKERNEL_BASE=$(lzkernel_base) >>>> -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) >>>> EXTRA_LIBS = >>>> COMMON = $(autodepend) -g -Wall -Wno-pointer-arith $(CFLAGS_WERROR) >>>> -Wformat=0 -Wno-format-security \ >>>> -D __BSD_VISIBLE=1 -U _FORTIFY_SOURCE -fno-stack-protector >>>> $(INCLUDES) \ >>>> @@ -421,6 +421,7 @@ ifeq ($(arch),x64) >>>> # lzkernel_base is where the compressed kernel is loaded from disk. 
>>>> kernel_base := 0x200000 >>>> lzkernel_base := 0x100000 >>>> +kernel_vm_shift := 0x40000000 >>>> >>>> $(out)/arch/x64/boot16.o: $(out)/lzloader.elf >>>> $(out)/boot.bin: arch/x64/boot16.ld $(out)/arch/x64/boot16.o >>>> @@ -480,6 +481,7 @@ endif # x64 >>>> ifeq ($(arch),aarch64) >>>> >>>> kernel_base := 0x40080000 >>>> +kernel_vm_shift := 0x0 >>>> >>>> include $(libfdt_base)/Makefile.libfdt >>>> libfdt-source := $(patsubst %.c, $(libfdt_base)/%.c, $(LIBFDT_SRCS)) >>>> @@ -1872,7 +1874,7 @@ stage1: $(stage1_targets) links >>>> .PHONY: stage1 >>>> >>>> $(out)/loader.elf: $(stage1_targets) arch/$(arch)/loader.ld >>>> $(out)/bootfs.o >>>> - $(call quiet, $(LD) -o $@ >>>> --defsym=OSV_KERNEL_BASE=$(kernel_base) \ >>>> + $(call quiet, $(LD) -o $@ >>>> --defsym=OSV_KERNEL_BASE=$(kernel_base) >>>> --defsym=OSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) \ >>>> -Bdynamic --export-dynamic --eh-frame-hdr >>>> --enable-new-dtags \ >>>> $(^:%.ld=-T %.ld) \ >>>> --whole-archive \ >>>> diff --git a/arch/x64/arch-setup.cc b/arch/x64/arch-setup.cc >>>> index 62236486..210f2d7e 100644 >>>> --- a/arch/x64/arch-setup.cc >>>> +++ b/arch/x64/arch-setup.cc >>>> @@ -85,12 +85,15 @@ extern boot_time_chart boot_time; >>>> // it by placing address of start32 at the known offset at memory >>>> // as defined by section .start32_address in loader.ld >>>> extern "C" void start32(); >>>> -void * __attribute__((section (".start32_address"))) start32_address = >>>> reinterpret_cast<void*>(&start32); >>>> +void * __attribute__((section (".start32_address"))) start32_address = >>>> + reinterpret_cast<void*>((long)&start32 - OSV_KERNEL_VM_SHIFT); >>>> >>>> void arch_setup_free_memory() >>>> { >>>> - static ulong edata; >>>> + static ulong edata, edata_phys; >>>> asm ("movl $.edata, %0" : "=rm"(edata)); >>>> + edata_phys = edata - OSV_KERNEL_VM_SHIFT; >>>> + >>>> // copy to stack so we don't free it now >>>> auto omb = *osv_multiboot_info; >>>> auto mb = omb.mb; >>>> @@ -129,13 +132,13 @@ void 
arch_setup_free_memory() >>>> // page tables have been set up, so we can't reference the memory >>>> being >>>> // freed. >>>> for_each_e820_entry(e820_buffer, e820_size, [] (e820ent ent) { >>>> - // can't free anything below edata, it's core code. >>>> + // can't free anything below edata_phys, it's core code. >>>> // can't free anything below kernel at this moment >>>> - if (ent.addr + ent.size <= edata) { >>>> + if (ent.addr + ent.size <= edata_phys) { >>>> return; >>>> } >>>> - if (intersects(ent, edata)) { >>>> - ent = truncate_below(ent, edata); >>>> + if (intersects(ent, edata_phys)) { >>>> + ent = truncate_below(ent, edata_phys); >>>> } >>>> // ignore anything above 1GB, we haven't mapped it yet >>>> if (intersects(ent, initial_map)) { >>>> @@ -149,21 +152,27 @@ void arch_setup_free_memory() >>>> auto base = reinterpret_cast<void*>(get_mem_area_base(area)); >>>> mmu::linear_map(base, 0, initial_map, initial_map); >>>> } >>>> - // map the core, loaded 1:1 by the boot loader >>>> - mmu::phys elf_phys = reinterpret_cast<mmu::phys>(elf_header); >>>> - elf_start = reinterpret_cast<void*>(elf_header); >>>> - elf_size = edata - elf_phys; >>>> - mmu::linear_map(elf_start, elf_phys, elf_size, OSV_KERNEL_BASE); >>>> + // Map the core, loaded by the boot loader >>>> + // In order to properly setup mapping between virtual >>>> + // and physical we need to take into account where kernel >>>> + // is loaded in physical memory - elf_phys_start - and >>>> + // where it is linked to start in virtual memory - elf_start >>>> + static mmu::phys elf_phys_start = >>>> reinterpret_cast<mmu::phys>(elf_header); >>>> + // There is simple invariant between elf_phys_start and elf_start >>>> + // as expressed by the assignment below >>>> + elf_start = reinterpret_cast<void*>(elf_phys_start + >>>> OSV_KERNEL_VM_SHIFT); >>>> + elf_size = edata_phys - elf_phys_start; >>>> + mmu::linear_map(elf_start, elf_phys_start, elf_size, >>>> OSV_KERNEL_BASE); >>>> // get rid of the command line, 
before low memory is unmapped >>>> parse_cmdline(mb); >>>> // now that we have some free memory, we can start mapping the rest >>>> mmu::switch_to_runtime_page_tables(); >>>> for_each_e820_entry(e820_buffer, e820_size, [] (e820ent ent) { >>>> // >>>> - // Free the memory below elf_start which we could not before >>>> - if (ent.addr < (u64)elf_start) { >>>> - if (ent.addr + ent.size >= (u64)elf_start) { >>>> - ent = truncate_above(ent, (u64) elf_start); >>>> + // Free the memory below elf_phys_start which we could not >>>> before >>>> + if (ent.addr < (u64)elf_phys_start) { >>>> + if (ent.addr + ent.size >= (u64)elf_phys_start) { >>>> + ent = truncate_above(ent, (u64) elf_phys_start); >>>> } >>>> mmu::free_initial_memory_range(ent.addr, ent.size); >>>> return; >>>> diff --git a/arch/x64/boot.S b/arch/x64/boot.S >>>> index 1402e5d0..91c25d5a 100644 >>>> --- a/arch/x64/boot.S >>>> +++ b/arch/x64/boot.S >>>> @@ -24,13 +24,25 @@ >>>> .align 4096 >>>> .global ident_pt_l4 >>>> ident_pt_l4: >>>> - .quad ident_pt_l3 + 0x67 >>>> + # The addresses of the paging tables have to be the physical ones, >>>> so we have to >>>> + # manually subtract OSV_KERNEL_VM_SHIFT in all relevant places >>>> + .quad ident_pt_l3 + 0x67 - OSV_KERNEL_VM_SHIFT >>>> .rept 511 >>>> .quad 0 >>>> .endr >>>> ident_pt_l3: >>>> - .quad ident_pt_l2 + 0x67 >>>> - .rept 511 >>>> + # Each of the 512 entries in this table maps the very 1st 512 GiB >>>> of >>>> + # virtual address space 1 GiB at a time >>>> + # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 >>>> table >>>> + # that specifies addresses of every one of 512 2MiB slots of >>>> physical memory >>>> + .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT >>>> + # The 2nd entry maps 2nd GiB to the same 1st GiB of physical >>>> memory by pointing >>>> + # to the same ident_pt_l2 table as the 1st entry above >>>> + # This way we effectively provide correct mapping for the kernel >>>> linked >>>> + # to start at 1 GiB + 2 MiB (0x40200000) in 
virtual memory and point to
>>>> + # 2 MiB address (0x200000) where it starts in physical memory
>>>> + .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
>>>
>>> Oh, but doesn't this mean that this only works correctly when
>>> OSV_KERNEL_VM_SHIFT is *exactly* 1 GB? I.e., the reason why you want the
>>> mapping of the second gigabyte to be identical to the first gigabyte is
>>> just because the shift is exactly 1 GB?
>>
>> That is correct. The general scheme (which I am planning to make part of
>> the next patch at some point) should be this:
>> OSV_KERNEL_VM_SHIFT = 1 GiB + N * 2 MiB, where 0 <= N < 500 (more or less,
>> as the last 24 MB of the 2nd GB should be enough for the kernel).
>> But then, instead of re-using and pointing to the ident_pt_l2 table, I will
>> have to define an extra instance of an ident_pt_l2-equivalent table where
>> the first N entries will be zero.
>>
>>> If this is the case (please correct me if I misunderstood!), this code
>>> needs to be more sophisticated to handle a general OSV_KERNEL_VM_SHIFT, or
>>> you need some sort of compile-time check to verify that
>>> OSV_KERNEL_VM_SHIFT must be set to 1 GB and nothing else.
>>
>> Well, this will be enforced by the linker, which should complain if the
>> kernel code goes beyond the 2 GiB limit in VM (small model).
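The generalized scheme sketched above can be illustrated with a hypothetical back-of-the-envelope calculation (illustrative Python, not OSv code; `kernel_vm_base` is a made-up helper name):

```python
# Hypothetical sketch of the generalized scheme: the shift is
# 1 GiB + N * 2 MiB, so the kernel's virtual base always falls in the
# second gigabyte and stays under the 2 GiB small-memory-model limit
# that the linker enforces.
GiB = 1 << 30
MiB = 1 << 20
KERNEL_PHYS_BASE = 0x200000  # kernel load address in physical memory

def kernel_vm_base(n):
    """Virtual base of the kernel for a given N (0 <= N < 500)."""
    assert 0 <= n < 500
    shift = GiB + n * 2 * MiB   # OSV_KERNEL_VM_SHIFT for this N
    return KERNEL_PHYS_BASE + shift

# N = 0 reproduces this patch's layout: kernel at 0x40200000 in the VMA
assert kernel_vm_base(0) == 0x40200000
# every valid N keeps the kernel below the 2 GiB limit
assert all(kernel_vm_base(n) < 2 * GiB for n in range(500))
print(hex(kernel_vm_base(499)))
```

With N = 0 this reduces to the mapping in the patch, where the l3 entry for the second gigabyte simply reuses ident_pt_l2; for N > 0 an extra l2-style table with N leading zero entries would be needed, as noted above.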
>> >>> >>> >>> >>>> + .rept 510 >>>> .quad 0 >>>> .endr >>>> ident_pt_l2: >>>> @@ -42,7 +54,8 @@ ident_pt_l2: >>>> >>>> gdt_desc: >>>> .short gdt_end - gdt - 1 >>>> - .long gdt >>>> + # subtract OSV_KERNEL_VM_SHIFT because when gdt_desc is >>>> referenced, the memory is mapped 1:1 >>>> + .long gdt - OSV_KERNEL_VM_SHIFT >>>> >>>> # Set up the 64-bit compatible version of GDT description structure >>>> # that points to the same GDT (Global segments Descriptors Table) and >>>> @@ -53,7 +66,8 @@ gdt_desc: >>>> .align 8 >>>> gdt64_desc: >>>> .short gdt_end - gdt - 1 >>>> - .quad gdt >>>> + # subtract OSV_KERNEL_VM_SHIFT because when gdt64_desc is >>>> referenced, the memory is mapped 1:1 >>>> + .quad gdt - OSV_KERNEL_VM_SHIFT >>>> >>>> .align 8 >>>> gdt = . - 8 >>>> @@ -77,10 +91,12 @@ init_stack_top = . >>>> .globl start32 >>>> .globl start32_from_64 >>>> start32: >>>> + # Because the memory is mapped 1:1 at this point, we have to >>>> manualy >>>> + # subtract OSV_KERNEL_VM_SHIFT from virtual addresses in all >>>> relevant places >>>> # boot16.S set %eax to ELF start address, we'll use it later >>>> mov %eax, %ebp >>>> mov $0x0, %edi >>>> - lgdt gdt_desc >>>> + lgdt gdt_desc-OSV_KERNEL_VM_SHIFT >>>> >>>> # Add an address the vmlinux_entry64 will jump to when >>>> # switching from 64-bit to 32-bit mode >>>> @@ -91,7 +107,7 @@ start32_from_64: >>>> mov %eax, %fs >>>> mov %eax, %gs >>>> mov %eax, %ss >>>> - ljmp $0x18, $1f >>>> + ljmp $0x18, $1f-OSV_KERNEL_VM_SHIFT >>>> 1: >>>> and $~7, %esp >>>> # Enable PAE (Physical Address Extension) - ability to address 64GB >>>> @@ -101,6 +117,9 @@ start32_from_64: >>>> >>>> # Set root of a page table in cr3 >>>> lea ident_pt_l4, %eax >>>> + # The address of the root paging table has to be physical >>>> + # so substract OSV_KERNEL_VM_SHIFT from ident_pt_l4 >>>> + sub $OSV_KERNEL_VM_SHIFT, %eax >>>> mov %eax, %cr3 >>>> >>>> # Set long mode >>>> @@ -128,7 +147,7 @@ start64: >>>> jz start64_continue >>>> call extract_linux_boot_params 
>>>> mov $0x1000, %rbx >>>> - mov $0x200000, %rbp >>>> + mov $OSV_KERNEL_BASE, %rbp >>>> >>> >>> Good catch. >>> >> Please also note that (I think) we unnecessarily set ebp/rbp in all these >> places and then pass all the way to arch-setup.cc but in reality the >> 0x200000 is predefined in the makefile so I am not sure what is the point >> of this code in *.S files. Either way we can remove this redundancy later. >> >>> >>>> start64_continue: >>>> lea .bss, %rdi >>>> @@ -168,6 +187,7 @@ smpboot: >>>> mov smpboot_cr4-smpboot, %eax >>>> mov %eax, %cr4 >>>> lea ident_pt_l4, %eax >>>> + sub $OSV_KERNEL_VM_SHIFT, %eax >>>> mov %eax, %cr3 >>>> mov smpboot_efer-smpboot, %eax >>>> mov smpboot_efer+4-smpboot, %edx >>>> @@ -181,7 +201,7 @@ smpboot: >>>> >>>> smpboot_gdt_desc: >>>> .short gdt_end - gdt - 1 >>>> - .long gdt >>>> + .long gdt - OSV_KERNEL_VM_SHIFT >>>> .global smpboot_cr0 >>>> smpboot_cr0: >>>> .long 0 >>>> diff --git a/arch/x64/entry-xen.S b/arch/x64/entry-xen.S >>>> index 11f72da4..81342284 100644 >>>> --- a/arch/x64/entry-xen.S >>>> +++ b/arch/x64/entry-xen.S >>>> @@ -23,7 +23,7 @@ >>>> >>>> elfnote_val(XEN_ELFNOTE_ENTRY, xen_start) >>>> elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page) >>>> -elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0) >>>> +elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT) >>>> >>> >>> I have no idea what this does :-( >>> >> Here is the article I read - >> https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management (look >> for "Start Of Day" section) >> >>> >>> >>>> elfnote_str(XEN_ELFNOTE_XEN_VERSION, "xen-3.0") >>>> elfnote_str(XEN_ELFNOTE_GUEST_OS, "osv") >>>> elfnote_str(XEN_ELFNOTE_GUEST_VERSION, "?.?") >>>> @@ -50,4 +50,5 @@ xen_start: >>>> mov %rsp, xen_bootstrap_end >>>> mov %rsi, %rdi >>>> call xen_init >>>> + mov $0x0, %rdi >>>> jmp start64 >>>> diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld >>>> index caae1f68..a3ac0790 100644 >>>> --- a/arch/x64/loader.ld >>>> +++ b/arch/x64/loader.ld >>>> @@ -15,15 +15,21 @@ 
SECTIONS >>>> * We can't export the ELF header base as a symbol, because ld >>>> * insists on moving stuff around if we do. >>>> * >>>> + * Make kernel start OSV_KERNEL_VM_SHIFT bytes higher than >>>> where it >>>> + * starts in physical memory. Also put AT() expressions in all >>>> + * sections to enforce their placements in physical memory lower >>>> + * by OSV_KERNEL_VM_SHIFT bytes. >>>> >>> >>> I'm afraid I didn't understand the "AT()" part. Why do we need to add >>> OSV_KERNEL_VM_SHIFT >>> On the first line ( . = OSV_KERNEL_BASE + 0x800 + OSV_KERNEL_VM_SHIFT) >>> but then >>> subtract it back on every single address calculation that follows? I >>> didn't understand what >>> you are trying to achieve. >>> >> This part is critical to keep the segments (and sections they are made >> of) in right place in physical memory. It makes paddr stay in tact at >> 0x200000 and for example firecracker relies on it. Without AT the >> loader.elf would be 1GB. >> >>> >>> >>>> + */ >>>> + . = OSV_KERNEL_BASE + 0x800 + OSV_KERNEL_VM_SHIFT; >>>> >>> + /* >>>> * Place address of start32 routine at predefined offset in >>>> memory >>>> */ >>>> - . = OSV_KERNEL_BASE + 0x800; >>>> - .start32_address : { >>>> + .start32_address : AT(ADDR(.start32_address) - >>>> OSV_KERNEL_VM_SHIFT) { >>>> *(.start32_address) >>>> } >>>> - . = OSV_KERNEL_BASE + 0x1000; >>>> - .dynamic : { *(.dynamic) } :dynamic :text >>>> - .text : { >>>> + . = OSV_KERNEL_BASE + 0x1000 + OSV_KERNEL_VM_SHIFT; >>>> + .dynamic : AT(ADDR(.dynamic) - OSV_KERNEL_VM_SHIFT) { *(.dynamic) >>>> } :dynamic :text >>>> + .text : AT(ADDR(.text) - OSV_KERNEL_VM_SHIFT) { >>>> text_start = .; >>>> *(.text.hot .text.hot.*) >>>> *(.text.unlikely .text.*_unlikely) >>>> @@ -31,60 +37,61 @@ SECTIONS >>>> *(.text.startup .text.startup.*) >>>> *(.text .text.*) >>>> text_end = .; >>>> + PROVIDE(low_vmlinux_entry64 = vmlinux_entry64 - >>>> OSV_KERNEL_VM_SHIFT); >>>> } :text >>>> . 
= ALIGN(8); >>>> - .fixup : { >>>> + .fixup : AT(ADDR(.fixup) - OSV_KERNEL_VM_SHIFT) { >>>> fault_fixup_start = .; >>>> *(.fixup) >>>> fault_fixup_end = .; >>>> } :text >>>> >>>> . = ALIGN(8); >>>> - .memcpy_decode : { >>>> + .memcpy_decode : AT(ADDR(.memcpy_decode) - OSV_KERNEL_VM_SHIFT) { >>>> memcpy_decode_start = .; >>>> *(.memcpy_decode) >>>> memcpy_decode_end = .; >>>> } :text >>>> >>>> - .eh_frame : { *(.eh_frame) } : text >>>> - .rodata : { *(.rodata*) } :text >>>> - .eh_frame : { *(.eh_frame) } :text >>>> - .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame >>>> - .note : { *(.note*) } :text :note >>>> - .gcc_except_table : { *(.gcc_except_table) *(.gcc_except_table.*) >>>> } : text >>>> - .tracepoint_patch_sites ALIGN(8) : { >>>> + .eh_frame : AT(ADDR(.eh_frame) - OSV_KERNEL_VM_SHIFT) { >>>> *(.eh_frame) } : text >>>> + .rodata : AT(ADDR(.rodata) - OSV_KERNEL_VM_SHIFT) { *(.rodata*) } >>>> :text >>>> + .eh_frame : AT(ADDR(.eh_frame) - OSV_KERNEL_VM_SHIFT) { >>>> *(.eh_frame) } :text >>>> + .eh_frame_hdr : AT(ADDR(.eh_frame_hdr) - OSV_KERNEL_VM_SHIFT) { >>>> *(.eh_frame_hdr) } :text :eh_frame >>>> + .note : AT(ADDR(.note) - OSV_KERNEL_VM_SHIFT) { *(.note*) } :text >>>> :note >>>> + .gcc_except_table : AT(ADDR(.gcc_except_table) - >>>> OSV_KERNEL_VM_SHIFT) { *(.gcc_except_table) *(.gcc_except_table.*) } : text >>>> + .tracepoint_patch_sites ALIGN(8) : >>>> AT(ADDR(.tracepoint_patch_sites) - OSV_KERNEL_VM_SHIFT) { >>>> __tracepoint_patch_sites_start = .; >>>> *(.tracepoint_patch_sites) >>>> __tracepoint_patch_sites_end = .; >>>> } : text >>>> - .data.rel.ro : { *(.data.rel.ro.local* >>>> .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* >>>> .gnu.linkonce.d.rel.ro.*) } : text >>>> - .data : { *(.data) } :text >>>> + .data.rel.ro : AT(ADDR(.data.rel.ro) - OSV_KERNEL_VM_SHIFT) { >>>> *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro >>>> .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) } : text >>>> + .data : AT(ADDR(.data) - 
OSV_KERNEL_VM_SHIFT) { *(.data) } :text >>>> _init_array_start = .; >>>> - .init_array : { >>>> + .init_array : AT(ADDR(.init_array) - OSV_KERNEL_VM_SHIFT) { >>>> *(SORT_BY_INIT_PRIORITY(.init_array.*) >>>> SORT_BY_INIT_PRIORITY(.ctors.*)) >>>> *(.init_array .ctors) >>>> } : text >>>> _init_array_end = .; >>>> . = ALIGN(4096); >>>> - .percpu : { >>>> + .percpu : AT(ADDR(.percpu) - OSV_KERNEL_VM_SHIFT) { >>>> _percpu_start = .; >>>> *(.percpu) >>>> . = ALIGN(4096); >>>> _percpu_end = .; >>>> } >>>> - .percpu_workers : { >>>> + .percpu_workers : AT(ADDR(.percpu_workers) - OSV_KERNEL_VM_SHIFT) { >>>> _percpu_workers_start = .; >>>> *(.percpu_workers) >>>> _percpu_workers_end = .; >>>> } >>>> . = ALIGN(64); >>>> - .tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) } :tls :text >>>> - .tbss : { >>>> + .tdata : AT(ADDR(.tdata) - OSV_KERNEL_VM_SHIFT) { *(.tdata >>>> .tdata.* .gnu.linkonce.td.*) } :tls :text >>>> + .tbss : AT(ADDR(.tbss) - OSV_KERNEL_VM_SHIFT) { >>>> *(.tbss .tbss.* .gnu.linkonce.tb.*) >>>> . = ALIGN(64); >>>> } :tls :text >>>> .tls_template_size = SIZEOF(.tdata) + SIZEOF(.tbss); >>>> - .bss : { *(.bss .bss.*) } :text >>>> + .bss : AT(ADDR(.bss) - OSV_KERNEL_VM_SHIFT) { *(.bss .bss.*) } >>>> :text >>>> . = ALIGN(64); >>>> tcb0 = .; >>>> . = . 
+ .tls_template_size + 256; >>>> @@ -114,4 +121,4 @@ PHDRS { >>>> eh_frame PT_GNU_EH_FRAME; >>>> note PT_NOTE; >>>> } >>>> -ENTRY(vmlinux_entry64); >>>> +ENTRY(low_vmlinux_entry64); >>>> diff --git a/arch/x64/vmlinux-boot64.S b/arch/x64/vmlinux-boot64.S >>>> index 230afd3c..12047513 100644 >>>> --- a/arch/x64/vmlinux-boot64.S >>>> +++ b/arch/x64/vmlinux-boot64.S >>>> @@ -13,7 +13,9 @@ vmlinux_entry64: >>>> mov %rsi, %rdi >>>> >>>> # Load the 64-bit version of the GDT >>>> - lgdt gdt64_desc >>>> + # Because the memory is mapped 1:1 at this point, we have to >>>> manualy >>>> + # subtract OSV_KERNEL_VM_SHIFT from the gdt address >>>> + lgdt gdt64_desc-OSV_KERNEL_VM_SHIFT >>>> >>>> # Setup the stack to switch back to 32-bit mode in order >>>> # to converge with the code that sets up transiton to 64-bit mode >>>> later. >>>> @@ -32,6 +34,6 @@ vmlinux_entry64: >>>> # to start32_from_64 which is where the boot process converges. >>>> subq $8, %rsp >>>> movl $0x18, 4(%rsp) >>>> - movl $start32_from_64, %eax >>>> + movl $start32_from_64-OSV_KERNEL_VM_SHIFT, %eax # Because memory >>>> is mapped 1:1 subtract OSV_KERNEL_VM_SHIFT >>>> movl %eax, (%rsp) >>>> lret >>>> diff --git a/core/elf.cc b/core/elf.cc >>>> index fc2ee0c3..477a0177 100644 >>>> --- a/core/elf.cc >>>> +++ b/core/elf.cc >>>> @@ -1099,7 +1099,7 @@ void create_main_program() >>>> program::program(void* addr) >>>> : _next_alloc(addr) >>>> { >>>> - _core = std::make_shared<memory_image>(*this, >>>> (void*)ELF_IMAGE_START); >>>> + _core = std::make_shared<memory_image>(*this, >>>> (void*)(ELF_IMAGE_START + OSV_KERNEL_VM_SHIFT)); >>>> assert(_core->module_index() == core_module_index); >>>> _core->load_segments(); >>>> set_search_path({"/", "/usr/lib"}); >>>> diff --git a/core/mmu.cc b/core/mmu.cc >>>> index f9294125..75366360 100644 >>>> --- a/core/mmu.cc >>>> +++ b/core/mmu.cc >>>> @@ -91,12 +91,12 @@ phys pte_level_mask(unsigned level) >>>> return ~((phys(1) << shift) - 1); >>>> } >>>> >>>> +static void 
*elf_phys_start = (void*)OSV_KERNEL_BASE; >>>> void* phys_to_virt(phys pa) >>>> { >>>> - // The ELF is mapped 1:1 >>>> void* phys_addr = reinterpret_cast<void*>(pa); >>>> - if ((phys_addr >= elf_start) && (phys_addr < elf_start + >>>> elf_size)) { >>>> - return phys_addr; >>>> + if ((phys_addr >= elf_phys_start) && (phys_addr < elf_phys_start + >>>> elf_size)) { >>>> + return (void*)(phys_addr + OSV_KERNEL_VM_SHIFT); >>>> } >>>> >>>> return phys_mem + pa; >>>> @@ -108,7 +108,7 @@ phys virt_to_phys(void *virt) >>>> { >>>> // The ELF is mapped 1:1 >>>> if ((virt >= elf_start) && (virt < elf_start + elf_size)) { >>>> - return reinterpret_cast<phys>(virt); >>>> + return reinterpret_cast<phys>((void*)(virt - >>>> OSV_KERNEL_VM_SHIFT)); >>>> } >>>> >>>> #if CONF_debug_memory >>>> diff --git a/loader.cc b/loader.cc >>>> index 7d88e609..7ac99ef5 100644 >>>> --- a/loader.cc >>>> +++ b/loader.cc >>>> @@ -102,7 +102,8 @@ void premain() >>>> >>>> arch_init_premain(); >>>> >>>> - auto inittab = elf::get_init(elf_header); >>>> + auto inittab = elf::get_init(reinterpret_cast<elf::Elf64_Ehdr*>( >>>> + (void*)elf_header + OSV_KERNEL_VM_SHIFT)); >>>> >>>> if (inittab.tls.start == nullptr) { >>>> debug_early("premain: failed to get TLS data from ELF\n"); >>>> -- >>>> 2.20.1 >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "OSv Development" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to [email protected]. >>>> To view this discussion on the web visit >>>> https://groups.google.com/d/msgid/osv-dev/20190616050548.7888-1-jwkozaczuk%40gmail.com >>>> . >>>> For more options, visit https://groups.google.com/d/optout. >>>> >>> -- > You received this message because you are subscribed to the Google Groups > "OSv Development" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. 
--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/CANEVyjtrLoXpdY7KLMKCnuFiBWO8ST81yQrc81Uc0kk-wKtiTQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
