On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
>> On 21.08.2014 18:03, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
>>> <[email protected]> wrote:
>>>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>>>>> <[email protected]> wrote:
>>>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader
>>>>>>> <[email protected]> wrote:
>>>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>>>> Unfortunately I have not yet figured out why this happens, but can
>>>>>>>>>> confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set
>>>>>>>>>> that without KASLR all is ok, but with it enabled there are issues
>>>>>>>>>> (actually a dom0 does not even boot as a follow up error).
>>>>>>>>>>
>>>>>>>>>> Details can be seen in [1] but basically this is always some portion
>>>>>>>>>> of a vmalloc allocation failing after hitting a freshly allocated PTE
>>>>>>>>>> space not being PTE_NONE (usually from a module load triggered by
>>>>>>>>>> systemd-udevd). In the non-dom0 case this repeats many times but ends
>>>>>>>>>> in a guest that allows login. In the dom0 case there is a more fatal
>>>>>>>>>> error at some point causing a crash.
>>>>>>>>>>
>>>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also
>>>>>>>>>> does not help to add "nokaslr" to the kernel command-line.
>>>>>>>>>
>>>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>>>> reserved for Xen? What the the VA that fails?
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>> Yeah, there is some code to avoid some regions of memory (like
>>>>>>>> initrd). Maybe missing p2m tables? I probably need to add debugging to
>>>>>>>> find the failing VA (iow not sure whether it might be somewhere in the
>>>>>>>> stacktraces in the report).
>>>>>>>>
>>>>>>>> The kernel-command line does not seem to be looked at. It should put
>>>>>>>> something into dmesg and that never shows up. Also today's random
>>>>>>>> feature is other PV guests crashing after a bit somewhere in the
>>>>>>>> check_for_corruption area...
>>>>>>>
>>>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>>>> need to locate and skip them.
>>>>>>>
>>>>>>> -Kees
>>>>>>>
>>>>>> Making my little steps towards more understanding I figured out that it
>>>>>> isn't the code that does the relocation. Even with that completely
>>>>>> disabled there were the vmalloc issues. What causes it seems to be the
>>>>>> default of the upper limit and that this changes the split between
>>>>>> kernel and modules to 1G+1G instead of 512M+1.5G. That is the reason why
>>>>>> nokaslr has no effect.
>>>>>
>>>>> Oh! That's very interesting. There must be some assumption in Xen
>>>>> about the kernel VM layout then?
>>>>
>>>> No. I think most of the changes that look at PTE and PMDs are are all
>>>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>>>> too aggressive
>>>
>>> (Sorry I had to cut our chat short at Kernel Summit!)
>>>
>>> I sounded like there was another region of memory that Xen was setting
>>> aside for page tables? But Stefan's investigation seems to show this
>>> isn't about layout at boot (since the kaslr=0 case means no relocation
>>> is done). Sounds more like the split between kernel and modules area,
>>> so I'm not sure how the memory area after the initrd would be part of
>>> this. What should next steps be, do you think?
>>
>> Maybe layout, but not about placement of the kernel. Basically leaving KASLR
>> enabled but shrink the possible range back to the original kernel/module
>> split is fine as well.
>>
>> I am bouncing between feeling close to understand to being confused. Konrad
>> suggested xen_cleanhighmap being overly aggressive. But maybe its the other
>> way round. The warning that occurs first indicates that PTE that was obtained
>> for some vmalloc mapping is not unused (0) as it is expected. So it feels
>> rather like some cleanup has *not* been done.
>>
>> Let me think aloud a bit... What seems to cause this, is the change of the
>> kernel/module split from 512M:1.5G to 1G:1G (not exactly since there is 8M
>> vsyscalls and 2M hole at the end). Which in vaddr terms means:
>>
>> Before:
>> ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0
>> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
>>
>> After:
>> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
>> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
>>
>> Now, *if* I got this right, this means the kernel starts on a vaddr that is
>> pointed at by:
>>
>> PGD[510]->PUD[510]->PMD[0]->PTE[0]
>>
>> In the old layout the module vaddr area would start in the same PUD area, but
>> with the change the kernel would cover PUD[510] and the module vaddr +
>> vsyscalls and the hole would cover PUD[511].
>
> I think there is a fixmap there too?
Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it.
So fixmap seems to be in the 2M space before the vsyscalls.
Btw, apparently I got the PGD index wrong. It is of course 511, not 510.
init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..255]->kernel
                                                               [256..511]->mod
                                       [511]->level2_fixmap_pgt[0..505]->mod
                                                               [506]->fixmap
                                                               [507..510]->vsysc
                                                               [511]->hole
With the change, level2_kernel_pgt is covered completely by the kernel alone and
the module space starts at level2_fixmap_pgt[0].
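
A minimal sketch to double-check the indices and sizes above, assuming the usual x86_64 4-level paging split (9 index bits per level, 2M per PMD entry, 1G per PUD entry). The helper names `pt_indices` and `size_mb` are made up for this illustration and do not exist in the kernel.

```python
# Index arithmetic for x86_64 4-level paging (assumed: 9 bits per level).
PMD_SHIFT = 21   # one PMD entry maps 2M
PUD_SHIFT = 30   # one PUD entry maps 1G
PGD_SHIFT = 39   # one PGD entry maps 512G
INDEX_MASK = 0x1ff


def pt_indices(vaddr):
    """Return (pgd, pud, pmd) table indices for a virtual address."""
    return ((vaddr >> PGD_SHIFT) & INDEX_MASK,
            (vaddr >> PUD_SHIFT) & INDEX_MASK,
            (vaddr >> PMD_SHIFT) & INDEX_MASK)


def size_mb(start, end_inclusive):
    """Size in MB of the inclusive range [start, end_inclusive]."""
    return (end_inclusive - start + 1) >> 20


# Addresses taken from the before/after layout quoted in this thread.
for label, vaddr in (("kernel start", 0xffffffff80000000),
                     ("old MODULES_VADDR", 0xffffffffa0000000),
                     ("new MODULES_VADDR", 0xffffffffc0000000),
                     ("vsyscall start", 0xffffffffff600000)):
    print(label, hex(vaddr), pt_indices(vaddr))

print("old module space:", size_mb(0xffffffffa0000000, 0xffffffffff5fffff), "MB")
print("new module space:", size_mb(0xffffffffc0000000, 0xffffffffff5fffff), "MB")
```

The old module start shares PUD slot 510 with the kernel (PMD index 256), while the new one is the first PMD entry under PUD slot 511.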
>>
>> xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a
>> bit since I am not sure I understand enough details) I believe is the one PMD
>> pointed at by PGD[510]->PUD[510]. That could mean that before the change
>
> That sounds right.
>
> I don't know if you saw:
>
> #ifdef DEBUG
>         /* This is superflous and is not neccessary, but you know what
>          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of
>          * anything at this stage. */
>         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
> #endif
> }

I saw that, but it would have no effect even if it ran, because
xen_cleanhighmap clamps the PMDs it walks over to the level2_kernel_pgt page,
and MODULES_VADDR is now mapped only from level2_fixmap_pgt.
Even with the old layout it might do less than anticipated, as it would cover
only 512M and then stop. But I think it really does not matter.
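
To illustrate the clamping argument, here is a small model of the description above. It is not the kernel code: it simply assumes the walk can only reach the 512 PMD slots of level2_kernel_pgt (the 1G starting at ffffffff80000000), and the function name `pmds_in_kernel_table` is hypothetical.

```python
PMD_SIZE = 2 << 20               # one PMD entry maps 2M
PUD_SIZE = 1 << 30               # one PMD table (512 entries) maps 1G
KERNEL_MAP = 0xffffffff80000000  # first vaddr mapped through level2_kernel_pgt


def roundup(x, y):
    """Round x up to the next multiple of y (x itself if already aligned)."""
    return (x + y - 1) // y * y


def pmds_in_kernel_table(vaddr, vaddr_end):
    """Count PMD slots of [vaddr, vaddr_end] that lie inside level2_kernel_pgt.

    Assumption for this sketch: anything outside the 1G window starting at
    KERNEL_MAP is simply never visited.
    """
    lo = max(vaddr, KERNEL_MAP)
    hi = min(vaddr_end, KERNEL_MAP + PUD_SIZE - 1)
    if hi < lo:
        return 0
    return (hi - lo) // PMD_SIZE + 1


# Same arguments as the DEBUG call quoted above, for both layouts.
for name, modules_vaddr in (("old", 0xffffffffa0000000),
                            ("new", 0xffffffffc0000000)):
    end = roundup(modules_vaddr, PUD_SIZE) - 1
    n = pmds_in_kernel_table(modules_vaddr, end)
    print(f"{name}: {n} PMD entries ({n * PMD_SIZE >> 20}M) reachable")
```

Under this model the old layout reaches only the first 512M of the module space, and the new layout reaches none of it, since the new MODULES_VADDR is already PUD-aligned and lies outside the table.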
>
> Which was me being a bit paranoid and figured it might help in troubleshooting.
> If you disable that does it work?
>
>> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space
>> but not after the change. Maybe that also means it always should have covered
>> more but this would not be observed as long as modules would not claim more
>> than 512M? I still need to check the vaddr ranges for which xen_cleanhighmap
>> is actually called. The modules vaddr space would normally not be touched
>> (only with DEBUG set). I moved that to be unconditionally done but then this
>> might be of no use when it needs to cover a different PMD...
>
> What does the toolstack say in regards to allocating the memory? It is pretty
> verbose (domainloginfo..something) in printing out the vaddr of where
> it stashes the kernel, ramdisk, P2M, and the pagetables (which of course
> need to fit all within the 512MB, now 1GB area).
That is taken from starting a 2G PV domU with pvgrub (not pygrub):

Xen Minimal OS!
  start_info: 0xd90000(VA)
    nr_pages: 0x80000
  shared_inf: 0xdfe92000(MA)
     pt_base: 0xd93000(VA)
nr_pt_frames: 0xb
    mfn_list: 0x990000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line:
       stack: 0x94f860-0x96f860
MM: Init
      _text: 0x0(VA)
     _etext: 0x6000d(VA)
   _erodata: 0x78000(VA)
     _edata: 0x80b00(VA)
stack start: 0x94f860(VA)
       _end: 0x98fe68(VA)
  start_pfn: da1
    max_pfn: 80000
Mapping memory range 0x1000000 - 0x80000000
setting 0x0-0x78000 readonly

For a moment I was puzzled by the use of max_pfn_mapped in the generic
cleanup_highmap function of 64bit x86. It limits the cleanup to the start of the
mfn_list. And the max_pfn_mapped value changes soon after to reflect the total
amount of memory of the guest.

Making a copy of the value showed it to be around 51M at the time of cleanup.
That initially looks suspect, but Xen has already replaced the page tables. The
compile-time variants would have 2M large pages on the whole level2_kernel_pgt
range. But as far as I can see, the Xen-provided ones don't put in mappings for
anything beyond the provided boot stack, which is cleaned in xen_cleanhighmap.

So not much further... but then I think I know what to do next. Probably should
have done it before. I'll replace the WARN_ON in vmalloc that triggers with a
panic and at least get a crash dump of that situation when it occurs. Then I can
dig in there with crash (really should have thought of that before)...

-Stefan
>
>>
>> Really not sure here. But maybe a starter for others...
>>
>> -Stefan
>>
>>>
>>> -Kees
>>>
>>>
>>>>>
>>>>> -Kees
>>>>>
>>>>> --
>>>>> Kees Cook
>>>>> Chrome OS Security
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> [email protected]
>>>>> http://lists.xen.org/xen-devel
>>>
>>>
>>>
>>
>>
>
>

