On Tue, Mar 24, 2015 at 1:48 AM, Tamas K Lengyel <tkleng...@sec.in.tum.de>
wrote:

>
>
> On Tue, Mar 24, 2015 at 4:54 AM, Andres Lagar Cavilla <
> and...@lagarcavilla.org> wrote:
>
>>
>>
>> On Mon, Mar 23, 2015 at 11:25 AM, Tamas K Lengyel <
>> tkleng...@sec.in.tum.de> wrote:
>>
>>> On Mon, Mar 23, 2015 at 6:59 PM, Andres Lagar Cavilla <
>>> and...@lagarcavilla.org> wrote:
>>>
>>>> On Mon, Mar 23, 2015 at 9:10 AM, Tamas K Lengyel <
>>>> tkleng...@sec.in.tum.de> wrote:
>>>>
>>>>> Hello everyone,
>>>>> I'm trying to chase down a bug that reproducibly crashes Xen (tested
>>>>> with 4.4.1). The problem is somewhere within the mem-sharing subsystem and
>>>>> how that interacts with domains that are being actively saved. In my setup
>>>>> I use the xl toolstack to rapidly create clones of HVM domains by piping
>>>>> "xl save -c" into xl restore with a modified domain config which updates
>>>>> the name/disk/vif. However, during such an operation Xen crashes with the
>>>>> following log if there are already active clones.
>>>>>
>>>>> IMHO there should be no conflict between saving the domain and
>>>>> memsharing: as long as the domain is just being checkpointed ("-c"),
>>>>> its memory should remain as is. This is, however, clearly not the case.
>>>>> Any ideas?
>>>>>
>>>>
>>>> Tamas, I'm not clear on the use of memsharing in this workflow. As
>>>> described, you pipe save into restore, but the internal magic is lost on
>>>> me. Are you fanning out to multiple restores? That would seem to be the
>>>> case, given the need to update name/disk/vif.
>>>>
>>>> Anyway, I'm inferring. Instead, could you elaborate?
>>>>
>>>> Thanks
>>>> Andre
>>>>
>>>
>>> Hi Andre,
>>> thanks for getting back on this issue. The script I'm using is at
>>> https://github.com/tklengyel/drakvuf/blob/master/tools/clone.pl. The
>>> script simply creates a FIFO pipe (mkfifo) and saves the domain into that
>>> pipe which is immediately read by xl restore with the updated configuration
>>> file. This is mainly just to eliminate having to read the memory dump from
>>> disk. That part of the system works as expected and multiple save/restores
>>> running at the same time don't cause any side-effects. Once the domain has
>>> thus been cloned, I run memshare on every page which also works as
>>> expected. The problem only occurs when the cloning procedure runs while a
>>> page-unshare operation kicks in on an already active clone (as you see in
>>> the log).
>>>
>>
>> Sorry Tamas, I'm a bit slow here. I looked at your script -- it looks
>> alright, but there is no mention of memsharing in there.
>>
>> Re-reading ... memsharing? memshare? Is this memshrtool in tools/testing?
>> How are you running it?
>>
>
>
> Hi Andre,
> the memsharing happens here
> https://github.com/tklengyel/drakvuf/blob/master/src/main.c#L144 after
> the clone script has finished. This is effectively the same approach as in
> tools/testing, just automatically looping from 0 to max_gpfn. Afterwards,
> all unsharing happens automatically, either induced by the guest itself or
> when I map pages into my app with xc_map_foreign_range and PROT_WRITE.
>
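
For reference, the per-gfn sharing loop described above presumably boils
down to something like the following minimal sketch on top of the libxc
sharing calls; the domain IDs, the max_gpfn bound and the error handling
here are illustrative assumptions, not the actual drakvuf code:

/* Deduplicate every gfn of a freshly restored clone against its parent.
 * Sharing presumably has to be enabled on both domains beforehand via
 * xc_memshr_control(xch, domid, 1). */
#include <stdio.h>
#include <xenctrl.h>

static int share_all(xc_interface *xch, uint32_t parent, uint32_t clone,
                     unsigned long max_gpfn)
{
    unsigned long gfn;
    uint64_t parent_handle, clone_handle;

    for ( gfn = 0; gfn <= max_gpfn; gfn++ )
    {
        /* Nomination fails for physmap holes and special pages;
         * simply skip those gfns. */
        if ( xc_memshr_nominate_gfn(xch, parent, gfn, &parent_handle) )
            continue;
        if ( xc_memshr_nominate_gfn(xch, clone, gfn, &clone_handle) )
            continue;

        /* Collapse the two nominated pages into one shared copy. */
        if ( xc_memshr_share_gfns(xch, parent, gfn, parent_handle,
                                  clone, gfn, clone_handle) )
            fprintf(stderr, "share failed at gfn 0x%lx\n", gfn);
    }

    return 0;
}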

Thanks. A couple of observations on your script:
1. Sharing all gfns from zero to max is inefficient. There are non-trivial
holes in the physmap space that you want to jump over. (Holes are not the
cause of the crash.)
2. xc_memshr_add_to_physmap was created exactly for this case. Rather than
deduplicating two pages into one, it grafts a sharing-nominated page
directly onto an otherwise empty p2m entry. Apart from the obvious overhead
reduction benefit, it does not require you to have 2x memory capacity in
order to clone a VM. A sketch of that variant follows below.
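
A minimal sketch of what that variant might look like, assuming the clone
is created with its p2m entries at these gfns still empty (domain IDs and
the gfn bound are again placeholders):

#include <stdio.h>
#include <xenctrl.h>

static int graft_all(xc_interface *xch, uint32_t parent, uint32_t clone,
                     unsigned long max_gpfn)
{
    unsigned long gfn;
    uint64_t handle;

    for ( gfn = 0; gfn <= max_gpfn; gfn++ )
    {
        /* Nomination fails for holes and special pages, so the loop
         * naturally skips them (observation 1). */
        if ( xc_memshr_nominate_gfn(xch, parent, gfn, &handle) )
            continue;

        /* Graft the nominated parent page onto the clone's empty p2m
         * entry at the same gfn -- no second copy is ever allocated. */
        if ( xc_memshr_add_to_physmap(xch, parent, gfn, handle,
                                      clone, gfn) )
            fprintf(stderr, "add_to_physmap failed at gfn 0x%lx\n", gfn);
    }

    return 0;
}

Since only the parent's pages are ever populated this way, the clone needs
essentially no additional memory beyond whatever it later unshares.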


>
>
>>
>> Certainly no Xen crash should happen from user-space input. I'm just
>> trying to understand what you're doing. The unshare code is not, uhmm,
>> brief, so a NULL deref could happen in half a dozen places at first glance.
>>
>
> Well, let me know what I can do to help trace it down. I don't think
> (potentially buggy) userspace tools should crash Xen either =)
>

From the crash, a writable foreign map (qemu -- assuming you run your
memshare tool strictly after xl restore has finished) is triggering the
unshare NULL deref. My main suspicion is the rmap becoming racy. I would
liberally sprinkle printks, retry, and see how far the printks say you got.
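
For concreteness, the kind of foreign mapping that forces the unshare is
roughly the following minimal sketch (the domain ID and gfn are
placeholders; any writable map of a currently shared gfn should exercise
the same path):

#include <sys/mman.h>
#include <xenctrl.h>

static int touch_shared_page(xc_interface *xch, uint32_t clone,
                             unsigned long gfn)
{
    /* A PROT_WRITE foreign map of a shared gfn makes Xen break the
     * sharing (copy-on-write) before handing the mapping back, which
     * is the unshare path where the NULL deref is suspected. */
    void *page = xc_map_foreign_range(xch, clone, XC_PAGE_SIZE,
                                      PROT_READ | PROT_WRITE, gfn);
    if ( !page )
        return -1;

    ((unsigned char *)page)[0] ^= 0xff;   /* dirty the now-private copy */
    munmap(page, XC_PAGE_SIZE);

    return 0;
}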

Andres


>
> Tamas
>
>
>>
>> Thanks
>> Andres
>>
>
>
>