On 11.11.2015 at 18:30, Andrea Arcangeli wrote:
> Hi Jason,
>
> On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
>> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
>> hugepage but hugepage_madvise() takes the error path when we ask to turn
>> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
>
> I wonder why KVM disables transparent hugepages on s390. It sounds
> weird to disable transparent hugepages with KVM. In fact on x86 we
> call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> guest physical memory, even if the transparent_hugepage/enabled ==
> madvise.
>
>> new postcopy migration feature to fail on s390 because its first action is
>> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
>> code so that the operation succeeds without error now.
>
> The other way is to change qemu to keep track it already called
> MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> opinion on this, I think it's ok to return 0 but it's a visible change
> to userland, I can't imagine it to break anything though. It sounds
> very unlikely that an app could error out if it notices the kernel
> doesn't error out on the second call of MADV_NOHUGEPAGE.
>
> Glad to hear KVM postcopy live migration is already running on s390 too.
Sometimes... we have some issues with userfaultfd, which we are currently addressing.
One place is interesting: the kvm code might have to call fixup_user_fault
for a guest address (to map the page writable). Right now we do not pass
FAULT_FLAG_ALLOW_RETRY, which can trigger a warning like
[ 119.414573] FAULT_FLAG_ALLOW_RETRY missing 1
[ 119.414577] CPU: 42 PID: 12853 Comm: qemu-system-s39 Not tainted 4.3.0+ #315
[ 119.414579] 000000011c4579b8 000000011c457a48 0000000000000002 0000000000000000
              000000011c457ae8 000000011c457a60 000000011c457a60 0000000000113e26
              00000000000002cf 00000000009feef8 0000000000a1e054 000000000000000b
              000000011c457aa8 000000011c457a48 0000000000000000 0000000000000000
              0000000000000000 0000000000113e26 000000011c457a48 000000011c457aa8
[ 119.414590] Call Trace:
[ 119.414596] ([<0000000000113d16>] show_trace+0xf6/0x148)
[ 119.414598] [<0000000000113dda>] show_stack+0x72/0xf0
[ 119.414600] [<0000000000551b9e>] dump_stack+0x6e/0x90
[ 119.414605] [<000000000032d168>] handle_userfault+0xe0/0x448
[ 119.414609] [<000000000029a2d4>] handle_mm_fault+0x16e4/0x1798
[ 119.414611] [<00000000002930be>] fixup_user_fault+0x86/0x118
[ 119.414614] [<0000000000126bb8>] gmap_ipte_notify+0xa0/0x170
[ 119.414617] [<000000000013ae90>] kvm_arch_vcpu_ioctl_run+0x448/0xc58
[ 119.414619] [<000000000012e4dc>] kvm_vcpu_ioctl+0x37c/0x668
[ 119.414622] [<00000000002eba68>] do_vfs_ioctl+0x3a8/0x508
[ 119.414624] [<00000000002ebc6c>] SyS_ioctl+0xa4/0xb8
[ 119.414627] [<0000000000815c56>] system_call+0xd6/0x264
[ 119.414629] [<000003ff9628721a>] 0x3ff9628721a
I think we can rework this to use something that sets FAULT_FLAG_ALLOW_RETRY,
but this raises the question of whether a futex operation on userfault-backed
memory would also be broken. As far as I can tell, the futex code also calls
fixup_user_fault without FAULT_FLAG_ALLOW_RETRY.
Christian