Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand <da...@redhat.com> wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
>         https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fear were side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directlly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward why I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.

Reply via email to