Re: Reference count on pages held in secondary MMUs
On Sat, Jun 22, 2019 at 03:11:36PM -0400, Andrea Arcangeli wrote: > Hello Christoffer, > > On Tue, Jun 11, 2019 at 01:51:32PM +0200, Christoffer Dall wrote: > > Sorry, what does this mean? Do you mean that we can either do: > > > > on_vm_s2_fault() { > > page = gup(); > > map_page_in_s2_mmu(); > > put_page(); > > } > > > > mmu_notifier_invalidate() { > > unmap_page_in_s2_mmu(); > > } > > > > or > > > > on_vm_s2_fault() { > > page = gup(); > > map_page_in_s2_mmu(); > > } > > > > mmu_notifier_invalidate() { > > unmap_page_in_s2_mmu(); > > put_page(); > > } > > Yes both work, refcounting always works. > > > > and in fact Jerome also thinks > > > like me that we should eventually optimize away the FOLL_GET and not > > > take the refcount in the first place, > > > > So if I understood the above correct, the next point is that there are > > advantages to avoiding keeping the extra reference on that page, because > > we have problematic race conditions related to set_page_dirty(), and we > > can reduce the problem of race conditions further by not getting a > > reference on the page at all when going GUP as part of a KVM fault? > > You could still keep the extra reference until the > invalidate. > > The set_page_dirty however if you do in the context of the secondary > MMU fault (i.e. atomically with the mapping of the page in the > secondary MMU, with respect of MMU notifier invalidates), it solves > the whole problem with the ->mkwrite/mkclean and then you can keep a > GUP long term pin fully safely already. That is a solution that always > works and becomes guaranteed by design by the MMU notifier not to > interfere with the _current_ writeback code in the filesystem. It also > already provides stable pages. > Ok. > > Can you explain, or provide a pointer to, the root cause of the > > problem with holding a reference on the page and setting it dirty? > > The filesystem/VM doesn't possibly expect set_page_dirty to be called > again after it called page_mkclean. Supposedly a wrprotect fault > should have been generated if somebody tried to write to the page > under writeback, so page_mkwrite should have run again before you > could have called set_page_dirty. > > Instead page_mkclean failed to get rid of the long term GUP obtained > with FOLL_WRITE because it simply can't ask the device to release it > without MMU notifier, so the device can later still call > set_page_dirty despite page_mkclean already run. > I see, I'm now able to link this to recent articles on LWN. > > > but a whole different chapter is > > > dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings > > > after long term GUP pins. So since you're looking into how to handle > > > the page struct in the MMU notifier it's worth mentioning the issues > > > related to set_page_dirty too. > > > > Is there some background info on the "set_page_dirty_lock crash on > > MAP_SHARED" ? I'm having trouble following this without the background. > > Jan Kara leaded the topic explained all the details on this filesystem > issue at the LSF-MM and also last year. > > Which is what makes me think there can't be too many uses cases that > require writback to work while long term GUP pin allow some device to > write to the pages at any given time, if nobody requires this to be > urgently fixed. > > You can find coverage on lwn and on linux-mm. > > > > > > > > > To achieve the cleanest writeback fix to avoid crashes in > > > set_page_dirty_lock on long term secondary MMU mappings that supports > > > MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty > > > before establishing a writable the mapping in the secondary MMU like > > > in the model below. > > > > > > The below solution works also for those secondary MMU that are like a > > > TLB and if there are two concurrent invalidates on the same page > > > invoked at the same time (a potential problem Jerome noticed), you > > > don't know which come out first and you would risk to call > > > set_page_dirty twice, which would be still potentially kernel crashing > > > (even if only a theoretical issue like O_DIRECT). > > > > Why is it problematic to call set_page_dirty() twice? I thought that at > > worst it would only lead to writing out data to disk unnecessarily ? > > According to Jerome, after the first set_page_dirty returns, writeback > could start before the second set_page_dirty has been called. So if > there are additional random later invalidates the next ones shouldn't > call set_page_dirty again. > > The problem is if you call set_page_dirty in the invalidate, you've > also to make sure set_page_dirty is being called only once. > > There can be concurrent invalidates for the same page running at the > same time, while the page fault there is only one that runs atomically > with respect to the mmu notifier invalidates (under whatever lock that > serializes the MMU notifier invalida
Re: Reference count on pages held in secondary MMUs
Hello Christoffer, On Tue, Jun 11, 2019 at 01:51:32PM +0200, Christoffer Dall wrote: > Sorry, what does this mean? Do you mean that we can either do: > > on_vm_s2_fault() { > page = gup(); > map_page_in_s2_mmu(); > put_page(); > } > > mmu_notifier_invalidate() { > unmap_page_in_s2_mmu(); > } > > or > > on_vm_s2_fault() { > page = gup(); > map_page_in_s2_mmu(); > } > > mmu_notifier_invalidate() { > unmap_page_in_s2_mmu(); > put_page(); > } Yes both work, refcounting always works. > > and in fact Jerome also thinks > > like me that we should eventually optimize away the FOLL_GET and not > > take the refcount in the first place, > > So if I understood the above correct, the next point is that there are > advantages to avoiding keeping the extra reference on that page, because > we have problematic race conditions related to set_page_dirty(), and we > can reduce the problem of race conditions further by not getting a > reference on the page at all when going GUP as part of a KVM fault? You could still keep the extra reference until the invalidate. The set_page_dirty however if you do in the context of the secondary MMU fault (i.e. atomically with the mapping of the page in the secondary MMU, with respect of MMU notifier invalidates), it solves the whole problem with the ->mkwrite/mkclean and then you can keep a GUP long term pin fully safely already. That is a solution that always works and becomes guaranteed by design by the MMU notifier not to interfere with the _current_ writeback code in the filesystem. It also already provides stable pages. > Can you explain, or provide a pointer to, the root cause of the > problem with holding a reference on the page and setting it dirty? The filesystem/VM doesn't possibly expect set_page_dirty to be called again after it called page_mkclean. Supposedly a wrprotect fault should have been generated if somebody tried to write to the page under writeback, so page_mkwrite should have run again before you could have called set_page_dirty. Instead page_mkclean failed to get rid of the long term GUP obtained with FOLL_WRITE because it simply can't ask the device to release it without MMU notifier, so the device can later still call set_page_dirty despite page_mkclean already run. > > but a whole different chapter is > > dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings > > after long term GUP pins. So since you're looking into how to handle > > the page struct in the MMU notifier it's worth mentioning the issues > > related to set_page_dirty too. > > Is there some background info on the "set_page_dirty_lock crash on > MAP_SHARED" ? I'm having trouble following this without the background. Jan Kara leaded the topic explained all the details on this filesystem issue at the LSF-MM and also last year. Which is what makes me think there can't be too many uses cases that require writback to work while long term GUP pin allow some device to write to the pages at any given time, if nobody requires this to be urgently fixed. You can find coverage on lwn and on linux-mm. > > > > > To achieve the cleanest writeback fix to avoid crashes in > > set_page_dirty_lock on long term secondary MMU mappings that supports > > MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty > > before establishing a writable the mapping in the secondary MMU like > > in the model below. > > > > The below solution works also for those secondary MMU that are like a > > TLB and if there are two concurrent invalidates on the same page > > invoked at the same time (a potential problem Jerome noticed), you > > don't know which come out first and you would risk to call > > set_page_dirty twice, which would be still potentially kernel crashing > > (even if only a theoretical issue like O_DIRECT). > > Why is it problematic to call set_page_dirty() twice? I thought that at > worst it would only lead to writing out data to disk unnecessarily ? According to Jerome, after the first set_page_dirty returns, writeback could start before the second set_page_dirty has been called. So if there are additional random later invalidates the next ones shouldn't call set_page_dirty again. The problem is if you call set_page_dirty in the invalidate, you've also to make sure set_page_dirty is being called only once. There can be concurrent invalidates for the same page running at the same time, while the page fault there is only one that runs atomically with respect to the mmu notifier invalidates (under whatever lock that serializes the MMU notifier invalidates vs the secondary MMU page fault). If you call set_page_dirty twice in a row, you again open the window for the writeback to have already called ->page_mkclean on the page after the first set_page_dirty, so the second set_page_dirty will then crash. You can enforce to call it only once if you have sptes (shadow pagetables) like in KVM has, so this is not
Re: Reference count on pages held in secondary MMUs
On Sun, Jun 09, 2019 at 01:40:24PM -0400, Andrea Arcangeli wrote: > Hello, > > On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote: > > On 09/06/19 10:18, Christoffer Dall wrote: > > > In some sense, we are thus maintaining a 'hidden', or internal, > > > reference to the page, which is not counted anywhere. > > > > > > I am wondering if it would be equally valid to take a reference on the > > > page, and remove that reference when unmapping via MMU notifiers, and if > > > so, if there would be any advantages/drawbacks in doing so? > > > > If I understand correctly, I think the MMU notifier would not fire if > > you took an actual reference; the page would be pinned in memory and > > could not be swapped out. > > MMU notifiers still fires, the refcount is simple and can be dropped > also in the mmu notifier invalidate Sorry, what does this mean? Do you mean that we can either do: on_vm_s2_fault() { page = gup(); map_page_in_s2_mmu(); put_page(); } mmu_notifier_invalidate() { unmap_page_in_s2_mmu(); } or on_vm_s2_fault() { page = gup(); map_page_in_s2_mmu(); } mmu_notifier_invalidate() { unmap_page_in_s2_mmu(); put_page(); } > and in fact Jerome also thinks > like me that we should eventually optimize away the FOLL_GET and not > take the refcount in the first place, So if I understood the above correct, the next point is that there are advantages to avoiding keeping the extra reference on that page, because we have problematic race conditions related to set_page_dirty(), and we can reduce the problem of race conditions further by not getting a reference on the page at all when going GUP as part of a KVM fault? Can you explain, or provide a pointer to, the root cause of the problem with holding a reference on the page and setting it dirty? > but a whole different chapter is > dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings > after long term GUP pins. So since you're looking into how to handle > the page struct in the MMU notifier it's worth mentioning the issues > related to set_page_dirty too. Is there some background info on the "set_page_dirty_lock crash on MAP_SHARED" ? I'm having trouble following this without the background. > > To achieve the cleanest writeback fix to avoid crashes in > set_page_dirty_lock on long term secondary MMU mappings that supports > MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty > before establishing a writable the mapping in the secondary MMU like > in the model below. > > The below solution works also for those secondary MMU that are like a > TLB and if there are two concurrent invalidates on the same page > invoked at the same time (a potential problem Jerome noticed), you > don't know which come out first and you would risk to call > set_page_dirty twice, which would be still potentially kernel crashing > (even if only a theoretical issue like O_DIRECT). Why is it problematic to call set_page_dirty() twice? I thought that at worst it would only lead to writing out data to disk unnecessarily ? I am also not familiar with a problem related to KVM and O_DIRECT, so I'm having trouble keeping up here as well :( > So the below model > will solve that and it's also valid for KVM/vhost accelleration, > despite KVM can figure out how to issue a single set_page_dirty call > for each spte that gets invalidated by concurrent invalidates on the > same page because it has shadow pagetables and it's not just a TLB. > > access = FOLL_WRITE|FOLL_GET > > repeat: > page = gup(access) > put_page(page) > > spin_lock(mmu_notifier_lock); > if (race with invalidate) { > spin_unlock.. > goto repeat; > } > if (access == FOLL_WRITE) > set_page_dirty(page) > establish writable mapping in secondary MMU on page > spin_unlock > > The above solves the crash in set_page_dirty_lock without having to > modify any filesystem, it should work theoretically safer than the > O_DIRECT short term GUP pin. That is not exactly how we do things today on the arm64 side. We do something that looks like: /* * user_mem_abort is our function for a secondary MMU fault that * resolves to a memslot. */ user_mem_abort() { page = gup(access, &writable); spin_lock(&kvm->mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out; /* run the VM again and see what happens */ if (writable) kvm_set_pfn_dirty(page_to_pfn(page)); stage2_set_pte(); /* establish_writable mapping in secondary MMU on page */ out: spin_unlock(&kvm_mmu_lock); put_page(page); } Should we rework this to address the race you are refering to, and are other architectures already safe against this? > > With regard to KVM this should be enough, but we also look for a crash > avoidance solution for those devices that cannot support the MMU > notifier for short and long term GUP pins. Sorry, can you define
Re: Reference count on pages held in secondary MMUs
On 11/06/19 13:21, Christoffer Dall wrote: >> If I understand correctly, I think the MMU notifier would not fire if >> you took an actual reference; the page would be pinned in memory and >> could not be swapped out. >> > That was my understanding too, but I can't find the code path that would > support this theory. Yeah, as Andrea said you could drop the reference in the invalidate callback too. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: Reference count on pages held in secondary MMUs
On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote: > On 09/06/19 10:18, Christoffer Dall wrote: > > In some sense, we are thus maintaining a 'hidden', or internal, > > reference to the page, which is not counted anywhere. > > > > I am wondering if it would be equally valid to take a reference on the > > page, and remove that reference when unmapping via MMU notifiers, and if > > so, if there would be any advantages/drawbacks in doing so? > > If I understand correctly, I think the MMU notifier would not fire if > you took an actual reference; the page would be pinned in memory and > could not be swapped out. > That was my understanding too, but I can't find the code path that would support this theory. The closest thing I could find was is_page_cache_freeable(), and as far as I'm able to understand that code, that is called (via pageout()) later in shrink_page_list() than try_to_unmap() which fires the MMU notifiers through the rmap code. It is entirely possible that I'm looking at the wrong place and missing something overall though? Thanks, Christoffer ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: Reference count on pages held in secondary MMUs
Hello, On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote: > On 09/06/19 10:18, Christoffer Dall wrote: > > In some sense, we are thus maintaining a 'hidden', or internal, > > reference to the page, which is not counted anywhere. > > > > I am wondering if it would be equally valid to take a reference on the > > page, and remove that reference when unmapping via MMU notifiers, and if > > so, if there would be any advantages/drawbacks in doing so? > > If I understand correctly, I think the MMU notifier would not fire if > you took an actual reference; the page would be pinned in memory and > could not be swapped out. MMU notifiers still fires, the refcount is simple and can be dropped also in the mmu notifier invalidate and in fact Jerome also thinks like me that we should eventually optimize away the FOLL_GET and not take the refcount in the first place, but a whole different chapter is dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings after long term GUP pins. So since you're looking into how to handle the page struct in the MMU notifier it's worth mentioning the issues related to set_page_dirty too. To achieve the cleanest writeback fix to avoid crashes in set_page_dirty_lock on long term secondary MMU mappings that supports MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty before establishing a writable the mapping in the secondary MMU like in the model below. The below solution works also for those secondary MMU that are like a TLB and if there are two concurrent invalidates on the same page invoked at the same time (a potential problem Jerome noticed), you don't know which come out first and you would risk to call set_page_dirty twice, which would be still potentially kernel crashing (even if only a theoretical issue like O_DIRECT). So the below model will solve that and it's also valid for KVM/vhost accelleration, despite KVM can figure out how to issue a single set_page_dirty call for each spte that gets invalidated by concurrent invalidates on the same page because it has shadow pagetables and it's not just a TLB. access = FOLL_WRITE|FOLL_GET repeat: page = gup(access) put_page(page) spin_lock(mmu_notifier_lock); if (race with invalidate) { spin_unlock.. goto repeat; } if (access == FOLL_WRITE) set_page_dirty(page) establish writable mapping in secondary MMU on page spin_unlock The above solves the crash in set_page_dirty_lock without having to modify any filesystem, it should work theoretically safer than the O_DIRECT short term GUP pin. With regard to KVM this should be enough, but we also look for a crash avoidance solution for those devices that cannot support the MMU notifier for short and long term GUP pins. There's lots of work going on on linux-mm, to try to let those devices support writeback in a safe way (also with stable pages so all fs integrity checks will pass) using bounce buffer if a long term GUP pin is detected by the filesystem. In addition there's other work to make the short term GUP pin theoretically safe by delaying the writeback for the short window the GUP pin is taken by O_DIRECT, so it becomes theoretically safe too (currently it's only practically safe). However I'm not sure if the long term GUP pins really needs to support writeback. To do a coherent snapshot without talking to the device so that it stops writing to the whole mapping, one should write protect the memory, but it can't be write protected without MMU notifier support.. The VM already wrprotect the MAP_SHARED pages before writing them out to provide stable pages, but that's just not going to work with a long term GUP pin that mapped the page as writable in the device. To make a practical example: if the memory under long term GUP pin would be a KVM guest physical memory mapped in MAP_SHARED with vfio/iommu device assignment and if the device or iommu can't support the MMU notifier, there's no value in the filesystem being able to flush the guest physical memory to disk periodically, because if the write-able GUP pin can't be dropped first, it's impossible to take a coherent snapshot of the guest physical memory while the device can still write anywhere in the KVM guest physical memory. Whatever gets written by the complex logic that will do the bounce buffers would be just useless for this use case. If the system crashed, you couldn't possibly start a new guest and pretend that whatever got written was a coherent snapshot of the guest physical memory. To take a coherent snapshot KVM needs to tell the device to stop writing, and only then flush the dirty data in the MAP_SHARED to disk, just calling mprotect won't be enough because that won't get rid of the device writable mapping associated with the long term GUP pin. But if it has to do that, it can also tell the device to temporarily drop the iommu mapping, drop all the GUP pins and to re-take the GUP pins and remap the guest in the iommu only after the data already
Re: Reference count on pages held in secondary MMUs
On 09/06/19 10:18, Christoffer Dall wrote: > In some sense, we are thus maintaining a 'hidden', or internal, > reference to the page, which is not counted anywhere. > > I am wondering if it would be equally valid to take a reference on the > page, and remove that reference when unmapping via MMU notifiers, and if > so, if there would be any advantages/drawbacks in doing so? If I understand correctly, I think the MMU notifier would not fire if you took an actual reference; the page would be pinned in memory and could not be swapped out. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Reference count on pages held in secondary MMUs
Hi, I have been looking at how we deal with page_count(page) on pages held in stage 2 page tables on KVM/arm64. What we do currently is to drop the reference on the page we get from get_user_pages() once the page is inserted into our stage 2 page table, typically leaving page_count(page) == page_mapcount(page) == 1 which represents the userspace stage 1 mapping of the page, and we rely on MMU notifiers to remove the stage 2 mapping if corresponding stage 1 mapping is being unmapped. I believe this is analogous to what other architectures do? In some sense, we are thus maintaining a 'hidden', or internal, reference to the page, which is not counted anywhere. I am wondering if it would be equally valid to take a reference on the page, and remove that reference when unmapping via MMU notifiers, and if so, if there would be any advantages/drawbacks in doing so? Thanks, Christoffer ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm