2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>: > > > On Sat, 19 Jan 2008, Anton Salikhmetov wrote: > > > > The page_check_address() function is called from the > > page_mkclean_one() routine as follows: > > .. and the page_mkclean_one() function is totally different. > > Lookie here, this is the correct and complex sequence: > > > entry = ptep_clear_flush(vma, address, pte); > > entry = pte_wrprotect(entry); > > entry = pte_mkclean(entry); > > set_pte_at(mm, address, pte, entry); > > That's a rather expensive sequence, but it's done exactly because it has > to be done that way. What it does is to > > - *atomically* load the pte entry _and_ clear the old one in memory. > > That's the > > entry = ptep_clear_flush(vma, address, pte); > > thing, and it basically means that it's doing some > architecture-specific magic to make sure that another CPU that accesses > the PTE at the same time will never actually modify the pte (because > it's clear and not valid) > > - it then - while the page table is actually clear and invalid - takes > the old value and turns it into the new one: > > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > > - and finally, it replaces the entry with the new one: > > set_pte_at(mm, address, pte, entry); > > which takes care to write the new entry in some specific way that is > atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table > entry it writes the high word first, see the write barriers in > "native_set_pte()" in include/asm-x86/pgtable-3level.h > > Now, compare that subtle and correct thing with what is *not* correct: > > if (pte_dirty(*pte) && pte_write(*pte)) > *pte = pte_wrprotect(*pte); > > which makes no effort at all to make sure that it's safe in case another > CPU updates the accessed bit. > > Now, arguably it's unlikely to cause horrible problems at least on x86, > because: > > - we only do this if the pte is already marked dirty, so while we can > lose the accessed bit, we can *not* lose the dirty bit. And the > accessed bit isn't such a big deal. > > - it's not doing any of the "be careful about" ordering things, but since > the really important bits aren't changing, ordering probably won't > practically matter. > > But the problem is that we have something like 24 different architectures, > it's hard to make sure that none of them have issues. > > In other words: it may well work in practice. But when these things go > subtly wrong, they are *really* nasty to find, and the unsafe sequence is > really not how it's supposed to be done. For example, you don't even flush > the TLB, so even if there are no cross-CPU issues, there's probably going > to be writable entries in the TLB that now don't match the page tables. > > Will it matter? Again, probably impossible to see in practice. But ...
Linus, I am very grateful to you for your extremely clear explanation of the issue I have overlooked! Back to the msync() issue, I'm going to come back with a new design for the bug fix. Thank you once again. Anton > > Linus > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/