On Mon, May 4, 2026 at 2:17 AM Jan Kara <[email protected]> wrote:
>
> On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA; just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I fully agree with you we should verify whether the retry code still brings
> in real-world advantage today with VMA locks. After all the retry logic has
> been introduced in 2010. That being said if there are realistic loads where
> one thread needs VMA write lock while another thread is faulting the VMA,
> then the latencies can be indeed extreme. For example things like cgroup IO
> throttling happen on the IO path and thus can throttle IO of a low-priority
> thread for a long time.

I’m quite sure that swap-in and VMA writes can occur concurrently, and
this is fairly common. For example, Java GC may use mprotect or
userfaultfd on a small portion of a large Java heap while other
portions are still under do_swap_page().

If we start exploring different approaches for anon and file, I agree
I can revisit this on an Android phone if there is a real, serious
case where a file VMA can be written and a page fault occurs at the
same time.

Please note that, as an Android developer, I am particularly cautious
about priority inversion. A recent issue causing severe priority
inversion is zram attempting to support preemption[1]. When a task
performing compression or decompression is migrated to another CPU
and then preempted by other tasks, high-priority tasks waiting on the
mutex may be significantly delayed, impacting user experience.

>
> BTW I'm not sure I quite understand Barry's priority inversion problem
> since I'd expect all threads of a task to generally be treated with the
> same priority...

That is not the case. Maybe these slides[2] and this project[3] can
give you a hint. They aim to standardize things on Linux by learning
from Apple's OS. Basically, tasks are classified into five types:

USER_INTERACTIVE: Requires immediate response.
USER_INITIATED: Tolerates a short delay, but must still respond quickly.
UTILITY: Tolerates long delays, but not prolonged ones.
BACKGROUND: Doesn’t mind prolonged delays.
DEFAULT: System default behavior.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
[3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/

Thanks
Barry
