On Tue, Jun 30, 2020 at 11:36 AM Oleg Nesterov <o...@redhat.com> wrote: > > I swear, I too thought about changing wake_page_function() but got lost ;) > > The very fact it returns -1 if test_bit() is true suggests it can do more.
wake functions can indeed do more. You can return -1 to say "stop traversing the list, nobody else on this waitlist is relevant" (and that's separate from the "I did the exclusive one". Returning "0" is a success, but implies it shouldn't count as an exclusive wakeup (generally because "try_to_wake_up()" didn't return 1, so the entry on the list wasn't actually sleeping and might already have passed the critical region). And returning "1" means that you _actually_ woke something up, so that the exclusive counting triggers. So that 0-vs-1 is very much exactly because of the race with "I'm an exclusive waiter, and if an event comes in while I'm _actively_ waiting for it, then I promise I will take it". And the difference is exactly about whether try_to_wake_up() actually changed the state of the process or not. But in addition to these "should I continue and/or count exclusive waiters", wake functions can _do_ things too. The common one is "autoremove", which is basically "if you actually woke something up, then remove it from the list" (so that one generally goes along with a successful try_to_wake_up() and returning 1. Doing that can help scalability enormously, because if you stay on the list and only remove yourself in the final "finish_waikt()", then if there are lots of waiters and lots of wakeup events, the wakups will traverse your entry over and over and over again. So autoremove is generally a good thing. An autoremove() thing can also be used at finish_wait() time to decide whether you were actually woken up by that list or not (like the WQ_FLAG_WOKEN, just implicitly). Remember: you can be on _many_ wait lists, and you can be woken up by non-waitlist events like signals etc too, so you can use WQ_FLAG_WOKEN (if you use woken_wake_function) or the "am I still on the list" (if you use autoremove) to decide that _that_ particular wakeup source triggered. But some code wants to _stay_ on the list, because maybe they want to be notified about multiple things and don't want to re-insert themselves onto the list in between. That's the default behavior if you don't do anything at all and just use the normal wake queue init things. It's not generally the best thing to do, but it _can_ be useful, and I think it's the default one purely for historical reasons - ie that's how they all used to work, and the "autoremove" behavior came in because of the scalability concerns. But the waiting side is the one that decides which wakeup function it wants for the wakeup event. So you can have the "please wake me up and remove me from the wait list when you do". _and_ then add your own logic as well, if you want to. Yes, our wait queues are complicated. They are very powerful, though, and they have been designed to easily avoid races many different ways (and to also very naturally work with waiting for multiple very different events, where select and poll are the obvious cases, but even a single "read()" function _could_ decide to use multiple wait-queues if there's another wait-queue for exceptional circumstances, for example). Linus