On Fri, Nov 20 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 10:07:56 +1100 NeilBrown wrote:
>> On Wed, Nov 18 2020, Hillf Danton wrote:
>>> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>>>> On Wed, Nov 18 2020, Hillf Danton wrote:
>>>> ...
>>>> I don't think this is a good idea.
>>>
>>> Let me add a few more words.
>>>
>>>> cond_resched() is expected to be called often. Adding all this extra
>>>
>>> They are those only invoked in concurrency-managed worker contexts and
>>> are thus supposed to be less often than thought; what is more the
>>> callers know what they are doing if a schedule() follows up, needless
>>> to say it is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE
>>> given WARN_ON_ONCE(workqueue_mustnt_use_cpu()) added in cond_resched().
>>
>> "supposed to be less often" is the central point here.
>
> No, it is not in any shape, see below.
>
>> Because the facts are that they sometimes happen with high frequency
>> despite what is "supposed" to happen.
>
> Feel free to point me to a couple of such workers. I want to see
> how high it is and why.
The patch should suggest some. Any work item which calls iput() might
find itself in iput_final() and then truncate_inode_pages_range(),
which will call cond_resched() once for every 16 or fewer pages (the
shape of that loop is sketched in the P.S. below). If there are
millions of pages ....

When a reply is received for an async NFS request (e.g. WRITE, but
several others), the processing happens in a workqueue (nfsiod), and
this will often call iput(), but rarely will that lead to
iput_final(). Also, lots of non-workqueue code calls iput(), so
adding code to an inner loop would cost everyone.

Any worker which allocates memory might find itself in
should_reclaim_retry(), which calls cond_resched(). I don't know how
frequently this will fire.

The slab memory allocator uses a system_wq worker to reap a cache. I
don't know exactly what that means, but cache_reap() seems to need to
call cond_resched() periodically. Maybe it should use a
WQ_CPU_INTENSIVE workqueue, but there isn't a system_cpu_wq.... Nor
can it use system_unbound_wq, as it is doing per-CPU work.

>
>> Either the assumption that CM-workers don't call cond_resched() is
>> wrong, or the code that schedules such workers on CM-queues is wrong.
>>
>> I much prefer the perspective that the assumption is wrong. If that is
>> agreed then we need to handle that circumstance without making
>> cond_resched() more expensive.
>
> This is the central point I think; it is a mile in between what
> you are trying to fix and what you are adding in cond_resched().

My latest patch only adds a WARNING to cond_resched() (also sketched
in the P.S. below), so that we can find problem code before it
becomes a problem. I did previously try adding more to
cond_resched(), and PeterZ didn't like that at all. I agree that
fixing the problem cannot be done in cond_resched(). I think that
finding the scope of the problem is best done by instrumenting
cond_resched() (when DEBUG_KERNEL is selected).

>
>> Note that adding WARN_ON_ONCE() does not make it more expensive as it is
>> only enabled with KERNEL_DEBUG (and WQ_WATCHDOG, though the particular
>> config option could be changed). It isn't needed in production.
>
> Because cond_resched() is not the right place from the beginning
> for debugging like this, something in workqueue's backyard by
> design. It's been there for a while, in production or not.

I don't understand your reasoning. I don't see why one subsystem
cannot provide debugging to help some other subsystem. Many
subsystems add might_sleep(), not to detect bugs in themselves but to
detect bugs in their callers. Adding a WARNING to cond_resched()
helps us find bugs in code that calls cond_resched()...

>>
>> If the workqueue maintainers are unmovable in the position that a
>
> They are open to any good thoughts, yesterday and tomorrow.
>
>> CM-workitem must not use excessive CPU ever, and so must not call
>> cond_resched(), then I can take that back to the NFS maintainers and
>> negotiate different workqueue settings.
>
> That sounds like an easy road to go without either touching
> cond_resched() or adding a couple of lines in workqueue. But
> the rising question is why you are branching to a new direction
> overnight if you think your thoughts are fine to fix an issue
> you observed in wq's property.

I'm branching off because I'm getting push-back and so am trying to
explore the problem space. My first idea was to add WQ_CPU_INTENSIVE
to the nfsiod workqueue, but Trond wondered what was special about
NFS. Many filesystems call iput() from a workqueue, so finding a
solution that helps them all is best.
I then suggested getting cond_resched() to do something more useful
when called by a worker. PeterZ didn't like the overhead. Also, TJ
seemed to be against auto-adjusting for cpu-intensive code,
preferring the right sort of workqueue to be chosen up front. I'm not
really well placed to assess the validity of any of these objections,
so I'm trying to respond to them without completely giving in to any
of them. Hence the "new direction overnight".

As a "user" of workqueues I would much much rather there was only
one, and that it always did the right thing. Maybe I would have to
put up with two, but we currently have system_wq, system_highpri_wq,
system_long_wq, system_unbound_wq, system_freezable_wq,
system_power_efficient_wq and system_freezable_power_efficient_wq,
plus the ability to create your own. It is an embarrassment of
riches, and I really wonder how many people know how to choose the
right one (the P.S. below shows the choice concretely). So I'm not
very keen on "make sure you choose the right type of wq", but if that
really is best, I certainly want automatic help to know when I've
made a harmful choice.

>
>> But as I've said, I think this
>> is requiring the decision to be made in a place that is not well
>> positioned to make it.
>
> I say no to asking NFS to take a pill because WQ got a cold.

That's good to know! I think that means that when a work item happens
to consume a lot of CPU, it needs to stop blocking other work items
and instead share the CPU with them (a hypothetical sketch of this is
at the end of the P.S.).

Thanks,
NeilBrown
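
P.S. A few sketches to make the above concrete. None of these is a
tested patch; the first paraphrases current code, the rest illustrate
ideas.

You asked how high the cond_resched() frequency can get. This is
roughly the shape of the loop in truncate_inode_pages_range()
(mm/truncate.c) that an iput() from a work item can land in --
paraphrased from memory, not quoted verbatim:

	pagevec_init(&pvec);
	index = start;
	while (index < end &&
	       pagevec_lookup_entries(&pvec, mapping, index,
			min(end - index, (pgoff_t)PAGEVEC_SIZE),
			indices)) {
		/* ... lock, unmap and release up to PAGEVEC_SIZE
		 * pages, advancing 'index' past them ... */
		pagevec_release(&pvec);
		cond_resched();	/* i.e. once per 16 or fewer pages;
				 * millions of times for a huge file */
	}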
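
The instrumentation I'm proposing amounts to no more than this. It
assumes the workqueue_mustnt_use_cpu() helper from the patch (which
reports whether current is a concurrency-managed worker that is
expected not to hog the CPU); the exact shape of the real patch
differs, this just shows the idea:

	/* include/linux/sched.h -- sketch, not the actual patch;
	 * ___might_sleep() and _cond_resched() are what mainline
	 * already calls here, the WARN_ON_ONCE() line is the
	 * addition, compiled in only for debug builds. */
	#ifdef CONFIG_DEBUG_KERNEL
	#define cond_resched() ({				\
		WARN_ON_ONCE(workqueue_mustnt_use_cpu());	\
		___might_sleep(__FILE__, __LINE__, 0);		\
		_cond_resched();				\
	})
	#endif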
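
And this is the sort of decision every workqueue user must make up
front today. All illustrative ('work' is some initialized
work_struct), though the queue and flag names are real
(include/linux/workqueue.h):

	/* pick one of the system queues... */
	queue_work(system_wq, &work);		/* CM, short items      */
	queue_work(system_long_wq, &work);	/* CM, may block longer */
	queue_work(system_unbound_wq, &work);	/* not CM, may burn CPU */

	/* ... or roll your own, choosing flags in advance; e.g. my
	 * first idea for nfsiod was (roughly) something like: */
	wq = alloc_workqueue("nfsiod",
			     WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);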
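
Finally, an entirely hypothetical sketch of the behaviour I'm arguing
for. worker_set_flags() and WORKER_CPU_INTENSIVE are real workqueue
internals (kernel/workqueue.c), but the wq_worker_tick() hook and the
ran_too_long() test are inventions for illustration:

	/* kernel/workqueue.c -- hypothetical: called periodically
	 * for a task known to be a workqueue worker (PF_WQ_WORKER). */
	void wq_worker_tick(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);

		/*
		 * If a concurrency-managed worker has monopolised the
		 * CPU for too long, stop it blocking other work items:
		 * mark it CPU_INTENSIVE so the pool may start another
		 * worker to service the rest of the queue.
		 */
		if (!(worker->flags & WORKER_CPU_INTENSIVE) &&
		    ran_too_long(worker))
			worker_set_flags(worker, WORKER_CPU_INTENSIVE);
	}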