Re: [PATCH][RT] xfs: Disable preemption when grabbing all icsb counter locks
On Thu, Apr 30, 2015 at 11:32 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> On Thu, 30 Apr 2015 20:07:21 +0200 Peter Zijlstra <pet...@infradead.org> wrote:
>> The irony, this is distinctly non deterministic code you're putting
>> under a RT specific preempt_disable ;-)
>
> I know :-(
>
> Unfortunately, a RT behaving fix would be much more invasive and would
> probably require the help of the xfs folks. For now, this just prevents
> a live lock that can happen and halt the system, where it becomes
> deterministic catastrophe.
>
> -- Steve

Would it work to instead create a lock to replace the
preempt_enable_rt/preempt_disable_rt pair in XFS?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: fix RLIMIT_RTTIME when PI-boosting to RT
ping?

On Wed, Feb 18, 2015 at 4:23 PM, <br...@peloton-tech.com> wrote:
> From: Brian Silverman <br...@peloton-tech.com>
>
> When non-realtime tasks get priority-inheritance boosted to a realtime
> scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
> counter used for checking this (the same one used for SCHED_RR
> timeslices) was not getting reset. This meant that tasks running with a
> non-realtime scheduling class which are repeatedly boosted to a realtime
> one, but never block while they are running realtime, eventually hit the
> timeout without ever running for a time over the limit. This patch
> resets the realtime timeslice counter when un-PI-boosting from an RT to
> a non-RT scheduling class.
>
> I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
> mutex which induces priority boosting and spins while boosted that gets
> killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
> applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
> does happen eventually with PREEMPT_VOLUNTARY kernels.
>
> Signed-off-by: Brian Silverman <br...@peloton-tech.com>
> ---
> I am not subscribed to the list so please CC me on any responses.
>
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87b9814..16ad0ed 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3192,6 +3192,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>         } else {
>                 if (dl_prio(oldprio))
>                         p->dl.dl_boosted = 0;
> +               if (rt_prio(oldprio))
> +                       p->rt.timeout = 0;
>                 p->sched_class = &fair_sched_class;
>         }
>
> --
> 1.7.10.4
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Sat, Jul 5, 2014 at 1:26 PM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Mon, 30 Jun 2014, Austin Schuh wrote:
>> I think I might have an answer for my own question, but I would
>> appreciate someone else to double check.  If list_empty erroneously
>> returns that there is work to do when there isn't work to do, we wake
>> up an extra worker which then goes back to sleep.  Not a big loss.  If
>> list_empty erroneously returns that there isn't work to do when there
>> is, this should only be because someone is modifying the work list.
>> When they finish, as far as I can tell, all callers then check to see
>> if a worker needs to be started up, and start one.
>
> Precisely.

A comment there when you put together a polished patch for inclusion
would be awesome.  I'm assuming that you will put the patch together?
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Tue, Jul 1, 2014 at 12:32 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
>> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>>> Completely untested patch below.

I've tested it and looked it over now, and feel pretty confident in the
patch.  Thanks Thomas!

I've been running this fix with my extra lock steps for a couple days
now on 3 machines, and haven't seen any issues.  On each of the
machines, I've got a CAN card generating about 1 CPU worth of load from
interrupts, and the rest of the cores are tied up putting work on the
work queues with the killer_module.

What next?

Austin
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>> Completely untested patch below.
>
> By chance, I found this in my boot logs.  I'll do some more startup
> testing tomorrow.
>
> Jun 30 19:54:40 vpc5 kernel: [    0.670955] ------------[ cut here ]------------
> Jun 30 19:54:40 vpc5 kernel: [    0.670962] WARNING: CPU: 0 PID: 4 at kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
> Jun 30 19:54:40 vpc5 kernel: [    0.670970] Modules linked in:
> Jun 30 19:54:40 vpc5 kernel: [    0.670973] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
> Jun 30 19:54:40 vpc5 kernel: [    0.670974] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
> Jun 30 19:54:40 vpc5 kernel: [    0.670983]  0009 88040ce75de8 81510faf 0002
> Jun 30 19:54:40 vpc5 kernel: [    0.670985]  88040ce75e28 81042085 0001
> Jun 30 19:54:40 vpc5 kernel: [    0.670987]  81057a60 88042d406900 88042da63fc0 88042da64030
> Jun 30 19:54:40 vpc5 kernel: [    0.670988] Call Trace:
> Jun 30 19:54:40 vpc5 kernel: [    0.670995]  [81510faf] dump_stack+0x4f/0x7c
> Jun 30 19:54:40 vpc5 kernel: [    0.670999]  [81042085] warn_slowpath_common+0x81/0x9c
> Jun 30 19:54:40 vpc5 kernel: [    0.671002]  [81057a60] ? worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [    0.671005]  [810420ba] warn_slowpath_null+0x1a/0x1c
> Jun 30 19:54:40 vpc5 kernel: [    0.671007]  [81057a60] worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [    0.671010]  [8105a0a9] worker_thread+0x1b3/0x22b
> Jun 30 19:54:40 vpc5 kernel: [    0.671013]  [81059ef6] ? rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [    0.671015]  [81059ef6] ? rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [    0.671018]  [8105f7ab] kthread+0xdc/0xe4
> Jun 30 19:54:40 vpc5 kernel: [    0.671022]  [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [    0.671025]  [8151586c] ret_from_fork+0x7c/0xb0
> Jun 30 19:54:40 vpc5 kernel: [    0.671027]  [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [    0.671029] ---[ end trace 0001 ]---

Bug in my extra locking...  Sorry for the noise.

The second diff is a cleaner way of destroying the workers.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..590cc26 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1567,10 +1602,16 @@ static void worker_enter_idle(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
-	    WARN_ON_ONCE(!list_empty(&worker->entry) &&
-			 (worker->hentry.next || worker->hentry.pprev)))
+	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE))
 		return;
+
+	rt_lock_idle_list(pool);
+	if (WARN_ON_ONCE(!list_empty(&worker->entry) &&
+			 (worker->hentry.next || worker->hentry.pprev))) {
+		rt_unlock_idle_list(pool);
+		return;
+	} else {
+		rt_unlock_idle_list(pool);
+	}
 
 	/* can't use worker_set_flags(), also called from start_worker() */
 	worker->flags |= WORKER_IDLE;
@@ -3584,8 +3637,14 @@ static void put_unbound_pool(struct worker_pool *pool)
 	mutex_lock(&pool->manager_mutex);
 	spin_lock_irq(&pool->lock);
 
-	while ((worker = first_worker(pool)))
+	rt_lock_idle_list(pool);
+	while ((worker = first_worker(pool))) {
+		rt_unlock_idle_list(pool);
 		destroy_worker(worker);
+		rt_lock_idle_list(pool);
+	}
+	rt_unlock_idle_list(pool);
+
 	WARN_ON(pool->nr_workers || pool->nr_idle);
 	spin_unlock_irq(&pool->lock);
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> Completely untested patch below.

By chance, I found this in my boot logs.  I'll do some more startup
testing tomorrow.

Jun 30 19:54:40 vpc5 kernel: [    0.670955] ------------[ cut here ]------------
Jun 30 19:54:40 vpc5 kernel: [    0.670962] WARNING: CPU: 0 PID: 4 at kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
Jun 30 19:54:40 vpc5 kernel: [    0.670970] Modules linked in:
Jun 30 19:54:40 vpc5 kernel: [    0.670973] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
Jun 30 19:54:40 vpc5 kernel: [    0.670974] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
Jun 30 19:54:40 vpc5 kernel: [    0.670983]  0009 88040ce75de8 81510faf 0002
Jun 30 19:54:40 vpc5 kernel: [    0.670985]  88040ce75e28 81042085 0001
Jun 30 19:54:40 vpc5 kernel: [    0.670987]  81057a60 88042d406900 88042da63fc0 88042da64030
Jun 30 19:54:40 vpc5 kernel: [    0.670988] Call Trace:
Jun 30 19:54:40 vpc5 kernel: [    0.670995]  [81510faf] dump_stack+0x4f/0x7c
Jun 30 19:54:40 vpc5 kernel: [    0.670999]  [81042085] warn_slowpath_common+0x81/0x9c
Jun 30 19:54:40 vpc5 kernel: [    0.671002]  [81057a60] ? worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [    0.671005]  [810420ba] warn_slowpath_null+0x1a/0x1c
Jun 30 19:54:40 vpc5 kernel: [    0.671007]  [81057a60] worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [    0.671010]  [8105a0a9] worker_thread+0x1b3/0x22b
Jun 30 19:54:40 vpc5 kernel: [    0.671013]  [81059ef6] ? rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [    0.671015]  [81059ef6] ? rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [    0.671018]  [8105f7ab] kthread+0xdc/0xe4
Jun 30 19:54:40 vpc5 kernel: [    0.671022]  [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [    0.671025]  [8151586c] ret_from_fork+0x7c/0xb0
Jun 30 19:54:40 vpc5 kernel: [    0.671027]  [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [    0.671029] ---[ end trace 0001 ]---
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Mon, Jun 30, 2014 at 5:12 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>> On Thu, 26 Jun 2014, Austin Schuh wrote:
>>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>>> out of __schedule to sched_submit_work.  It looks like that changes
>>> the conditions under which wq_worker_sleeping is called.  It used to
>>> be called whenever a task was going to sleep (I think).  It looks like
>>> it is called now if the task is going to sleep, and if the task isn't
>>> blocked on a PI mutex (I think).
>>>
>>> If I try the following experiment
>>>
>>>  static inline void sched_submit_work(struct task_struct *tsk)
>>>  {
>>> +	if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>>> +		wq_worker_sleeping(tsk);
>>> +		return;
>>> +	}
>>>
>>> and then remove the call later in the function, I am able to pass my test.
>>>
>>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>>> while (as I would expect), and it all blows up.
>>
>> Well, that's why the check for !pi_blocked_on is there.
>>
>>> I'm not sure where to go from there.  Any changes to the workpool to
>>> try to fix that will be hard, or could affect latency significantly.
>>
>> Completely untested patch below.
>>
>> Thanks,
>>
>> 	tglx
>>
>> ->
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8749d20..621329a 100644
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index be0ef50..0ca9d97 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>>  		return;
>>
>>  	worker->sleeping = 1;
>> -	spin_lock_irq(&pool->lock);
>> +
>>  	/*
>>  	 * The counterpart of the following dec_and_test, implied mb,
>>  	 * worklist not empty test sequence is in insert_work().
>>  	 * Please read comment there.
>> -	 *
>> -	 * NOT_RUNNING is clear.  This means that we're bound to and
>> -	 * running on the local cpu w/ rq lock held and preemption
>> -	 * disabled, which in turn means that none else could be
>> -	 * manipulating idle_list, so dereferencing idle_list without pool
>> -	 * lock is safe.
>>  	 */
>>  	if (atomic_dec_and_test(&pool->nr_running) &&
>>  	    !list_empty(&pool->worklist)) {
>
> What guarantees that this pool->worklist access is safe?  Preemption
> isn't disabled.

I think I might have an answer for my own question, but I would
appreciate someone else to double check.  If list_empty erroneously
returns that there is work to do when there isn't work to do, we wake
up an extra worker which then goes back to sleep.  Not a big loss.  If
list_empty erroneously returns that there isn't work to do when there
is, this should only be because someone is modifying the work list.
When they finish, as far as I can tell, all callers then check to see
if a worker needs to be started up, and start one.

Austin
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Thu, 26 Jun 2014, Austin Schuh wrote:
>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>> out of __schedule to sched_submit_work.  It looks like that changes
>> the conditions under which wq_worker_sleeping is called.  It used to
>> be called whenever a task was going to sleep (I think).  It looks like
>> it is called now if the task is going to sleep, and if the task isn't
>> blocked on a PI mutex (I think).
>>
>> If I try the following experiment
>>
>>  static inline void sched_submit_work(struct task_struct *tsk)
>>  {
>> +	if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>> +		wq_worker_sleeping(tsk);
>> +		return;
>> +	}
>>
>> and then remove the call later in the function, I am able to pass my test.
>>
>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>> while (as I would expect), and it all blows up.
>
> Well, that's why the check for !pi_blocked_on is there.
>
>> I'm not sure where to go from there.  Any changes to the workpool to
>> try to fix that will be hard, or could affect latency significantly.
>
> Completely untested patch below.
>
> Thanks,
>
> 	tglx
>
> ->
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8749d20..621329a 100644
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index be0ef50..0ca9d97 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>  		return;
>
>  	worker->sleeping = 1;
> -	spin_lock_irq(&pool->lock);
> +
>  	/*
>  	 * The counterpart of the following dec_and_test, implied mb,
>  	 * worklist not empty test sequence is in insert_work().
>  	 * Please read comment there.
> -	 *
> -	 * NOT_RUNNING is clear.  This means that we're bound to and
> -	 * running on the local cpu w/ rq lock held and preemption
> -	 * disabled, which in turn means that none else could be
> -	 * manipulating idle_list, so dereferencing idle_list without pool
> -	 * lock is safe.
>  	 */
>  	if (atomic_dec_and_test(&pool->nr_running) &&
>  	    !list_empty(&pool->worklist)) {

What guarantees that this pool->worklist access is safe?  Preemption
isn't disabled.

I'm seeing some inconsistency when accessing the idle list.  You seem
to only disable preemption when writing to the idle list?  I'm unsure
if I'm missing something, or what.

With that question in mind, I took a stab at adding locking around all
the idle_list calls.  I went through and double checked to make sure
that rt_lock_idle_list and rt_unlock_idle_list surround all idle_list
or entry accesses.  The following are places where I think you should
be locking, but aren't.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..314a098 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -778,8 +805,12 @@ static bool too_many_workers(struct worker_pool *pool)
 	 * nr_idle and idle_list may disagree if idle rebinding is in
 	 * progress.  Never return %true if idle_list is empty.
 	 */
-	if (list_empty(&pool->idle_list))
+	rt_lock_idle_list(pool);
+	if (list_empty(&pool->idle_list)) {
+		rt_unlock_idle_list(pool);
 		return false;
+	}
+	rt_unlock_idle_list(pool);
 
 	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
@@ -788,7 +819,7 @@ static bool too_many_workers(struct worker_pool *pool)
  * Wake up functions.
  */
 
-/* Return the first worker.  Safe with preemption disabled */
+/* Return the first worker.  Preemption must be disabled */
 static struct worker *first_worker(struct worker_pool *pool)
 {
 	if (unlikely(list_empty(&pool->idle_list)))
@@ -1567,10 +1598,16 @@ static void worker_enter_idle(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
-	    WARN_ON_ONCE(!list_empty(&worker->entry) &&
-			 (worker->hentry.next || worker->hentry.pprev)))
-		return;
+	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE))
+		return;
+
+	rt_lock_idle_list(pool);
+	if (WARN_ON_ONCE(!list_empty(&worker->entry))) {
+		rt_unlock_idle_list(pool);
+		if (worker->hentry.next || worker->hentry.pprev)
+			return;
+	} else {
+		rt_unlock_idle_list(pool);
+	}
 
 	/* can't use worker_set_flags(), also called from start_worker() */
 	worker->flags |= WORKER_IDLE;
@@ -1882,7 +1925,9 @@ static void idle_worker_timeout(unsigned long __pool)
 		unsigned long expires;
 
 		/* idle_list is kept in LIFO order, check the last one */
+		rt_lock_idle_list(pool);
 		worker = list_entry(pool->idle_list.prev, struct worker, entry);
+		rt_unlock_idle_list(pool);
 		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
 
 		if (time_before(jiffies, expires))
@@ -2026,7 +2071,9 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
 		struct worker *worker;
 		unsigned long expires;
 
+		rt_lock_idle_list(pool);
 		worker = list_entry(pool->idle_list.prev, struct worker, entry);
+		rt_unlock_idle_list(pool);
 		expires = worker->last_active
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> Completely untested patch below.

By chance, I found this in my boot logs. I'll do some more startup testing tomorrow.

Jun 30 19:54:40 vpc5 kernel: [0.670955] [ cut here ]
Jun 30 19:54:40 vpc5 kernel: [0.670962] WARNING: CPU: 0 PID: 4 at kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
Jun 30 19:54:40 vpc5 kernel: [0.670970] Modules linked in:
Jun 30 19:54:40 vpc5 kernel: [0.670973] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
Jun 30 19:54:40 vpc5 kernel: [0.670974] Hardware name: CompuLab Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
Jun 30 19:54:40 vpc5 kernel: [0.670983] 0009 88040ce75de8 81510faf 0002
Jun 30 19:54:40 vpc5 kernel: [0.670985] 88040ce75e28 81042085 0001
Jun 30 19:54:40 vpc5 kernel: [0.670987] 81057a60 88042d406900 88042da63fc0 88042da64030
Jun 30 19:54:40 vpc5 kernel: [0.670988] Call Trace:
Jun 30 19:54:40 vpc5 kernel: [0.670995] [81510faf] dump_stack+0x4f/0x7c
Jun 30 19:54:40 vpc5 kernel: [0.670999] [81042085] warn_slowpath_common+0x81/0x9c
Jun 30 19:54:40 vpc5 kernel: [0.671002] [81057a60] ? worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671005] [810420ba] warn_slowpath_null+0x1a/0x1c
Jun 30 19:54:40 vpc5 kernel: [0.671007] [81057a60] worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671010] [8105a0a9] worker_thread+0x1b3/0x22b
Jun 30 19:54:40 vpc5 kernel: [0.671013] [81059ef6] ? rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671015] [81059ef6] ? rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671018] [8105f7ab] kthread+0xdc/0xe4
Jun 30 19:54:40 vpc5 kernel: [0.671022] [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671025] [8151586c] ret_from_fork+0x7c/0xb0
Jun 30 19:54:40 vpc5 kernel: [0.671027] [8105f6cf] ? flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671029] ---[ end trace 0001 ]---
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Fri, Jun 27, 2014 at 8:32 PM, Mike Galbraith wrote:
> On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:
>
>> It would be more context switches, but I wonder if we could kick the
>> workqueue logic completely out of the scheduler into a thread. Have
>> the scheduler increment/decrement an atomic pool counter, and wake up
>> the monitoring thread to spawn new threads when needed? That would
>> get rid of the recursive pool lock problem, and should reduce
>> scheduler latency if we would need to spawn a new thread.
>
> I was wondering the same thing, and not only for workqueue, but also the
> plug pulling. It's kind of a wart to have that stuff sitting in the
> heart of the scheduler in the first place, would be nice if it just went
> away. When a task can't help itself, you _could_ wake a proxy to do that
> for you. Trouble is, I can imagine that being a heck of a lot of
> context switches with some loads.. and who's gonna help the helper when
> he blocks while trying to help?
>
> -Mike

For workqueues, as long as the helper doesn't block on a lock which requires the work queue to be freed up, it will eventually become unblocked and make progress. The helper _should_ only need the pool lock, which will wake the helper back up when it is available again. Nothing should go to sleep in an un-recoverable way with the work pool lock held.

To drop the extra context switch, you could have a minimum of 2 worker threads around at all times, and have the management thread start the work and delegate to the next management thread. That thread would then wait for the first thread to block, spawn a new thread, and then start the next piece of work. Definitely a bit more complicated.

Austin
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Fri, Jun 27, 2014 at 11:19 AM, Steven Rostedt wrote: > On Fri, 27 Jun 2014 20:07:54 +0200 > Mike Galbraith wrote: > >> > Why do we need the wakeup? the owner of the lock should wake it up >> > shouldn't it? >> >> True, but that can take ages. > > Can it? If the workqueue is of some higher priority, it should boost > the process that owns the lock. Otherwise it just waits like anything > else does. > > I much rather keep the paradigm of the mainline kernel than to add a > bunch of hacks that can cause more unforeseen side effects that may > cause other issues. > > Remember, this would only be for spinlocks converted into a rtmutex, > not for normal mutex or other sleeps. In mainline, the wake up still > would not happen so why are we waking it up here? > > This seems similar to the BKL crap we had to deal with as well. If we > were going to sleep because we were blocked on a spinlock converted > rtmutex we could not release and retake the BKL because we would end up > blocked on two locks. Instead, we made sure that the spinlock would not > release or take the BKL. It kept with the paradigm of mainline and > worked. Sucked, but it worked. > > -- Steve Sounds like you are arguing that we should disable preemption (or whatever the right mechanism is) while holding the pool lock? Workqueues spin up more threads when work that they are executing blocks. This is done through hooks in the scheduler. This means that we have to acquire the pool lock when work blocks on a lock in order to see if there is more work and whether or not we need to spin up a new thread. It would be more context switches, but I wonder if we could kick the workqueue logic completely out of the scheduler into a thread. Have the scheduler increment/decrement an atomic pool counter, and wake up the monitoring thread to spawn new threads when needed? That would get rid of the recursive pool lock problem, and should reduce scheduler latency if we would need to spawn a new thread. 
Austin
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Thu, Jun 26, 2014 at 3:35 PM, Thomas Gleixner wrote: > On Thu, 26 Jun 2014, Austin Schuh wrote: >> On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger >> wrote: >> > CC'ing RT folks >> > >> > On Wed, May 21, 2014 at 8:23 AM, Austin Schuh >> > wrote: >> >> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh >> >> wrote: >> >>> Hi, >> >>> >> >>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT >> >>> patched kernel. I have currently only triggered it using dpkg. Dave >> >>> Chinner on the XFS mailing list suggested that it was a rt-kernel >> >>> workqueue issue as opposed to a XFS problem after looking at the >> >>> kernel messages. >> >> I've got a 100% reproducible test case that doesn't involve a >> filesystem. I wrote a module that triggers the bug when the device is >> written to, making it easy to enable tracing during the event and >> capture everything. >> >> It looks like rw_semaphores don't trigger wq_worker_sleeping to run >> when work goes to sleep on a rw_semaphore. This only happens with the >> RT patches, not with the mainline kernel. I'm foreseeing a second >> deadlock/bug coming into play shortly. If a task holding the work >> pool spinlock gets preempted, and we need to schedule more work from >> another worker thread which was just blocked by a mutex, we'll then >> end up trying to go to sleep on 2 locks at once. > > I remember vaguely, that I've seen and analyzed that quite some time > ago. I can't page in all the gory details right now, but I have a look > how the related code changed in the last couple of years tomorrow > morning with an awake brain. > > Steven, you did some analysis on that IIRC, or was that just related > to rw_locks? > > Thanks, > > tglx If I'm reading the rt patch correctly, wq_worker_sleeping was moved out of __schedule to sched_submit_work. It looks like that changes the conditions under which wq_worker_sleeping is called. It used to be called whenever a task was going to sleep (I think). 
It looks like it is called now if the task is going to sleep, and if the task isn't blocked on a PI mutex (I think). If I try the following experiment

static inline void sched_submit_work(struct task_struct *tsk)
{
+	if (tsk->state && tsk->flags & PF_WQ_WORKER) {
+		wq_worker_sleeping(tsk);
+		return;
+	}

and then remove the call later in the function, I am able to pass my test. Unfortunately, I then get a recursive pool spinlock BUG_ON after a while (as I would expect), and it all blows up. I'm not sure where to go from there. Any changes to the workpool to try to fix that will be hard, or could affect latency significantly.

Austin
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger wrote: > CC'ing RT folks > > On Wed, May 21, 2014 at 8:23 AM, Austin Schuh wrote: >> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh >> wrote: >>> Hi, >>> >>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT >>> patched kernel. I have currently only triggered it using dpkg. Dave >>> Chinner on the XFS mailing list suggested that it was a rt-kernel >>> workqueue issue as opposed to a XFS problem after looking at the >>> kernel messages. I've got a 100% reproducible test case that doesn't involve a filesystem. I wrote a module that triggers the bug when the device is written to, making it easy to enable tracing during the event and capture everything. It looks like rw_semaphores don't trigger wq_worker_sleeping to run when work goes to sleep on a rw_semaphore. This only happens with the RT patches, not with the mainline kernel. I'm foreseeing a second deadlock/bug coming into play shortly. If a task holding the work pool spinlock gets preempted, and we need to schedule more work from another worker thread which was just blocked by a mutex, we'll then end up trying to go to sleep on 2 locks at once. That is getting a bit deep into the scheduler for me... Any suggestions on how to fix it? 
Austin

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <linux/semaphore.h>

static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);

// Dev name as it appears in /proc/devices
#define DEVICE_NAME "aschuh"

// Major number assigned to our device driver
static int major;

static struct workqueue_struct *lockup_wq1;
static struct workqueue_struct *lockup_wq2;

static struct file_operations fops = {
	.read = device_read,
	.write = device_write,
	.open = device_open,
	.release = device_release
};

static int __init init_killer_module(void)
{
	lockup_wq1 = alloc_workqueue("lockup_wq1", WQ_MEM_RECLAIM, 0);
	if (!lockup_wq1)
		return -ENOMEM;

	lockup_wq2 = alloc_workqueue("lockup_wq2", WQ_MEM_RECLAIM, 0);
	if (!lockup_wq2) {
		destroy_workqueue(lockup_wq1);
		return -ENOMEM;
	}

	major = register_chrdev(0, DEVICE_NAME, &fops);
	if (major < 0) {
		printk(KERN_ALERT "Registering char device failed with %d\n", major);
		destroy_workqueue(lockup_wq1);
		destroy_workqueue(lockup_wq2);
		return major;
	}

	printk(KERN_INFO "'mknod /dev/%s c %d 0'.\n", DEVICE_NAME, major);

	// A non 0 return means init_module failed; module can't be loaded.
	return 0;
}

// Called when a process tries to open the device file.
static int device_open(struct inode *inode, struct file *file)
{
	try_module_get(THIS_MODULE);
	return 0;
}

// Called when a process closes the device file.
static int device_release(struct inode *inode, struct file *file)
{
	// Decrement the usage count, or else once you opened the file,
	// you'll never get rid of the module.
	module_put(THIS_MODULE);
	return 0;
}

static ssize_t device_read(struct file *filp, char *buffer, size_t length, loff_t *offset)
{
	return 0;
}

#if 0
#define SEM_INIT(sem) sema_init(sem, 1)
#define SEM_TYPE struct semaphore
#define SEM_DOWN(sem) down(sem)
#define SEM_UP(sem) up(sem)
#else
#define SEM_INIT(sem) init_rwsem(sem)
#define SEM_TYPE struct rw_semaphore
#define SEM_DOWN(sem) down_write_nested(sem, 0)
#define SEM_UP(sem) up_write(sem)
#endif

struct mywork {
	struct work_struct work;
	int index;
	SEM_TYPE *sem;
};

static void work1(struct work_struct *work)
{
	struct mywork *my_work = container_of(work, struct mywork, work);
	trace_printk("work1 Called with index %d\n", my_work->index);
}

static void work2(struct work_struct *work)
{
	struct mywork *my_work = container_of(work, struct mywork, work);
	trace_printk("work2 Called with index %d\n", my_work->index);
	SEM_DOWN(my_work->sem);
	SEM_UP(my_work->sem);
	trace_printk("work2 Finished with index %d\n", my_work->index);
}

static ssize_t device_write(struct file *filp, const char *buff, size_t len, loff_t *off)
{
	SEM_TYPE write_sem;
	struct mywork my_work1;
	struct mywork my_work2;

	SEM_INIT(&write_sem);

	trace_printk("lockup_wq1 %p lockup_wq2 %p\n", lockup_wq1, lockup_wq2);
	trace_printk("Got a write\n");

	SEM_DOWN(&write_sem);

	my_work1.index = len;
	my_work1.sem = &write_sem;
	INIT_WORK_ONSTACK(&my_work1.work, work1);

	my_work2.index = len;
	my_work2.sem = &write_sem;
	INIT_WORK_ONSTACK(&my_work2.work, work2);

	queue_work(lockup_wq2, &my_work2.work);
	queue_work(lockup_wq1, &my_work1.work);

	flush_work(&my_work1.work);
	destroy_work_on_stack(&my_work1.work);

	SEM_UP(&write_sem);

	flush_work(&my_work2.work);
	destroy_work_on_stack(&my_work2.work
Re: On-stack work item completion race? (was Re: XFS crash?)
On Wed, Jun 25, 2014 at 7:00 AM, Tejun Heo wrote: > > Hello, > > On Tue, Jun 24, 2014 at 08:05:07PM -0700, Austin Schuh wrote: > > > I can see no reason why manual completion would behave differently > > > from flush_work() in this case. > > > > I went looking for a short trace in my original log to show the problem, > > and instead found evidence of the second problem. I still like the shorter > > flush_work call, but that's not my call. > > So, are you saying that the original issue you reported isn't actually > a problem? But didn't you imply that changing the waiting mechanism > fixed a deadlock or was that a false positive? Correct, that was a false positive. Sorry for the noise. > > I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is > > returning 1 in sched_submit_work (kernel/sched/core.c). It looks > > like sched_submit_work is not detecting that the worker task is blocked on > > a mutex. > > The function unplugs the block layer and doesn't have much to do with > workqueue although it has "_work" in its name. Thomas moved + if (tsk->flags & PF_WQ_WORKER) + wq_worker_sleeping(tsk); into sched_submit_work as part of the RT patchset. > > This looks very RT related right now. I see 2 problems from my reading > > (and experimentation). The first is that the second worker isn't getting > > started because tsk_is_pi_blocked is reporting that the task isn't blocked > > on a mutex. The second is that even if another worker needs to be > > scheduled because the original worker is blocked on a mutex, we need the > > pool lock to schedule another worker. The pool lock can be acquired by any > > CPU, and is a spin_lock. If we end up on the slow path for the pool lock, > > we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) > > in task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with > > either problem. > > > > Hopefully I've got all my facts right... Debugging kernel code is a whole > > new world from userspace code. 
> I don't have much idea how RT kernel works either. Can you reproduce
> the issues that you see on mainline?
>
> Thanks.
>
> --
> tejun

I'll see what I can do. Thanks!

Austin
Re: On-stack work item completion race? (was Re: XFS crash?)
[Adding tglx to the cc. Sorry for any double sends] On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo wrote: > Hello, > > On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote: >> start_flush_work() is effectively a special queue_work() >> implementation, so if if it's not safe to call complete() from the >> workqueue as the above patch implies then this code has the same >> problem. >> >> Tejun - is this "do it yourself completion" a known issue w.r.t. >> workqueues? I can't find any documentation that says "don't do >> that" so...? > > It's more complex than using flush_work() but there's nothing > fundamentally wrong with it. A work item is completely unlinked > before its execution starts. It's safe to free the work item once its > work function started, whether through kfree() or returning. > > One difference between flush_work() and manual completion would be > that if the work item gets requeued, flush_work() would wait for the > queued one to finish but given the work item is one-shot this doesn't > make any difference. > > I can see no reason why manual completion would behave differently > from flush_work() in this case. I went looking for a short trace in my original log to show the problem, and instead found evidence of the second problem. I still like the shorter flush_work call, but that's not my call. I did find this comment in the process_one_work function. Sounds like this could be better documented. /* * It is permissible to free the struct work_struct from * inside the function that is called from it, this we need to * take into account for lockdep too. To avoid bogus "held * lock freed" warnings as well as problems when looking into * work->lockdep_map, make a copy and use that here. */ >> As I understand it, what then happens is that the workqueue code >> grabs another kworker thread and runs the next work item in it's >> queue. 
>> IOWs, work items can block, but doing that does not prevent
>> execution of other work items queued on other work queues or even on
>> the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_RECLAIM workqueues. IOW, the
> work items queued on a workqueue must be able to make forward progress
> with single work item if the work items are being depended upon for
> memory reclaim.

I mentioned this to Dave when I initially started this thread, but I'm running an RT-patched kernel. I don't see forward progress. The workqueue is only used in 1 spot (xfs_alloc_wq), and has WQ_MEM_RECLAIM set. I started with a 10,000,000 line trace and pulled out the entries which referenced the workqueue and pool leading up to the lockup.

scp-4110 [001] 1.3 101.184640: workqueue_queue_work: work struct=8803c782d900 function=xfs_bmapi_allocate_worker workqueue=88040c9f5a00 pool=88042dae3fc0 req_cpu=512 cpu=1
scp-4110 [001] 1.3 101.184641: workqueue_activate_work: work struct 8803c782d900
kworker/1:1-89 [001] 1.1 101.184883: workqueue_nr_running: pool=88042dae3fc0 nr_running=1
kworker/1:1-89 [001] 1.. 101.184885: workqueue_execute_start: work struct 8803c782d900: function xfs_bmapi_allocate_worker
irq/44-ahci-275 [001] 1.5 101.185086: workqueue_queue_work: work struct=8800ae3f01e0 function=xfs_end_io workqueue=88040b459a00 pool=88042dae3fc0 req_cpu=512 cpu=1
irq/44-ahci-275 [001] 1.5 101.185088: workqueue_activate_work: work struct 8800ae3f01e0
scp-4110 [001] 1.. 101.187911: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
scp-4110 [001] 1.3 101.187914: workqueue_queue_work: work struct=8803c782d900 function=xfs_bmapi_allocate_worker workqueue=88040c9f5a00 pool=88042dae3fc0 req_cpu=512 cpu=1
scp-4110 [001] 1.3 101.187915: workqueue_activate_work: work struct 8803c782d900
scp-4110 [001] 1.4 101.187926: workqueue_queue_work: work struct=88040a6a01c8 function=blk_delay_work workqueue=88040c9f4a00 pool=88042dae44c0 req_cpu=512 cpu=1
scp-4110 [001] 1.4 101.187926: workqueue_activate_work: work struct 88040a6a01c8
kworker/1:1-89 [001] 1.. 101.187998: workqueue_execute_end: work struct 8803c782d900
kworker/1:1-89 [001] 1.. 101.188000: workqueue_execute_start: work struct 8800ae3f01e0: function xfs_end_io
kworker/1:1-89 [001] 1.. 101.188001: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize

The last work never runs. Hangcheck triggers shortly after.

I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is returning 1 in sched_submit_work (kernel/sched/core.c).
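Tejun's concurrency caveat above is the crux of the trace: with a guaranteed concurrency of one, a work item that blocks waiting on a second item queued behind it on the same pool can never see that item run. A toy single-worker pool reproduces the shape of the lockup — this is a Python sketch of the dependency cycle, not the kernel's workqueue implementation, and all names in it are illustrative:

```python
import queue
import threading

# A pool with a guaranteed concurrency of exactly one, like a
# WQ_MEM_RECLAIM workqueue reduced to its rescuer thread.
jobs = queue.Queue()

def worker():
    while True:
        fn = jobs.get()
        fn()
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

second_done = threading.Event()
outcome = {}

def second():
    second_done.set()

def first():
    # Queue a dependent item on the same pool, then block on it.
    # With a single worker it cannot start until we return, so the
    # wait can only time out -- the analogue of the hung allocation.
    jobs.put(second)
    outcome["starved"] = not second_done.wait(timeout=0.5)

jobs.put(first)
jobs.join()  # second() does run, but only after first() has given up
```

In the real lockup the blocked item never gives up: xfs_bmapi_allocate_worker sits queued behind a worker that is sleeping on the very lock the queued work would release.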
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Wed, May 21, 2014 at 12:30 PM, John Blackwood wrote:
>> Date: Wed, 21 May 2014 03:33:49 -0400
>> From: Richard Weinberger
>> To: Austin Schuh
>> CC: LKML, xfs, rt-users
>> Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT
>>
>> CC'ing RT folks
>>
>> On Wed, May 21, 2014 at 8:23 AM, Austin Schuh wrote:
>>> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh wrote:
>>>> Hi,
>>>>
>>>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>>>> patched kernel. I have currently only triggered it using dpkg. Dave
>>>> Chinner on the XFS mailing list suggested that it was a rt-kernel
>>>> workqueue issue as opposed to a XFS problem after looking at the
>>>> kernel messages.
>>>>
>>>> The only modification to the kernel besides the RT patch is that I
>>>> have applied tglx's "genirq: Sanitize spurious interrupt detection
>>>> of threaded irqs" patch.
>>>
>>> I upgraded to 3.14.3-rt4, and the problem still persists.
>>>
>>> I turned on event tracing and tracked it down further. I'm able to
>>> lock it up by scping a new kernel debian package to /tmp/ on the
>>> machine. scp is locking the inode, and then scheduling
>>> xfs_bmapi_allocate_worker in the work queue. The work then never gets
>>> run. The kworkers then lock up waiting for the inode lock.
>>>
>>> Here are the relevant events from the trace. 8803e9f10288
>>> (blk_delay_work) gets run later on in the trace, but 8803b4c158d0
>>> (xfs_bmapi_allocate_worker) never does. The kernel then warns about
>>> blocked tasks 120 seconds later.
>
> Austin and Richard,
>
> I'm not 100% sure that the patch below will fix your problem, but we
> saw something that sounds pretty familiar to your issue involving the
> nvidia driver and the preempt-rt patch. The nvidia driver uses the
> completion support to create their own driver's notion of an internally
> used semaphore.
> Some tasks were failing to ever wakeup from wait_for_completion() calls
> due to a race in the underlying do_wait_for_common() routine.

Hi John,

Thanks for the suggestion and patch. The issue is that the work never gets run, not that the work finishes but the waiter never gets woken. I applied it anyways to see if it helps, but I still get the lockup.

Thanks,
Austin
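John's patch targets a missed-wakeup race inside do_wait_for_common(). The general failure class behind it is easy to state: a waiter that goes to sleep without re-checking the condition under the lock can miss a completion that has already fired. A small Python illustration of that class — not the kernel code, just the pattern completions exist to prevent:

```python
import threading

cond = threading.Condition()
done = False

def complete():
    global done
    with cond:
        done = True
        cond.notify_all()

complete()  # the completion fires before anyone waits

# Broken waiter: sleeps unconditionally, so the earlier notification
# is simply lost and the wait can only time out.
with cond:
    missed = not cond.wait(timeout=0.2)

# Correct waiter: wait_for() re-checks the predicate under the lock,
# so a completion that already happened is still observed.
with cond:
    observed = cond.wait_for(lambda: done, timeout=0.2)
```

As Austin notes, though, this is a different failure from his: his work item never executes at all, so there is no completion to miss.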
Re: Filesystem lockup with CONFIG_PREEMPT_RT
On Tue, May 13, 2014 at 7:29 PM, Austin Schuh wrote:
> Hi,
>
> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
> patched kernel. I have currently only triggered it using dpkg. Dave
> Chinner on the XFS mailing list suggested that it was a rt-kernel
> workqueue issue as opposed to a XFS problem after looking at the
> kernel messages.
>
> The only modification to the kernel besides the RT patch is that I
> have applied tglx's "genirq: Sanitize spurious interrupt detection of
> threaded irqs" patch.

I upgraded to 3.14.3-rt4, and the problem still persists.

I turned on event tracing and tracked it down further. I'm able to lock it up by scping a new kernel debian package to /tmp/ on the machine. scp is locking the inode, and then scheduling xfs_bmapi_allocate_worker in the work queue. The work then never gets run. The kworkers then lock up waiting for the inode lock.

Here are the relevant events from the trace. 8803e9f10288 (blk_delay_work) gets run later on in the trace, but 8803b4c158d0 (xfs_bmapi_allocate_worker) never does. The kernel then warns about blocked tasks 120 seconds later.

scp-5393 [001] 1.. 148.883383: xfs_writepage: dev 8:5 ino 0xd8e pgoff 0x4a3e000 size 0x16af02b0 offset 0 length 0 delalloc 1 unwritten 0
scp-5393 [001] 1.. 148.883383: xfs_ilock_nowait: dev 8:5 ino 0xd8e flags ILOCK_SHARED caller xfs_map_blocks
scp-5393 [001] 1.. 148.883384: xfs_iunlock: dev 8:5 ino 0xd8e flags ILOCK_SHARED caller xfs_map_blocks
scp-5393 [001] 1.. 148.883386: xfs_log_reserve: dev 8:5 type STRAT_WRITE t_ocnt 2 t_cnt 2 t_curr_res 112332 t_unit_res 112332 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 57 grant_reserve_bytes 9250852 grant_write_cycle 57 grant_write_bytes 9250852 curr_cycle 57 curr_block 18031 tail_cycle 57 tail_block 17993
scp-5393 [001] 1.. 148.883386: xfs_log_reserve_exit: dev 8:5 type STRAT_WRITE t_ocnt 2 t_cnt 2 t_curr_res 112332 t_unit_res 112332 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 57 grant_reserve_bytes 9475516 grant_write_cycle 57 grant_write_bytes 9475516 curr_cycle 57 curr_block 18031 tail_cycle 57 tail_block 17993
scp-5393 [001] 1.. 148.883386: xfs_ilock: dev 8:5 ino 0xd8e flags ILOCK_EXCL caller xfs_iomap_write_allocate
scp-5393 [001] 1.3 148.883390: workqueue_queue_work: work struct=8803b4c158d0 function=xfs_bmapi_allocate_worker workqueue=8803ea85dc00 req_cpu=512 cpu=1
scp-5393 [001] 1.3 148.883390: workqueue_activate_work: work struct 8803b4c158d0
scp-5393 [001] 1.4 148.883396: workqueue_queue_work: work struct=8803e9f10288 function=blk_delay_work workqueue=8803ecf96200 req_cpu=512 cpu=1
scp-5393 [001] 1.4 148.883396: workqueue_activate_work: work struct 8803e9f10288
scp-5393 [001] dN..3.4 148.883399: sched_wakeup: comm=kworker/1:1H pid=573 prio=100 success=1 target_cpu=001
scp-5393 [001] dN..3.4 148.883400: sched_stat_runtime: comm=scp pid=5393 runtime=32683 [ns] vruntime=658862317 [ns]
scp-5393 [001] d...3.4 148.883400: sched_switch: prev_comm=scp prev_pid=5393 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1H next_pid=573 next_prio=100
irq/44-ahci-273 [000] d...3.5 148.883647: sched_wakeup: comm=kworker/0:2 pid=732 prio=120 success=1 target_cpu=000
irq/44-ahci-273 [000] d...3.. 148.883667: sched_switch: prev_comm=irq/44-ahci prev_pid=273 prev_prio=49 prev_state=S ==> next_comm=kworker/0:2 next_pid=732 next_prio=120
kworker/0:2-732 [000] 1.. 148.883674: workqueue_execute_start: work struct 8800aea30cb8: function xfs_end_io
kworker/0:2-732 [000] 1.. 148.883674: xfs_ilock: dev 8:5 ino 0xd8e flags ILOCK_EXCL caller xfs_setfilesize
kworker/0:2-732 [000] d...3.. 148.883677: sched_stat_runtime: comm=kworker/0:2 pid=732 runtime=11392 [ns] vruntime=46907654869 [ns]
kworker/0:2-732 [000] d...3.. 148.883679: sched_switch: prev_comm=kworker/0:2 prev_pid=732 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
kworker/1:1-99 [001] 1.. 148.884208: workqueue_execute_start: work struct 8800aea30c20: function xfs_end_io
kworker/1:1-99 [001] 1.. 148.884208: xfs_ilock: dev 8:5 ino 0xd8e flags ILOCK_EXCL caller xfs_setfilesize
kworker/1:1-99 [001] d...3.. 148.884210: sched_stat_runtime: comm=kworker/1:1 pid=99 runtime=36705 [ns] vruntime=48354424724 [ns]
kworker/1:1-99 [001] d...3.. 148.884211: sched_switch: prev_comm=kworker/1:1 prev_pid=99 prev_prio=120 prev_state=R+ ==> next_comm=swapper/1 next_pid=0 next_prio=120
kworker/u16:4-296 [001] 1.. 151.560358: workqueue_execute_start: work struct 8803e9f10660: function bdi_writeback_workfn
Filesystem lockup with CONFIG_PREEMPT_RT
Hi,

I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT patched kernel. I have currently only triggered it using dpkg. Dave Chinner on the XFS mailing list suggested that it was a rt-kernel workqueue issue as opposed to a XFS problem after looking at the kernel messages.

$ uname -a
Linux vpc5 3.10.24-rt22abs #15 SMP PREEMPT RT Tue May 13 14:42:22 PDT 2014 x86_64 GNU/Linux

The only modification to the kernel besides the RT patch is that I have applied tglx's "genirq: Sanitize spurious interrupt detection of threaded irqs" patch.

Any ideas on what could be wrong? Is there any information that I can pull before I reboot the machine that would be useful? I have the output of triggering sysrq with l and t if that would be useful. Attached is the kernel blocked task message output.

Thanks,
Austin

vpc5_xfs_lockup_locks
Description: Binary data
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh wrote:
> On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner wrote:
>> On Mon, 7 Apr 2014, Austin Schuh wrote:
>>> You originally sent the patch out. I could send your patch out back
>>> to you, but that feels a bit weird ;)
>>
>> Wheee. Let me dig in my archives
>
> https://lkml.org/lkml/2013/3/7/222 in case that helps.

Did you find the patch? I didn't see anything go by (but I'm not on the main mailing list and didn't find anything with a quick Google search.) It would be nice to not need to run a custom kernel to keep my machine running. I have what is probably a year split between 2 machines of runtime with the patch applied, and I haven't seen any problems with it.

Austin
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out. I could send your patch out back
>> to you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives

https://lkml.org/lkml/2013/3/7/222 in case that helps.

Austin
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>
>> Hi Thomas,
>>
>> Did anything come of this patch? Both Oliver and I have found that it
>> fixes real problems. I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend ?
>
> Thanks,
>
> tglx

You originally sent the patch out. I could send your patch out back to you, but that feels a bit weird ;)

Austin
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
Hi Thomas,

Did anything come of this patch? Both Oliver and I have found that it fixes real problems. I have multiple machines which have been running with the patch since December with no ill effects.

Thanks,
Austin

On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp
>
> In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
> disappeared with the discussed patch with the -rt kernel.
>
> The system was running at full CAN bus load over the weekend more than 72
> hours of operation without problems:
>
>            CPU0       CPU1       CPU2       CPU3
>   0:         40          0          0          0   IO-APIC-edge      timer
>   1:          1          0          0          0   IO-APIC-edge      i8042
>   8:          0          0          1          0   IO-APIC-edge      rtc0
>   9:         42         45         45         42   IO-APIC-fasteoi   acpi
>  16:          9          8          8          8   IO-APIC-fasteoi   ahci, ehci_hcd:usb1, can4, can5, can6, can7
>  17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, can10, can11, can9
>  18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, can13, can14, can15
>  19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before having the patch, it lasted 1 minute to 1.5 hours (usually ~3
> minutes) until the irq was killed due to the spurious detection using Linux
> 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
>
> I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for
> two weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below)
> please send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> config-preempt-rt kernel and a SJA1000 CAN card from PEAK. It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message. I posted on the Linux CAN mailing list, and Oliver Hartkopp
>> was able to reproduce the issue only on a realtime kernel. A function
>> trace ending when the IRQ was disabled shows that note_interrupt is
>> being called regularly from the IRQ handler threads, and one of the
>> threads is doing work (and therefore calling note_interrupt with
>> IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem. We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
>> that may still be an over-optimization, the code should still work.
>> All comparisons are done with the bit set, which just makes it a 31
>> bit counter. It will take 8 more days for the counter to overflow on
>> my machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code. If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them
>> quickly. With your new handler code, this would be counted as one
>> single handled interrupt. With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs. I
>> starved my card's irq threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/10 to 3/10, so that problem may not show up
>> in practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh
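Austin's aside about SPURIOUS_DEFERRED living in the top bit of threads_handled_last can be checked with a few lines of arithmetic: with the flag held in bit 31, ordinary increments behave as a 31-bit counter, and the flag is only clobbered when that counter wraps (the "8 more days" overflow he mentions). A quick sketch modelling a 32-bit atomic in plain Python — this is an illustration of the bit layout, not the kernel's spurious-IRQ code:

```python
SPURIOUS_DEFERRED = 1 << 31
MASK32 = (1 << 32) - 1

def inc32(v):
    """Model an increment of a 32-bit counter, with wraparound."""
    return (v + 1) & MASK32

handled = SPURIOUS_DEFERRED | 0   # flag set, count 0
last = handled
for _ in range(10):               # ten threaded-handler completions
    handled = inc32(handled)

progress = handled != last                  # comparison still detects progress
flag_kept = bool(handled & SPURIOUS_DEFERRED)

# Only after 2**31 increments does the carry reach bit 31 and
# silently clear the flag -- the overflow Austin is waiting on.
wrapped = inc32(SPURIOUS_DEFERRED | 0x7fffffff)
flag_lost = not (wrapped & SPURIOUS_DEFERRED)
```

Since both sides of every comparison carry the flag, equality checks work exactly as they would on a 31-bit counter until that wrap occurs.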
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
Hi Thomas,

Did anything come of this patch? Both Oliver and I have found that it
fixes real problems. I have multiple machines which have been running
with the patch since December with no ill effects.

Thanks,
Austin

On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp socket...@hartkopp.net wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp socket...@hartkopp.net
>
> In my setup with a Core i7 and 20 CAN busses (SJA1000 PCIe) the problem
> disappeared with the discussed patch on the -rt kernel. The system was
> running at full CAN bus load over the weekend (more than 72 hours of
> operation) without problems:
>
>             CPU0        CPU1        CPU2        CPU3
>   0:          40           0           0           0  IO-APIC-edge     timer
>   1:           1           0           0           0  IO-APIC-edge     i8042
>   8:           0           0           1           0  IO-APIC-edge     rtc0
>   9:          42          45          45          42  IO-APIC-fasteoi  acpi
>  16:           9           8           8           8  IO-APIC-fasteoi  ahci, ehci_hcd:usb1, can4, can5, can6, can7
>  17:   441468642   443275488   443609061   441436145  IO-APIC-fasteoi  can8, can10, can11, can9
>  18:   441975412   438811422   437317802   441209092  IO-APIC-fasteoi  can12, can13, can14, can15
>  19:   427310388   428661677   429813687   428095739  IO-APIC-fasteoi  can0, can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before having the patch, it took 1 minute to 1.5 hours (usually ~3
> minutes) until the IRQ was killed due to the spurious detection, using
> Linux 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae). I also
> tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
> weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below)
> please send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013? (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> CONFIG_PREEMPT_RT kernel and a SJA1000 CAN card from PEAK. It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message.
>>
>> I posted on the Linux CAN mailing list, and Oliver Hartkopp was able
>> to reproduce the issue only on a realtime kernel. A function trace
>> ending when the IRQ was disabled shows that note_interrupt is being
>> called regularly from the IRQ handler threads, and one of the threads
>> is doing work (and therefore calling note_interrupt with IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem. We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last to hold the SPURIOUS_DEFERRED flag: while that
>> may still be an over-optimization, the code should still work. All
>> comparisons are done with the bit set, which just makes it a 31-bit
>> counter. It will take 8 more days for the counter to overflow on my
>> machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code. If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them all
>> quickly. With your new handler code, this would be counted as one
>> single handled interrupt. With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs. I
>> starved my card's IRQ threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/10 to 3/10, so that problem may not show up in
>> practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh aus...@peloton-tech.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner t...@linutronix.de wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything come of this patch? Both Oliver and I have found that it
>> fixes real problems. I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend?
>
> Thanks,
>
>         tglx

You originally sent the patch out. I could send your patch back to you,
but that feels a bit weird ;)

Austin
Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs
On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner t...@linutronix.de wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out. I could send your patch back to
>> you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives

https://lkml.org/lkml/2013/3/7/222 in case that helps.

Austin