Re: [PATCH 2/2] Debug handling of early spurious interrupts
On Tue, 31 Jul 2007 11:25:26 +0900 Fernando Luis Vázquez Cao <[EMAIL PROTECTED]> wrote: > > runtime. Some drivers don't get used by many people and users of some > > architectures (esp embedded) tend to lag kernel.org by a long time. So it > > could be years before all the fallout from this change is finally wrapped > > up. > Yes, that is a big concern. However, the same embedded people is > starting to use both kexec and kdump, so they may suffer the issues we > are trying to weed out anyway, even if these patches are not applied. > The difference is that with this new functionality it is possible to > catch potential problems relatively easily, because any incorrect > behaviour this may cause will be easily reproducible and, in most cases, > will reveal itself early at boot time. > > As things stand now, I guess we will keep seeing occasional crashes and > strange behaviour in kexec-booted kernels, which in some cases will be > due to incorrect handling of spurious interrupts. Besides, such problems > are really difficult to reproduce because, commonly, we would need to > hit an obscure corner case. Please have a think about what we can do to aid this debugging. For example, add an uncondtional printk into request_irq() for now which tells us what irq is being registered - a print_symbol() of the irq_handler_t would be pretty good, although I suspect there are a lot of interrupt handlers with the same name.. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] Debug handling of early spurious interrupts
On Mon, 2007-07-30 at 11:22 -0700, Andrew Morton wrote: > On Mon, 30 Jul 2007 18:58:14 +0900 > Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote: > > > > > > > So bad things might happen because of this change. And if they do, they > > > will take a lng time to be discovered, because non-shared interrupt > > > handlers tend to dwell in crufty old drivers which not many people use. > > > > > > So I'm wondering if it would be more prudent to only do all this for > > > shared > > > handlers? > > > > I have been testing this patches on all my machines, which spans three > > different architectures: i386, x86_64 (EM64T and Opteron), and ia64. > > > > The good news is that nothing catastrophic happened and, in the process, > > I managed to find some somewhat buggy interrupt handlers. > > > > I will present a brief summary of my findings here. > > That's quite a lot of breakage for such a small sample of machines. I do > suspect that if we were to merge this lot into mainline, all sorts of bad > stuff would happen. Yup. > otoh, as you point out, pretty much everthing which goes wrong is due to > incorrect or dubious driver behaviour, and there is value in weeding these > things out. > > But the problem with this process is that we're weeding things out at > runtime. Some drivers don't get used by many people and users of some > architectures (esp embedded) tend to lag kernel.org by a long time. So it > could be years before all the fallout from this change is finally wrapped > up. Yes, that is a big concern. However, the same embedded people is starting to use both kexec and kdump, so they may suffer the issues we are trying to weed out anyway, even if these patches are not applied. The difference is that with this new functionality it is possible to catch potential problems relatively easily, because any incorrect behaviour this may cause will be easily reproducible and, in most cases, will reveal itself early at boot time. As things stand now, I guess we will keep seeing occasional crashes and strange behaviour in kexec-booted kernels, which in some cases will be due to incorrect handling of spurious interrupts. Besides, such problems are really difficult to reproduce because, commonly, we would need to hit an obscure corner case. > > If we find drivers that are not fixable we should probably consider this > > new behaviour on a per-driver basis, as Andrew suggested. > > We haven't found any such yet, have we? Not yet, fortunately. > > This would > > probably require passing a new flag into request_irq. Besides, when such > > a driver is detected we should emit a warning that it may not work > > properly in a kdump kernel. > > > > I would appreciate your comments on this. > > Oh well, let's just keep maintaining it in -mm for now, gather more > information so that we can make a more informed decision later on. Makes sense. I will look into all the issues I have found so far, do some more testing (I think a new approach is needed to speed up testing of these issues...), and get back to you. Thank you. Fernando - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] Debug handling of early spurious interrupts
On Mon, 30 Jul 2007 18:58:14 +0900 Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote: > > > > So bad things might happen because of this change. And if they do, they > > will take a lng time to be discovered, because non-shared interrupt > > handlers tend to dwell in crufty old drivers which not many people use. > > > > So I'm wondering if it would be more prudent to only do all this for shared > > handlers? > > I have been testing this patches on all my machines, which spans three > different architectures: i386, x86_64 (EM64T and Opteron), and ia64. > > The good news is that nothing catastrophic happened and, in the process, > I managed to find some somewhat buggy interrupt handlers. > > I will present a brief summary of my findings here. That's quite a lot of breakage for such a small sample of machines. I do suspect that if we were to merge this lot into mainline, all sorts of bad stuff would happen. otoh, as you point out, pretty much everthing which goes wrong is due to incorrect or dubious driver behaviour, and there is value in weeding these things out. But the problem with this process is that we're weeding things out at runtime. Some drivers don't get used by many people and users of some architectures (esp embedded) tend to lag kernel.org by a long time. So it could be years before all the fallout from this change is finally wrapped up. > > If we find drivers that are not fixable we should probably consider this > new behaviour on a per-driver basis, as Andrew suggested. We haven't found any such yet, have we? > This would > probably require passing a new flag into request_irq. Besides, when such > a driver is detected we should emit a warning that it may not work > properly in a kdump kernel. > > I would appreciate your comments on this. Oh well, let's just keep maintaining it in -mm for now, gather more information so that we can make a more informed decision later on. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] Debug handling of early spurious interrupts
On Fri, 2007-07-20 at 14:43 -0700, Andrew Morton wrote: > On Fri, 20 Jul 2007 11:20:43 +0900 > Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote: > > > With the advent of kdump it is possible that device drivers receive > > interrupts generated in the context of a previous kernel. Ideally > > quiescing the underlying devices should suffice but not all drivers do > > this, either because it is not possible or because they did not > > contemplate this case. Thus drivers ought to be able to handle > > interrupts coming in as soon as the interrupt handler is registered. > > > > Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]> > > --- > > > > diff -urNp linux-2.6.22-orig/kernel/irq/manage.c > > linux-2.6.22-pendirq/kernel/irq/manage.c > > --- linux-2.6.22-orig/kernel/irq/manage.c 2007-07-19 18:18:53.0 > > +0900 > > +++ linux-2.6.22-pendirq/kernel/irq/manage.c2007-07-19 > > 19:43:41.0 +0900 > > @@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha > > > > select_smp_affinity(irq); > > > > - if (irqflags & IRQF_SHARED) { > > - /* > > -* It's a shared IRQ -- the driver ought to be prepared for it > > -* to happen immediately, so let's make sure > > -* We do this before actually registering it, to make sure that > > -* a 'real' IRQ doesn't run in parallel with our fake > > -*/ > > - if (irqflags & IRQF_DISABLED) { > > - unsigned long flags; > > + /* > > +* With the advent of kdump it possible that device drivers receive > > +* interrupts generated in the context of a previous kernel. Ideally > > +* quiescing the underlying devices should suffice but not all drivers > > +* do this, either because it is not possible or because they did not > > +* contemplate this case. Thus drivers ought to be able to handle > > +* interrupts coming in as soon as the interrupt handler is registered. > > +* > > +* Besides, if it is a shared IRQ the driver ought to be prepared for > > +* it to happen immediately too. > > +* > > +* We do this before actually registering it, to make sure that > > +* a 'real' IRQ doesn't run in parallel with our fake. > > +*/ > > + if (irqflags & IRQF_DISABLED) { > > + unsigned long flags; > > > > - local_irq_save(flags); > > - handler(irq, dev_id); > > - local_irq_restore(flags); > > - } else > > - handler(irq, dev_id); > > + local_irq_save(flags); > > + retval = handler(irq, dev_id); > > + local_irq_restore(flags); > > + } else > > + retval = handler(irq, dev_id); > > + if (retval == IRQ_HANDLED) { > > + printk(KERN_WARNING > > + "%s (IRQ %d) handled a spurious interrupt\n", > > + devname, irq); > > } > > > > This change means that we'll run the irq handler at request_irq()-time even > for non-shared interrupts. > > This is a bit of a worry. See, shared-interrupt handlers are required to > be able to cope with being called when their device isn't interrupting. > But nobody ever said that non-shared interrupt handlers need to be able to > cope with that. > > Hence these non-shared handlers are within their rights to emit warning > printks, go BUG or accuse the operator of having busted hardware. > > So bad things might happen because of this change. And if they do, they > will take a lng time to be discovered, because non-shared interrupt > handlers tend to dwell in crufty old drivers which not many people use. > > So I'm wondering if it would be more prudent to only do all this for shared > handlers? I have been testing this patches on all my machines, which spans three different architectures: i386, x86_64 (EM64T and Opteron), and ia64. The good news is that nothing catastrophic happened and, in the process, I managed to find some somewhat buggy interrupt handlers. I will present a brief summary of my findings here. That said, I have to admit that these are still preliminary results and have not had time to dig deep into all the issues. I would like hear from you first. - Test environment -- x86_64 [EM64T] CPU: Intel(R) Xeon(TM) CPU 3.20GHz (2 cpus) SCSI: Adaptec AIC-7902B U320 IDE: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller -- x86_64 [Opteron] CPU: AMD Opteron(tm) Processor 240 (2 cpus) IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) -- ia64 CPU: Itanium 2 Madison (2 cpus) SCSI: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI IDE: Intel Corporation 82801DB (ICH4) IDE Controller -- i386 CPU: Intel(R) Xeon(TM) CPU 2.40GHz (2 cpus) IDE: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) - Test results for i386 --i8042 (IRQ 12) handled a spurious interrupt -> i8042 The culprit here was the i
Re: [PATCH 2/2] Debug handling of early spurious interrupts
On Fri, 20 Jul 2007 11:20:43 +0900 Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote: > With the advent of kdump it is possible that device drivers receive > interrupts generated in the context of a previous kernel. Ideally > quiescing the underlying devices should suffice but not all drivers do > this, either because it is not possible or because they did not > contemplate this case. Thus drivers ought to be able to handle > interrupts coming in as soon as the interrupt handler is registered. > > Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]> > --- > > diff -urNp linux-2.6.22-orig/kernel/irq/manage.c > linux-2.6.22-pendirq/kernel/irq/manage.c > --- linux-2.6.22-orig/kernel/irq/manage.c 2007-07-19 18:18:53.0 > +0900 > +++ linux-2.6.22-pendirq/kernel/irq/manage.c 2007-07-19 19:43:41.0 > +0900 > @@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha > > select_smp_affinity(irq); > > - if (irqflags & IRQF_SHARED) { > - /* > - * It's a shared IRQ -- the driver ought to be prepared for it > - * to happen immediately, so let's make sure > - * We do this before actually registering it, to make sure that > - * a 'real' IRQ doesn't run in parallel with our fake > - */ > - if (irqflags & IRQF_DISABLED) { > - unsigned long flags; > + /* > + * With the advent of kdump it possible that device drivers receive > + * interrupts generated in the context of a previous kernel. Ideally > + * quiescing the underlying devices should suffice but not all drivers > + * do this, either because it is not possible or because they did not > + * contemplate this case. Thus drivers ought to be able to handle > + * interrupts coming in as soon as the interrupt handler is registered. > + * > + * Besides, if it is a shared IRQ the driver ought to be prepared for > + * it to happen immediately too. > + * > + * We do this before actually registering it, to make sure that > + * a 'real' IRQ doesn't run in parallel with our fake. > + */ > + if (irqflags & IRQF_DISABLED) { > + unsigned long flags; > > - local_irq_save(flags); > - handler(irq, dev_id); > - local_irq_restore(flags); > - } else > - handler(irq, dev_id); > + local_irq_save(flags); > + retval = handler(irq, dev_id); > + local_irq_restore(flags); > + } else > + retval = handler(irq, dev_id); > + if (retval == IRQ_HANDLED) { > + printk(KERN_WARNING > +"%s (IRQ %d) handled a spurious interrupt\n", > +devname, irq); > } > This change means that we'll run the irq handler at request_irq()-time even for non-shared interrupts. This is a bit of a worry. See, shared-interrupt handlers are required to be able to cope with being called when their device isn't interrupting. But nobody ever said that non-shared interrupt handlers need to be able to cope with that. Hence these non-shared handlers are within their rights to emit warning printks, go BUG or accuse the operator of having busted hardware. So bad things might happen because of this change. And if they do, they will take a lng time to be discovered, because non-shared interrupt handlers tend to dwell in crufty old drivers which not many people use. So I'm wondering if it would be more prudent to only do all this for shared handlers? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] Debug handling of early spurious interrupts
With the advent of kdump it is possible that device drivers receive interrupts generated in the context of a previous kernel. Ideally quiescing the underlying devices should suffice but not all drivers do this, either because it is not possible or because they did not contemplate this case. Thus drivers ought to be able to handle interrupts coming in as soon as the interrupt handler is registered. Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]> --- diff -urNp linux-2.6.22-orig/kernel/irq/manage.c linux-2.6.22-pendirq/kernel/irq/manage.c --- linux-2.6.22-orig/kernel/irq/manage.c 2007-07-19 18:18:53.0 +0900 +++ linux-2.6.22-pendirq/kernel/irq/manage.c2007-07-19 19:43:41.0 +0900 @@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha select_smp_affinity(irq); - if (irqflags & IRQF_SHARED) { - /* -* It's a shared IRQ -- the driver ought to be prepared for it -* to happen immediately, so let's make sure -* We do this before actually registering it, to make sure that -* a 'real' IRQ doesn't run in parallel with our fake -*/ - if (irqflags & IRQF_DISABLED) { - unsigned long flags; + /* +* With the advent of kdump it possible that device drivers receive +* interrupts generated in the context of a previous kernel. Ideally +* quiescing the underlying devices should suffice but not all drivers +* do this, either because it is not possible or because they did not +* contemplate this case. Thus drivers ought to be able to handle +* interrupts coming in as soon as the interrupt handler is registered. +* +* Besides, if it is a shared IRQ the driver ought to be prepared for +* it to happen immediately too. +* +* We do this before actually registering it, to make sure that +* a 'real' IRQ doesn't run in parallel with our fake. +*/ + if (irqflags & IRQF_DISABLED) { + unsigned long flags; - local_irq_save(flags); - handler(irq, dev_id); - local_irq_restore(flags); - } else - handler(irq, dev_id); + local_irq_save(flags); + retval = handler(irq, dev_id); + local_irq_restore(flags); + } else + retval = handler(irq, dev_id); + if (retval == IRQ_HANDLED) { + printk(KERN_WARNING + "%s (IRQ %d) handled a spurious interrupt\n", + devname, irq); } retval = setup_irq(irq, action); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/