Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Hi Milton, sorry for the delayed answer, was on vacation. linux-kernel-ow...@vger.kernel.org wrote on 21.05.2010 11:18:01: On Thu, 20 May 2010 at 10:21:36 +0200 (CEST) Thomas Gleixner wrote: On Thu, 20 May 2010, Michael Ellerman wrote: On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished; that'd work as well, but needs testing. Yeah, I think that was the idea for the custom flow handler. We'd reset the processor priority so we can take other interrupts (which the EOI usually does for you), then do the actual EOI after the handler finished. That only works when the card does not issue new interrupts until the EOI happens. If the EOI is only relevant for the interrupt controller, then you are going to lose any edge which comes in before the EOI as well. Well, the real MSIs have an extra bit to allow the EOI to dally behind the MMIO on another path, and that should cover this race when the irq is left enabled. Jan-Bernd: HEA has that change, right? I don't know. We never hit problems, so we did not look very deep into this area. We probably have to talk to the HEA HW developers to be sure.
Regards, Jan-Bernd ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 20 May 2010 at 10:21:36 +0200 (CEST) Thomas Gleixner wrote: On Thu, 20 May 2010, Michael Ellerman wrote: On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished; that'd work as well, but needs testing. Yeah, I think that was the idea for the custom flow handler. We'd reset the processor priority so we can take other interrupts (which the EOI usually does for you), then do the actual EOI after the handler finished. That only works when the card does not issue new interrupts until the EOI happens. If the EOI is only relevant for the interrupt controller, then you are going to lose any edge which comes in before the EOI as well. Well, the real MSIs have an extra bit to allow the EOI to dally behind the MMIO on another path, and that should cover this race when the irq is left enabled. Jan-Bernd: HEA has that change, right? milton
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu May 20 at 11:28:36 EST in 2010, Michael Ellerman wrote: On Wed, 2010-05-19 at 07:16 -0700, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. But that's the point, no interrupts on XICS are reported as edge, even if they are actually edge somewhere deep in the hardware. I don't think we have any reliable way to determine what is what. The platform doesn't tell us this information. The driver might know but we don't need this information. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. I'm not really sure either, but I think it's a case of a leaky abstraction on the part of the hypervisor. Edge interrupts behave as level as long as you handle the irq before EOI, but if you mask they don't. But Milton's the expert on that. More like the hardware actually converts them. They are all handled with the same presentation. The XICS interrupt system is highly scalable and distributed in implementation, with multiple priority delivery and unlimited nesting. 
First, a few features and a description: The hardware has two bits of storage for every LSI interrupt source in the system to say that the interrupt is idle, pending, or was rejected and will be retried later. The hardware also stores a destination and delivery priority, settable by software. The destination can be a specific cpu thread, or a global distribution queue of all (online) threads (in the partition). While the hardware used to have 256 priority levels available (255 usable, one for cpu not interrupted), some bits have been stolen and today we only guarantee 16 levels are available to the OS (15 for delivery and one for source disabled / cpu not processing any interrupt). [The current Linux kernel delivers all device interrupts at one level but IPIs at a higher level. To avoid overflowing the irq stack we don't allow device interrupts while processing any external interrupt.] The interrupt presentation layer likewise scales, with a separate instance for each cpu thread in the system. A single IPI source per thread is part of this instance; when a cpu wants to interrupt another, it writes the priority of the IPI to that cpu's presentation logic. When an interrupt is signaled, the hardware checks the state of that interrupt, and if previously idle it sends an interrupt request with its source number and priority towards the programmed destination, either a specific cpu thread or the global queue of all processors in the system. If that cpu is already handling an interrupt of the same or higher (lower valued) priority, either the incoming interrupt will be passed to the next cpu (if the destination was global) or it will be rejected, and the ISU will update its state and try again later. If the cpu had a prior interrupt pending at a lower priority, then the old interrupt will be rejected back to its ISU instead.
The normal behavior is that a load from a presentation logic register causes the interrupt source number and the previous priority of the cpu to be delivered to the cpu, and the cpu priority to be raised to that of the incoming interrupt. The external interrupt indication to the cpu is removed. At this point the presentation hardware forgets all history of this interrupt. A store to the same register resets the priority of the cpu (which would naturally be the level before it was interrupted, if it stores the value loaded) and sends an EOI (end of interrupt) to the interrupt source specified in the write. This resets the two bits from pending to idle. The software is allowed to reset the cpu priority to allow other interrupts of equal (or even lower) priority to be presented, independently of creating the EOI for this source. However, until software creates an EOI for a specific source, that source will not be presented again until the machine is reset. The only rule is you can't raise your priority (which might have to reject a pending interrupt) when you create (write) the EOI. A cpu can also change its priority to tell the hardware to reject this interrupt (possibly representing to
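Milton's delivery rules above can be condensed into a toy model. This is a sketch of the stated semantics only, not of real XICS hardware or firmware; all type and function names here are made up for illustration (lower numeric value means higher priority, 0xff means the cpu is not processing any interrupt):

```c
#include <assert.h>

/* Toy model of the delivery rules described above (illustrative only). */

enum src_state { IDLE, PENDING };

struct cpu {
    int cppr;       /* current processor priority; 0xff = not interrupted */
    int active_src; /* source currently presented, -1 if none */
};

struct source {
    enum src_state state;
    int prio;       /* delivery priority programmed by software */
};

/* Signal a source towards a cpu. It is accepted only if it beats the
 * cpu's current priority; an interrupt of the same or higher (lower
 * valued) priority is rejected back to the source, which stays pending
 * and is retried later. Returns 1 if accepted. */
int signal(struct cpu *c, struct source *s, int src_no)
{
    s->state = PENDING;
    if (s->prio < c->cppr) {
        c->cppr = s->prio;
        c->active_src = src_no;
        return 1;
    }
    return 0; /* rejected; the ISU will retry */
}

/* EOI: the store resets the cpu priority and moves the source's two
 * state bits from pending back to idle, allowing re-presentation. */
void eoi(struct cpu *c, struct source *s, int restored_prio)
{
    s->state = IDLE;
    c->cppr = restored_prio;
    c->active_src = -1;
}
```

The model also shows why a source that never receives an EOI is never presented again: its state stays pending and signal() is never re-driven for it.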
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Hi, Michael Ellerman mich...@ellerman.id.au wrote on 20.05.2010 03:34:08: Subject: Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) On Wed, 2010-05-19 at 23:08 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Thomas Gleixner wrote: I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished; that'd work as well, but needs testing. Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. Michael, does this answer your question? Regards, Jan-Bernd
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side. Thanks, tglx
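The "napi_reschedule inCOMplete" behaviour seen later in this thread's traces comes down to NAPI's scheduled bit: a second interrupt arriving before napi_complete() finds the poller already scheduled and does no extra work. A minimal stand-alone model of that bit (my own sketch, not the real net-core implementation):

```c
#include <assert.h>

/* Toy model of the NAPI scheduling state bit (illustrative only). */

struct napi { int sched; /* models NAPI_STATE_SCHED */ };

/* Succeeds only while the poller is not already scheduled, so an
 * interrupt that races with a still-running poll is absorbed. */
int napi_reschedule_model(struct napi *n)
{
    if (n->sched)
        return 0;   /* already scheduled: nothing to do */
    n->sched = 1;
    return 1;       /* poller will run */
}

/* Clearing the bit at napi_complete() is what re-opens the window
 * for the next interrupt to schedule the poller again. */
void napi_complete_model(struct napi *n)
{
    n->sched = 0;
}
```

In the traced sequence, the puzzling second ehea_recv_irq_handler() fires between the EOI and napi_complete(), so its reschedule attempt returns 0, which is exactly what the second trace entry shows.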
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 20 May 2010, Michael Ellerman wrote: On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished; that'd work as well, but needs testing. Yeah, I think that was the idea for the custom flow handler. We'd reset the processor priority so we can take other interrupts (which the EOI usually does for you), then do the actual EOI after the handler finished. That only works when the card does not issue new interrupts until the EOI happens. If the EOI is only relevant for the interrupt controller, then you are going to lose any edge which comes in before the EOI as well. Thanks, tglx
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Hi Thomas Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. Is this the same interrupt we are seeing here, or do we see a second, different interrupt popping up on the same CPU? As I said, with multiple receive queues (if enabled) you can have multiple interrupts in parallel. Please check if multiple queues are enabled. The following module parameter is used for that: MODULE_PARM_DESC(use_mcs, "0:NAPI, 1:Multiple receive queues, Default = 0"); You should also see the number of used HEA interrupts in /proc/interrupts. queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side. If you mean the re-enable piece of code, it is not very obvious, you are right. Interrupts are only generated if a particular register for our completion queues is written.
We do this in the following lines:

ehea_reset_cq_ep(pr->recv_cq);
ehea_reset_cq_ep(pr->send_cq);
ehea_reset_cq_n1(pr->recv_cq);
ehea_reset_cq_n1(pr->send_cq);

So this is in a way an indirect way to ask for interrupts when new completions were written to memory. We don't really disable/enable interrupts on the HEA chip itself. I think there are some mechanisms built into the HEA chip that should prevent interrupts from getting lost. But that is something that is / was completely hidden from us, so my skill is very limited there. If more details are needed here we should involve the PHYP guys + eHEA HW guys, if not already done. Did anyone already talk to them? Regards, Jan-Bernd
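The re-arm behaviour Jan-Bernd describes (interrupts fire again only after the completion-queue register is written) can be modelled roughly as a one-shot armed flag. This is a sketch under that assumption; the struct fields and function names are hypothetical, and only the real ehea_reset_cq_*() calls touch hardware:

```c
#include <assert.h>

/* Toy model of a completion queue whose interrupt must be re-armed
 * by a register write (illustrative only; names are hypothetical). */

struct cq {
    int armed;        /* models the register write that requests irqs */
    int pending_irq;  /* models an interrupt being raised */
};

/* Analogous role to ehea_reset_cq_n1(): arm the queue so the next
 * completion written to memory raises an interrupt. */
void reset_cq(struct cq *cq)
{
    cq->armed = 1;
}

/* A completion arriving only raises an interrupt if the queue is
 * armed, and disarms it again: one interrupt per re-arm. */
void completion_arrives(struct cq *cq)
{
    if (cq->armed) {
        cq->pending_irq = 1;
        cq->armed = 0;
    }
}
```

Under this model the driver never masks anything: leaving the NAPI poll loop and writing the register is what re-opens the interrupt window, which matches "we don't really disable/enable interrupts on the HEA chip itself".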
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Jan-Bernd, On Thu, 20 May 2010, Jan-Bernd Themann wrote: Hi Thomas Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. Is this the same interrupt we are seeing here, or do we see a second, different interrupt popping up on the same CPU? As I said, with multiple receive queues (if enabled) you can have multiple interrupts in parallel. According to the traces it's the very same interrupt number. Please check if multiple queues are enabled. The following module parameter is used for that: MODULE_PARM_DESC(use_mcs, "0:NAPI, 1:Multiple receive queues, Default = 0"); You should also see the number of used HEA interrupts in /proc/interrupts I leave that for Will and Darren, they have the hardware :) queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side. If you mean the re-enable piece of code, it is not very obvious, you are right.
Interrupts are only generated if a particular register for our completion queues is written. We do this in the following lines:

ehea_reset_cq_ep(pr->recv_cq);
ehea_reset_cq_ep(pr->send_cq);
ehea_reset_cq_n1(pr->recv_cq);
ehea_reset_cq_n1(pr->send_cq);

So this is in a way an indirect way to ask for interrupts when new completions were written to memory. We don't really disable/enable interrupts on the HEA chip itself. Ah, ok. That's after the napi_complete which looks correct. I think there are some mechanisms built into the HEA chip that should prevent interrupts from getting lost. But that is something that is / was completely hidden from us, so my skill is very limited there. If more details are needed here we should involve the PHYP guys + eHEA HW guys, if not already done. Did anyone already talk to them? Will or Darren might have, but let's gather more information first before we rack their nerves :) Thanks, tglx
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Thomas Gleixner wrote: Please check if multiple queues are enabled. The following module parameter is used for that: MODULE_PARM_DESC(use_mcs, "0:NAPI, 1:Multiple receive queues, Default = 0"); You should also see the number of used HEA interrupts in /proc/interrupts I leave that for Will and Darren, they have the hardware :)

16: 477477 ... XICS Level IPI
17: 129 ... XICS Level hvc_console
18: 0 ... XICS Level RAS_EPOW
33: 139232 ... XICS Level mlx4_core
256: 3 ... XICS Level ehea_neq
259: 0 ... XICS Level eth0-aff
260: 2082153 ... XICS Level eth0-queue0
289: 119166 ... XICS Level ipr
305: 0 ... XICS Level ohci_hcd:usb2
306: 0 ... XICS Level ohci_hcd:usb3
307: 2389839 ... XICS Level ehci_hcd:usb1

Nope, multiple rx queues not enabled. thanks, Nivedita
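The /proc/interrupts check used above (one eth0-queueN line per enabled receive queue) is easy to automate. A small stand-alone helper, purely illustrative, that counts how many lines of an interrupts-style snapshot mention a given name:

```c
#include <string.h>

/* Count lines in a /proc/interrupts-style snapshot containing `name`.
 * Illustrative helper only; in practice you would just grep the file. */
int count_irq_lines(const char *snapshot, const char *name)
{
    int count = 0;
    const char *p = snapshot;

    while (p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        char line[256];

        if (len < sizeof(line)) {
            memcpy(line, p, len);
            line[len] = '\0';
            if (strstr(line, name))
                count++;
        }
        p = nl ? nl + 1 : NULL;
    }
    return count;
}
```

With the listing above, counting "eth0-queue" lines would return 1, confirming Nivedita's conclusion that only a single receive queue is in use.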
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/20/2010 01:14 AM, Thomas Gleixner wrote: On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. I was going to suggest that because these are threaded handlers, perhaps they are rescheduled on a different CPU and then receive the interrupt for the other CPU/queue that Jan was mentioning. But the handlers are affined, if I remember correctly, and we aren't running with multiple receive queues. So we're back to the same question: why are we seeing another irq? It comes in before napi_complete() and therefore before the ehea_reset*() block of calls which do the equivalent of re-enabling interrupts. -- Darren queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side. Thanks, tglx -- Darren Hart IBM Linux Technology Center Real-Time Linux Team
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 20 May 2010, Darren Hart wrote: On 05/20/2010 01:14 AM, Thomas Gleixner wrote: On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. I was going to suggest that because these are threaded handlers, perhaps they are rescheduled on a different CPU and then receive the interrupt for the other CPU/queue that Jan was mentioning. But the handlers are affined, if I remember correctly, and we aren't running with multiple receive queues. So we're back to the same question: why are we seeing another irq? It comes in before napi_complete() and therefore before the ehea_reset*() block of calls which do the equivalent of re-enabling interrupts. Can you slap a few trace points into that driver with a stock mainline kernel and verify that? Thanks, tglx
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 2010-05-20 at 11:05 +0200, Jan-Bernd Themann wrote: Hi Thomas Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. Is this the same interrupt we are seeing here, or do we see a second, different interrupt popping up on the same CPU? As I said, with multiple receive queues (if enabled) you can have multiple interrupts in parallel. Same interrupt number (260). Per the trace data, the first ehea_recv_irq_handler (at 117.904525) was on cpu 0, the second (at 117.904689) was on cpu 1.
...-2180 [000] 117.904525: .ehea_recv_irq_handler: ENTER 0 c000e8bd08b0
...-2180 [000] 117.904527: .ehea_recv_irq_handler: napi_reschedule COMpleted c000e8bd08b0
...-2180 [000] 117.904528: .ehea_recv_irq_handler: EXIT reschedule(1) 1 c000e8bd08b0
...-2180 [000] 117.904529: .xics_unmask_irq: xics: unmask virq 260 772
...-2180 [000] 117.904547: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
...-2180 [000] 117.904586: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:11416 status:0 5
...-2180 [000] 117.904602: .handle_fasteoi_irq: 260 8004000
...-2180 [000] 117.904603: .xics_mask_irq: xics: mask virq 260 772
...-2180 [000] 117.904634: .xics_mask_real_irq: xics: before: mask_real 772 status:0 5
...-2180 [000] 117.904668: .xics_mask_real_irq: xics: after: mask_real 772 status:0 ff
...-2180 [000] 117.904669: .handle_fasteoi_irq: pre-action: 260 8004100
...-2180 [000] 117.904671: .handle_fasteoi_irq: post-action: 260 8004100
...-2180 [000] 117.904672: .handle_fasteoi_irq: exit. 260 8004000
...-7 [000] 117.904681: .ehea_poll: ENTER 1 c000e8bd08b0 poll_counter:0 force:0
...-7 [000] 117.904683: .ehea_proc_rwqes: ehea_check_cqe 0
...-2180 [001] 117.904689: .ehea_recv_irq_handler: ENTER 1 c000e8bd08b0
...-7 [000] 117.904690: .ehea_proc_rwqes: ehea_check_cqe 0
...-2180 [001] 117.904691: .ehea_recv_irq_handler: napi_reschedule inCOMplete c000e8bd08b0
...-2180 [001] 117.904692: .ehea_recv_irq_handler: EXIT reschedule(0) 1 c000e8bd08b0
...-2180 [001] 117.904694: .xics_unmask_irq: xics: unmask virq 260 772
...-7 [000] 117.904702: .ehea_refill_rq2: ehea_refill_rq2
...-7 [000] 117.904703: .ehea_refill_rq_def: ehea_refill_rq_def
...-7 [000] 117.904704: .ehea_refill_rq3: ehea_refill_rq3
...-7 [000] 117.904705: .ehea_refill_rq_def: ehea_refill_rq_def
...-7 [000] 117.904706: .napi_complete: napi_complete: ENTER state: 1 c000e8bd08b0
...-7 [000] 117.904707: .napi_complete: napi_complete: EXIT state: 0 c000e8bd08b0
...-7 [000] 117.904710: .ehea_poll: EXIT !cqe rx(2). 0 c000e8bd08b0
...-2180 [001] 117.904719: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
...-2180 [001] 117.904761: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:12705 status:0 5

Please check if multiple queues are enabled. The following module parameter is used for that: MODULE_PARM_DESC(use_mcs, "0:NAPI, 1:Multiple receive queues, Default = 0"); No module parameters were used, should be plain old defaults. You should also see the number of used HEA interrupts in /proc/interrupts

256: 100000 00 XICS Level ehea_neq
259: 000000 00 XICS Level eth0-aff
260: 36196500000 00 XICS Level eth0-queue0

queues, each with its own interrupt, so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side.
If you mean the re-enable piece of code, it is not very obvious, you are right. Interrupts are only generated if a particular register for our completion queues is written. We do this in the following line
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 2010-05-20 at 16:45 +0200, Thomas Gleixner wrote: On Thu, 20 May 2010, Darren Hart wrote: On 05/20/2010 01:14 AM, Thomas Gleixner wrote: On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side, as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though:

ehea_recv_irq_handler()
  napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() ???
  napi_reschedule()
...
napi_complete()

Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. I was going to suggest that because these are threaded handlers, perhaps they are rescheduled on a different CPU and then receive the interrupt for the other CPU/queue that Jan was mentioning. But the handlers are affined, if I remember correctly, and we aren't running with multiple receive queues. So we're back to the same question: why are we seeing another irq? It comes in before napi_complete() and therefore before the ehea_reset*() block of calls which do the equivalent of re-enabling interrupts. Can you slap a few trace points into that driver with a stock mainline kernel and verify that? 2.6.33.4 (non-rt kernel) with similar trace_printk hooks in place... Most data lumps look like so:

idle-0 [000] 1097.685337: .handle_fasteoi_irq: ENTER 260 4000
idle-0 [000] 1097.685339: .handle_fasteoi_irq: pre-action 260 4100
idle-0 [000] 1097.685339: .ehea_recv_irq_handler: ENTER c000e8980700
idle-0 [000] 1097.685340: .ehea_recv_irq_handler: napi_schedule ... c000e8980700
idle-0 [000] 1097.685341: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c000e8980700
idle-0 [000] 1097.685342: .ehea_recv_irq_handler: EXIT c000e8980700
idle-0 [000] 1097.685343: .handle_fasteoi_irq: post-action 260 4100
idle-0 [000] 1097.685344: .handle_fasteoi_irq: EXIT. 260 4000
idle-0 [000] 1097.685346: .ehea_poll: ENTER c000e8980700
idle-0 [000] 1097.685352: .napi_complete: napi_complete: ENTER c000e8980700
idle-0 [000] 1097.685352: .napi_complete: napi_complete: EXIT c000e8980700
idle-0 [000] 1097.685355: .ehea_poll: EXIT !cqe rx(1) c000e8980700

But I did see one like this, which shows a ehea_recv_irq_handler ENTER within a ehea_poll ENTER. (which I think is what you were expecting, or wanted to verify..)

idle-0 [000] 1097.616261: .handle_fasteoi_irq: ENTER 260 4000
idle-0 [000] 1097.616262: .handle_fasteoi_irq: pre-action 260 4100
* idle-0 [000] 1097.616263: .ehea_recv_irq_handler: ENTER c000e8980700
idle-0 [000] 1097.616264: .ehea_recv_irq_handler: napi_schedule ... c000e8980700
idle-0 [000] 1097.616265: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c000e8980700
idle-0 [000] 1097.616265: .ehea_recv_irq_handler: EXIT c000e8980700
idle-0 [000] 1097.616266: .handle_fasteoi_irq: post-action 260 4100
idle-0 [000] 1097.616268: .handle_fasteoi_irq: EXIT. 260 4000
* idle-0 [000] 1097.616270: .ehea_poll: ENTER c000e8980700
idle-0 [000] 1097.616282: .handle_fasteoi_irq: ENTER 260 4000
idle-0 [000] 1097.616283: .handle_fasteoi_irq: pre-action 260 4100
* idle-0 [000] 1097.616284: .ehea_recv_irq_handler: ENTER c000e8980700
idle-0 [000] 1097.616285: .ehea_recv_irq_handler: napi_schedule ... c000e8980700
idle-0 [000] 1097.616286: .ehea_recv_irq_handler: napi_schedule NOT Calling __napi_schedule... c000e8980700
idle-0 [000] 1097.616286: .ehea_recv_irq_handler: EXIT c000e8980700
idle-0 [000] 1097.616287: .handle_fasteoi_irq: post-action 260 4100
idle-0 [000] 1097.616289: .handle_fasteoi_irq: EXIT. 260 4000
idle-0 [000] 1097.616299: .napi_complete: napi_complete: ENTER c000e8980700
idle-0 [000] 1097.616300: .napi_complete: napi_complete: EXIT c000e8980700
idle-0 [000] 1097.616302: .ehea_poll: napi_reschedule COMpleted c000e8980700
idle-0 [000]
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, one of many reasons why we felt this was the right fix. The other is that there is no real additional overhead in running this as non-threaded since the receive handler is so short (just napi_schedule()). -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Wed, 19 May 2010, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. I wonder whether the XICS interrupts which are edge type can be identified from the irq_desc->flags. Then we could avoid the masking for those in the fasteoi_handler in general. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished, that'd work as well, but needs testing. Right, one of many reasons why we felt this was the right fix. The other is that there is no real additional overhead in running this as non-threaded since the receive handler is so short (just napi_schedule()). Yes, in the case at hand it's the right thing to do, as we avoid another wakeup/context switch. Thanks, tglx
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Wed, 19 May 2010, Thomas Gleixner wrote: I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished, that'd work as well, but needs testing. Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. So even in mainline you get repeated rx interrupts when packets arrive while napi is processing the poll, which is suboptimal at least. In fact it is counterproductive, as the whole purpose of NAPI is to _NOT_ get interrupts for consecutive incoming packets while the poll is active. Most of the other network drivers do:

rx_irq()
    disable rx interrupts on card
    napi_schedule()

Now when the napi poll is done (no more packets available) the driver reenables the rx interrupt on the card. Thanks, tglx
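The pattern Thomas describes can be sketched as a toy model in plain C. This is purely illustrative: the names (card_rx_irq_enabled, rx_irq, napi_poll, etc.) are stand-ins for the driver/NAPI interaction, not real kernel symbols.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of the canonical rx-irq/NAPI handshake: the hard irq
 * handler disables further rx interrupts on the card and schedules
 * NAPI; the poll routine re-enables them when the ring is drained. */

static bool card_rx_irq_enabled = true;
static bool napi_scheduled = false;
static int  irqs_taken = 0;

/* What most network drivers do in the hard irq handler. */
void rx_irq(void)
{
    if (!card_rx_irq_enabled)
        return;                   /* card fires nothing while disabled */
    irqs_taken++;
    card_rx_irq_enabled = false;  /* disable rx interrupts on card */
    napi_scheduled = true;        /* napi_schedule() */
}

/* NAPI poll: process packets, then re-enable the card's rx irq. */
void napi_poll(void)
{
    if (!napi_scheduled)
        return;
    /* ... process rx ring ... */
    napi_scheduled = false;       /* napi_complete() */
    card_rx_irq_enabled = true;   /* re-enable rx interrupts on card */
}
```

With this handshake, a second packet arriving while the poll is active does not generate a second interrupt, which is exactly the behaviour the ehea driver was not getting.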
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Wed, 2010-05-19 at 07:16 -0700, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. But that's the point, no interrupts on XICS are reported as edge, even if they are actually edge somewhere deep in the hardware. I don't think we have any reliable way to determine what is what. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. I'm not really sure either, but I think it's a case of a leaky abstraction on the part of the hypervisor. Edge interrupts behave as level as long as you handle the irq before EOI, but if you mask they don't. But Milton's the expert on that. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, one of many reasons why we felt this was the right fix. The other is that there is no real additional overhead in running this as non-threaded since the receive handler is so short (just napi_schedule()). True. It's not a fix in general though. I'm worried that we're going to see the exact same bug for MSI(-X) interrupts. 
cheers
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Darren Hart wrote: On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. I wonder whether the XICS interrupts which are edge type can be identified from the irq_desc->flags. Then we could avoid the masking for those in the fasteoi_handler in general. I'm not sure they can be. I know on other similar HW we can detect LSI vs MSI, but that's without the HV in the equation. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished, that'd work as well, but needs testing. Yeah I think that was the idea for the custom flow handler. We'd reset the processor priority so we can take other interrupts (which the EOI usually does for you), then do the actual EOI after the handler finished. 
cheers
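The flow Michael sketches (reset the processor priority first, issue the actual EOI only after the threaded handler has run) can be captured as a toy ordering model. All names here are illustrative stand-ins, not real XICS, RTAS, or genirq calls.

```c
#include <assert.h>

/* Toy ordering model of the proposed custom XICS flow handler:
 * drop the processor priority at hard-irq time so other interrupts
 * can be taken, run the threaded handler, and only then send the
 * EOI so the level-presented source cannot refire mid-handling. */

enum event { PRIORITY_RESET = 1, HANDLER_RAN, EOI_SENT };

static enum event events[8];
static int n_events;

static void record(enum event ev) { events[n_events++] = ev; }

void reset_processor_priority(void) { record(PRIORITY_RESET); }
void run_threaded_handler(void)     { record(HANDLER_RAN); }
void send_eoi(void)                 { record(EOI_SENT); }

/* Proposed flow: EOI is deferred until the handler is finished,
 * replacing the mask/unmask that costs a global-locked RTAS call. */
void delayed_eoi_flow(void)
{
    reset_processor_priority();  /* other irqs can be taken again */
    run_threaded_handler();
    send_eoi();                  /* EOI only after the handler is done */
}
```

The point of the model is only the ordering: in the standard fasteoi flow the EOI happens before the threaded handler runs, which is what lets the source fire again too early.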
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Wed, 2010-05-19 at 23:08 +0200, Thomas Gleixner wrote: On Wed, 19 May 2010, Thomas Gleixner wrote: I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, I'd love to avoid that but with real level interrupts we'd run into an interrupt storm. Though another solution would be to issue the EOI after the threaded handler finished, that'd work as well, but needs testing. Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) cheers
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... -Brian On 05/18/2010 04:33 PM, dvh...@linux.vnet.ibm.com wrote: From ad81664794e33d785f533c5edee37aaba20dd92d Mon Sep 17 00:00:00 2001 From: Darren Hart dvh...@us.ibm.com Date: Tue, 18 May 2010 11:07:13 -0700 Subject: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) The underlying hardware is edge triggered but presented by XICS as level triggered. The edge triggered interrupts are not reissued after masking. This is not a problem in mainline which does not mask the interrupt (relying on the EOI mechanism instead). The threaded interrupts in PREEMPT_RT do mask the interrupt, and can lose interrupts that occurred while masked, resulting in a hung ethernet interface. The receive handler simply calls napi_schedule(); as such, there is no significant additional overhead in making this non-threaded, since we either wake up the threaded irq handler to call napi_schedule(), or just call napi_schedule() directly to wake up the softirqs. As the receive handler is lockless, there is no need to convert any of the ehea spinlock_t's to atomic_spinlock_t's. Without this patch, a simple scp file copy loop would fail quickly (usually seconds). We have over two hours of sustained scp activity with the patch applied. Credit goes to Will Schmidt for lots of instrumentation and tracing which clarified the scenario and to Thomas Gleixner for the incredibly simple solution. 
Signed-off-by: Darren Hart dvh...@us.ibm.com Acked-by: Will Schmidt will_schm...@vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Jan-Bernd Themann them...@de.ibm.com Cc: Nivedita Singhvi n...@us.ibm.com Cc: Brian King bjki...@us.ibm.com Cc: Michael Ellerman eller...@au1.ibm.com Cc: Doug Maxey doug.ma...@us.ibm.com
---
 drivers/net/ehea/ehea_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 977c3d3..2c53df2 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -1263,7 +1263,7 @@ static int ehea_reg_interrupts(struct net_device *dev)
 			 "%s-queue%d", dev->name, i);
 		ret = ibmebus_request_irq(pr->eq->attr.ist1,
 					  ehea_recv_irq_handler,
-					  IRQF_DISABLED, pr->int_send_name,
+					  IRQF_DISABLED | IRQF_NODELAY, pr->int_send_name,
 					  pr);
 		if (ret) {
 			ehea_error("failed registering irq for ehea_queue

-- 
Brian King Linux on Power Virtualization IBM Linux Technology Center
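The race the commit message describes (an edge that fires while the line is masked is never reissued, so the threaded flow loses it) can be modeled with a small sketch. Everything here is a toy: the names are invented for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of the lost-edge race: an edge-triggered source that
 * XICS presents as level. If the line is masked when the edge fires,
 * the edge is dropped and never reissued on unmask. */

static bool masked    = false;
static bool pending   = false;  /* latched edge awaiting delivery */
static int  delivered = 0;

void device_raises_edge(void)
{
    if (!masked)
        pending = true;         /* edge latched for delivery */
    /* while masked: the edge is simply dropped, never reissued */
}

void deliver_if_pending(void)
{
    if (pending && !masked) {
        pending = false;
        delivered++;
    }
}

/* The PREEMPT_RT threaded-irq flow: mask the line, run the handler
 * in a thread later, then unmask. An edge arriving in between is
 * gone for good, leaving the interface hung. */
void threaded_flow_with_edge_during_mask(void)
{
    masked = true;              /* mask before waking the irq thread */
    device_raises_edge();       /* edge fires while masked: lost */
    masked = false;             /* unmask after the thread finishes */
    deliver_if_pending();       /* nothing pending, nothing delivered */
}
```

Mainline avoids this by never masking (relying on the EOI mechanism), and the IRQF_NODELAY patch restores that behaviour on RT by keeping the handler out of the threaded path.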
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yep, this is RT only. thanks, Nivedita
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. -- Darren -Brian On 05/18/2010 04:33 PM, dvh...@linux.vnet.ibm.com wrote: From ad81664794e33d785f533c5edee37aaba20dd92d Mon Sep 17 00:00:00 2001 From: Darren Hart dvh...@us.ibm.com Date: Tue, 18 May 2010 11:07:13 -0700 Subject: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) The underlying hardware is edge triggered but presented by XICS as level triggered. The edge triggered interrupts are not reissued after masking. This is not a problem in mainline which does not mask the interrupt (relying on the EOI mechanism instead). The threaded interrupts in PREEMPT_RT do mask the interrupt, and can lose interrupts that occurred while masked, resulting in a hung ethernet interface. The receive handler simply calls napi_schedule(); as such, there is no significant additional overhead in making this non-threaded, since we either wake up the threaded irq handler to call napi_schedule(), or just call napi_schedule() directly to wake up the softirqs. As the receive handler is lockless, there is no need to convert any of the ehea spinlock_t's to atomic_spinlock_t's. Without this patch, a simple scp file copy loop would fail quickly (usually seconds). We have over two hours of sustained scp activity with the patch applied. Credit goes to Will Schmidt for lots of instrumentation and tracing which clarified the scenario and to Thomas Gleixner for the incredibly simple solution. 
Signed-off-by: Darren Hart dvh...@us.ibm.com Acked-by: Will Schmidt will_schm...@vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Jan-Bernd Themann them...@de.ibm.com Cc: Nivedita Singhvi n...@us.ibm.com Cc: Brian King bjki...@us.ibm.com Cc: Michael Ellerman eller...@au1.ibm.com Cc: Doug Maxey doug.ma...@us.ibm.com
---
 drivers/net/ehea/ehea_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 977c3d3..2c53df2 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -1263,7 +1263,7 @@ static int ehea_reg_interrupts(struct net_device *dev)
 			 "%s-queue%d", dev->name, i);
 		ret = ibmebus_request_irq(pr->eq->attr.ist1,
 					  ehea_recv_irq_handler,
-					  IRQF_DISABLED, pr->int_send_name,
+					  IRQF_DISABLED | IRQF_NODELAY, pr->int_send_name,
 					  pr);
 		if (ret) {
 			ehea_error("failed registering irq for ehea_queue

-- 
Darren Hart IBM Linux Technology Center Real-Time Linux Team
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. Apart from the issue of losing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. cheers