On 2012-09-17 14:15, Henri Roosen wrote:
> On Mon, Sep 17, 2012 at 1:14 PM, Gilles Chanteperdrix
> <gilles.chanteperd...@xenomai.org> wrote:
>> On 09/17/2012 12:39 PM, Henri Roosen wrote:
>>
>>> On Mon, Sep 17, 2012 at 12:00 PM, Gilles Chanteperdrix
>>> <gilles.chanteperd...@xenomai.org> wrote:
>>>> On 09/17/2012 11:42 AM, Jan Kiszka wrote:
>>>>> On 2012-09-17 11:29, Gilles Chanteperdrix wrote:
>>>>>> On 09/17/2012 11:07 AM, Jan Kiszka wrote:
>>>>>>> On 2012-09-17 10:32, Gilles Chanteperdrix wrote:
>>>>>>>> On 09/17/2012 10:18 AM, Jan Kiszka wrote:
>>>>>>>>> On 2012-09-17 10:07, Gilles Chanteperdrix wrote:
>>>>>>>>>> On 09/17/2012 09:43 AM, Jan Kiszka wrote:
>>>>>>>>>>> On 2012-09-17 08:30, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> looking at x86 latencies, I found that what was taking long on my atom
>>>>>>>>>>>> was masking the fasteoi interrupts at IO-APIC level. So, I experimented
>>>>>>>>>>>> an idea: masking at LAPIC level instead of IO-APIC, by using the "task
>>>>>>>>>>>> priority" register. This seems to improve latencies on my atom:
>>>>>>>>>>>>
>>>>>>>>>>>> http://sisyphus.hd.free.fr/~gilles/core-3.4-latencies/atom.png
>>>>>>>>>>>>
>>>>>>>>>>>> This implies splitting the LAPIC vectors in a high priority and low
>>>>>>>>>>>> priority sets, the final implementation would use ipipe_enable_irqdesc
>>>>>>>>>>>> to detect a high priority domain, and change the vector at that time.
>>>>>>>>>>>>
>>>>>>>>>>>> This also improves the latencies on my old PIII with a VIA chipset, but
>>>>>>>>>>>> it generates spurious interrupts (I do not know if it really is a
>>>>>>>>>>>> matter, as handling a spurious interrupt is still faster than masking an
>>>>>>>>>>>> IO-APIC interrupt), the spurious interrupts in that case are a
>>>>>>>>>>>> documented behaviour of the LAPIC.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any interest in pursuing this idea, or are x86 with slow
>>>>>>>>>>>> IO-APIC the exception more than the rule, or having to split the vector
>>>>>>>>>>>> space appears too great a restriction?
>>>>>>>>>>>
>>>>>>>>>>> Line-based interrupts are legacy, of decreasing relevance for PCI
>>>>>>>>>>> devices - likely what we are primarily interesting in here - due to MSI.
>>>>>>>>>>
>>>>>>>>>> Even if I enable MSI, the kernel still uses these irqs for the
>>>>>>>>>> peripherals integrated to the chipset, such as the USB HCI, or ATA
>>>>>>>>>> driver (IOW, non PCI devices).
>>>>>>>>>
>>>>>>>>> Those are all PCI as well. And modern chipsets include variants of them
>>>>>>>>> with MSI(-X) support.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> atom login: root
>>>>>>>>>> # cat /proc/interrupts
>>>>>>>>>>            CPU0       CPU1
>>>>>>>>>>   0:         41          0   IO-APIC-edge      timer
>>>>>>>>>>   4:         39          0   IO-APIC-edge      serial
>>>>>>>>>>   9:          0          0   IO-APIC-fasteoi   acpi
>>>>>>>>>>  14:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>  15:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>>>>>>>>>>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>>>>>>>>>>  19:          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>>>>>>>>>>  23:       6598          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>>>>>>>>>>  43:       2704          0   PCI-MSI-edge      eth0
>>>>>>>>>>  44:        249          0   PCI-MSI-edge      snd_hda_intel
>>>>>>>>>> NMI:          0          0   Non-maskable interrupts
>>>>>>>>>> LOC:        661        644   Local timer interrupts
>>>>>>>>>> SPU:          0          0   Spurious interrupts
>>>>>>>>>> PMI:          0          0   Performance monitoring interrupts
>>>>>>>>>> IWI:          0          0   IRQ work interrupts
>>>>>>>>>> RTR:          0          0   APIC ICR read retries
>>>>>>>>>> RES:       1582       2225   Rescheduling interrupts
>>>>>>>>>> CAL:         26         48   Function call interrupts
>>>>>>>>>> TLB:         10         19   TLB shootdowns
>>>>>>>>>> ERR:          0
>>>>>>>>>> MIS:          0
>>>>>>>>>>
>>>>>>>>>> I do not think peripherals integrated to chipsets can really be
>>>>>>>>>> considered "legacy". And they tend to be used in the field...
>>>>>>>>>
>>>>>>>>> The good news is that, even on your low-end atom, you can avoid those
>>>>>>>>> latencies by CPU assignment, i.e. isolating the Linux IRQ load on one
>>>>>>>>> core and the RT on the other. That's getting easier and easier due to
>>>>>>>>> the inflation of cores.
>>>>>>>>
>>>>>>>> What if you want to use RTUSB for instance?
>>>>>>>
>>>>>>> Then I will likely not worry about a few micros of additional latency
>>>>>>> due to IO-APIC accesses.
>>>>>>
>>>>>> On my atom, taking an IO-APIC fasteoi interrupt, acking and masking it,
>>>>>> takes 10us in UP, and 20us in SMP (with the tracer on).
>>>>>
>>>>> ...and on more appropriate chipsets? I bet the Atom is (once again) off
>>>>> here.
>>>>
>>>> I do not know, do you care for sharing your traces with us? I only run
>>>> Xenomai on atom (which I am not sure do not qualify as "modern", new
>>>> atoms seem to be produced), geode (ok, this one is definitely dead, but
>>>> there seem to be people still running xenomai on them), and an old
>>>> pentium III with an old VIA686 chipset, where masking the IO-APIC is
>>>> even slower than acking the i8259.
>>>>
>>>> Anyway, the IO-APIC registers accesses does not look designed for speed:
>>>> it has an indirect scheme that seem more designed to save space in the
>>>> processor mapping and to be configured once and for all when
>>>> enabling/disabling interrupt, not at each and every interrupt.
>>>>
>>>> The point is: people may want to use Xenomai on atoms. We do not really
>>>> know on what kind of x86 people run xenomai, knowing that would help us
>>>> directing our efforts.
>>>
>>> We are currently investigating whether we can use Atom's for our
>>> future products. We have to stick to the x86 architecture and our
>>> products should work without big cooling fans. Currently running tests
>>> on Atom D2700 (which I know is EOL, but for research purposes should
>>> give us a good indication).
>>>
>>> A 20us latency gain is a lot and would be very welcome in our system!
>>
>>
>> If you enable CONFIG_MSI, do you still see some IO-APIC-fasteoi in
>> /proc/interrupts?
>>
>
> The kernel config has no CONFIG_MSI, but instead:
> CONFIG_ARCH_SUPPORTS_MSI=y
> CONFIG_PCI_MSI=y
>
> There is still IO-APIC-fasteoi in /proc/interrupts:
>
> # cat /proc/interrupts
>            CPU0       CPU1
>   0:        250          0   IO-APIC-edge      timer
>   4:         71          0   IO-APIC-edge      serial
>   7:         29          0   IO-APIC-edge
>   8:          0          0   IO-APIC-edge      rtc0
>   9:          0          0   IO-APIC-fasteoi   acpi
>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  19:         41          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>  23:       5440          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>  40:        940          0   PCI-MSI-edge      eth0
>  41:         21          0   PCI-MSI-edge      xhci_hcd
>  42:          0          0   PCI-MSI-edge      xhci_hcd
>  43:          0          0   PCI-MSI-edge      xhci_hcd
> NMI:          0          0   Non-maskable interrupts
> LOC:      29559      25129   Local timer interrupts
> SPU:          0          0   Spurious interrupts
> PMI:          0          0   Performance monitoring interrupts
> IWI:          0          0   IRQ work interrupts
> RTR:          0          0   APIC ICR read retries
> RES:         20          0   Rescheduling interrupts
> CAL:          0          8   Function call interrupts
> TLB:          9          5   TLB shootdowns
> ERR:         74
> MIS:          0
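
For anyone who wants to repeat the check discussed above on their own board (are there still IO-APIC-fasteoi lines once MSI is enabled?), here is a minimal sketch. It assumes the /proc/interrupts layout shown in this thread; newer kernels print the chip and flow-type column differently, so the match would need adjusting. The function name fasteoi_irqs is made up for illustration.

```python
def fasteoi_irqs(text):
    """Return {irq_number: [device names]} for IO-APIC-fasteoi lines.

    Assumes the column layout quoted in this thread:
    "<irq>: <count per CPU>... IO-APIC-fasteoi <dev>[, <dev>...]"
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header line ("CPU0 CPU1") or empty line
        label = fields[0][:-1]
        if not label.isdigit():
            continue  # summary rows such as NMI, LOC, ERR
        if "IO-APIC-fasteoi" not in fields:
            continue  # edge-triggered or MSI interrupt
        idx = fields.index("IO-APIC-fasteoi")
        devices = " ".join(fields[idx + 1:])
        result[int(label)] = [d.strip() for d in devices.split(",") if d.strip()]
    return result


if __name__ == "__main__":
    # Read the live table on the target and list the level-triggered lines.
    with open("/proc/interrupts") as f:
        for irq, devs in sorted(fasteoi_irqs(f.read()).items()):
            print(irq, ", ".join(devs))
```

Any device that shows up in this list is still routed through the IO-APIC, so it would pay the mask/ack cost discussed earlier in the thread.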

Unless you are short on CPU resources: isolcpus=1. At least bind all
Linux IRQs to one CPU. That's independent of any potential low-level
optimizations.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai
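
As an illustration of the "bind all Linux IRQs to one CPU" suggestion, here is a rough sketch that writes a CPU mask to every /proc/irq/<N>/smp_affinity file. It is only one way to do it and assumes a machine with at most 32 CPUs (a single hex word). The helper names cpu_mask and bind_all_irqs are hypothetical, not part of Xenomai or the kernel. isolcpus=1 itself is a kernel boot parameter and has to be set on the kernel command line separately.

```python
import glob
import os


def cpu_mask(cpus):
    """Build the hex affinity mask for a list of CPU numbers, e.g. [0] -> "1"."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def bind_all_irqs(cpus, proc_irq="/proc/irq", dry_run=True):
    """Point every IRQ's smp_affinity at the given CPUs.

    Returns the list of (path, mask) pairs it would write (or tried to write).
    With dry_run=True nothing is modified.
    """
    mask = cpu_mask(cpus)
    actions = []
    pattern = os.path.join(proc_irq, "[0-9]*", "smp_affinity")
    for path in sorted(glob.glob(pattern)):
        actions.append((path, mask))
        if dry_run:
            continue
        try:
            with open(path, "w") as f:
                f.write(mask)
        except OSError:
            # The kernel can refuse the change for some IRQs; skip those.
            pass
    return actions


if __name__ == "__main__":
    # Keep Linux IRQs on CPU0 so that CPU1 (isolcpus=1) is left to the RT load.
    for path, mask in bind_all_irqs([0], dry_run=True):
        print(path, "<-", mask)
```

Run it with dry_run=True first to see what would change; the actual write needs root.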