On 09/17/2012 03:54 PM, Jan Kiszka wrote:
> On 2012-09-17 15:46, Gilles Chanteperdrix wrote:
>> On 09/17/2012 02:27 PM, Jan Kiszka wrote:
>>> On 2012-09-17 14:15, Henri Roosen wrote:
>>>> On Mon, Sep 17, 2012 at 1:14 PM, Gilles Chanteperdrix
>>>> <gilles.chanteperd...@xenomai.org> wrote:
>>>>> On 09/17/2012 12:39 PM, Henri Roosen wrote:
>>>>>
>>>>>> On Mon, Sep 17, 2012 at 12:00 PM, Gilles Chanteperdrix
>>>>>> <gilles.chanteperd...@xenomai.org> wrote:
>>>>>>> On 09/17/2012 11:42 AM, Jan Kiszka wrote:
>>>>>>>> On 2012-09-17 11:29, Gilles Chanteperdrix wrote:
>>>>>>>>> On 09/17/2012 11:07 AM, Jan Kiszka wrote:
>>>>>>>>>> On 2012-09-17 10:32, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 09/17/2012 10:18 AM, Jan Kiszka wrote:
>>>>>>>>>>>> On 2012-09-17 10:07, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>> On 09/17/2012 09:43 AM, Jan Kiszka wrote:
>>>>>>>>>>>>>> On 2012-09-17 08:30, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> looking at x86 latencies, I found that what was taking long on my
>>>>>>>>>>>>>>> atom was masking the fasteoi interrupts at IO-APIC level. So, I
>>>>>>>>>>>>>>> experimented with an idea: masking at LAPIC level instead of
>>>>>>>>>>>>>>> IO-APIC, by using the "task priority" register. This seems to
>>>>>>>>>>>>>>> improve latencies on my atom:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://sisyphus.hd.free.fr/~gilles/core-3.4-latencies/atom.png
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This implies splitting the LAPIC vectors into a high-priority
>>>>>>>>>>>>>>> and a low-priority set. The final implementation would use
>>>>>>>>>>>>>>> ipipe_enable_irqdesc to detect a high-priority domain, and
>>>>>>>>>>>>>>> change the vector at that time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This also improves the latencies on my old PIII with a VIA
>>>>>>>>>>>>>>> chipset, but it generates spurious interrupts. I do not know if
>>>>>>>>>>>>>>> that really matters, as handling a spurious interrupt is still
>>>>>>>>>>>>>>> faster than masking an IO-APIC interrupt; the spurious
>>>>>>>>>>>>>>> interrupts in that case are a documented behaviour of the LAPIC.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any interest in pursuing this idea, or are x86
>>>>>>>>>>>>>>> machines with a slow IO-APIC the exception rather than the
>>>>>>>>>>>>>>> rule, or does having to split the vector space appear too
>>>>>>>>>>>>>>> great a restriction?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Line-based interrupts are legacy, of decreasing relevance for PCI
>>>>>>>>>>>>>> devices - likely what we are primarily interested in here - due
>>>>>>>>>>>>>> to MSI.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even if I enable MSI, the kernel still uses these irqs for the
>>>>>>>>>>>>> peripherals integrated to the chipset, such as the USB HCI, or ATA
>>>>>>>>>>>>> driver (IOW, non PCI devices).
>>>>>>>>>>>>
>>>>>>>>>>>> Those are all PCI as well. And modern chipsets include variants
>>>>>>>>>>>> of them with MSI(-X) support.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> atom login: root
>>>>>>>>>>>>> # cat /proc/interrupts
>>>>>>>>>>>>>            CPU0       CPU1
>>>>>>>>>>>>>   0:         41          0   IO-APIC-edge      timer
>>>>>>>>>>>>>   4:         39          0   IO-APIC-edge      serial
>>>>>>>>>>>>>   9:          0          0   IO-APIC-fasteoi   acpi
>>>>>>>>>>>>>  14:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>>>>  15:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>>>>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>>>>>>>>>>>>>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>>>>>>>>>>>>>  19:          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>>>>>>>>>>>>>  23:       6598          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>>>>>>>>>>>>>  43:       2704          0   PCI-MSI-edge      eth0
>>>>>>>>>>>>>  44:        249          0   PCI-MSI-edge      snd_hda_intel
>>>>>>>>>>>>> NMI:          0          0   Non-maskable interrupts
>>>>>>>>>>>>> LOC:        661        644   Local timer interrupts
>>>>>>>>>>>>> SPU:          0          0   Spurious interrupts
>>>>>>>>>>>>> PMI:          0          0   Performance monitoring interrupts
>>>>>>>>>>>>> IWI:          0          0   IRQ work interrupts
>>>>>>>>>>>>> RTR:          0          0   APIC ICR read retries
>>>>>>>>>>>>> RES:       1582       2225   Rescheduling interrupts
>>>>>>>>>>>>> CAL:         26         48   Function call interrupts
>>>>>>>>>>>>> TLB:         10         19   TLB shootdowns
>>>>>>>>>>>>> ERR:          0
>>>>>>>>>>>>> MIS:          0
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do not think peripherals integrated to chipsets can really be
>>>>>>>>>>>>> considered "legacy". And they tend to be used in the field...
>>>>>>>>>>>>
>>>>>>>>>>>> The good news is that, even on your low-end atom, you can avoid
>>>>>>>>>>>> those latencies by CPU assignment, i.e. isolating the Linux IRQ
>>>>>>>>>>>> load on one core and the RT on the other. That's getting easier
>>>>>>>>>>>> and easier due to the inflation of cores.
>>>>>>>>>>>
>>>>>>>>>>> What if you want to use RTUSB for instance?
>>>>>>>>>>
>>>>>>>>>> Then I will likely not worry about a few micros of additional latency
>>>>>>>>>> due to IO-APIC accesses.
>>>>>>>>>
>>>>>>>>> On my atom, taking an IO-APIC fasteoi interrupt, acking and masking
>>>>>>>>> it, takes 10us in UP, and 20us in SMP (with the tracer on).
>>>>>>>>
>>>>>>>> ...and on more appropriate chipsets? I bet the Atom is (once again) off
>>>>>>>> here.
>>>>>>>
>>>>>>> I do not know, would you care to share your traces with us? I only run
>>>>>>> Xenomai on atom (which I am not sure does not qualify as "modern", as
>>>>>>> new atoms still seem to be produced), geode (ok, this one is definitely
>>>>>>> dead, but there seem to be people still running xenomai on them), and
>>>>>>> an old pentium III with an old VIA686 chipset, where masking the
>>>>>>> IO-APIC is even slower than acking the i8259.
>>>>>>>
>>>>>>> Anyway, the IO-APIC register accesses do not look designed for speed:
>>>>>>> the IO-APIC has an indirect scheme that seems designed rather to save
>>>>>>> space in the processor mapping, and to be configured once and for all
>>>>>>> when enabling/disabling an interrupt, not at each and every interrupt.
>>>>>>>
>>>>>>> The point is: people may want to use Xenomai on atoms. We do not really
>>>>>>> know on what kind of x86 people run xenomai; knowing that would help us
>>>>>>> direct our efforts.
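
As an aside, the indirect scheme mentioned above can be illustrated with a
toy model. This is only a sketch, not kernel or I-pipe code: the class and
function names are made up for illustration, and the only hardware facts
assumed are the usual IO-APIC layout (an index register at offset 0x00, a
data window at offset 0x10, redirection-table entries starting at register
index 0x10, mask bit 16 in the low dword of an entry). It counts the
simulated MMIO accesses needed to flip one mask bit.

```python
# Toy model of the IO-APIC indirect register scheme (illustration only).
IOREGSEL = 0x00      # index register offset in the MMIO window
IOWIN = 0x10         # data window offset in the MMIO window
REDTBL_BASE = 0x10   # register index of the first redirection-table dword
MASK_BIT = 1 << 16   # mask bit in the low dword of a redirection entry


class ToyIoApic:
    """Hypothetical stand-in for the device; counts simulated accesses."""

    def __init__(self, pins=24):
        # Two 32-bit registers (low, high) per redirection-table entry.
        self.regs = {REDTBL_BASE + i: 0 for i in range(2 * pins)}
        self.select = 0
        self.accesses = 0

    def mmio_write(self, offset, value):
        self.accesses += 1
        if offset == IOREGSEL:
            self.select = value & 0xFF
        elif offset == IOWIN:
            self.regs[self.select] = value & 0xFFFFFFFF

    def mmio_read(self, offset):
        self.accesses += 1
        if offset == IOREGSEL:
            return self.select
        return self.regs[self.select]


def set_pin_mask(apic, pin, masked):
    """Read-modify-write of one entry: select, read window, write window."""
    apic.mmio_write(IOREGSEL, REDTBL_BASE + 2 * pin)
    entry = apic.mmio_read(IOWIN)
    entry = (entry | MASK_BIT) if masked else (entry & ~MASK_BIT)
    apic.mmio_write(IOWIN, entry)
```

In this model, masking and later unmasking one pin costs six uncached
accesses per interrupt. A real driver may cache the entry and save the read,
but the index-then-data round trip remains.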
>>>>>>
>>>>>> We are currently investigating whether we can use Atoms for our
>>>>>> future products. We have to stick to the x86 architecture and our
>>>>>> products should work without big cooling fans. Currently running tests
>>>>>> on Atom D2700 (which I know is EOL, but for research purposes should
>>>>>> give us a good indication).
>>>>>>
>>>>>> A 20us latency gain is a lot and would be very welcome in our system!
>>>>>
>>>>>
>>>>> If you enable CONFIG_MSI, do you still see some IO-APIC-fasteoi in
>>>>> /proc/interrupts?
>>>>>
>>>>
>>>> The kernel config has no CONFIG_MSI, but instead:
>>>> CONFIG_ARCH_SUPPORTS_MSI=y
>>>> CONFIG_PCI_MSI=y
>>>>
>>>> There is still IO-APIC-fasteoi in /proc/interrupts:
>>>>
>>>> # cat /proc/interrupts
>>>>            CPU0       CPU1
>>>>   0:        250          0   IO-APIC-edge      timer
>>>>   4:         71          0   IO-APIC-edge      serial
>>>>   7:         29          0   IO-APIC-edge
>>>>   8:          0          0   IO-APIC-edge      rtc0
>>>>   9:          0          0   IO-APIC-fasteoi   acpi
>>>>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>>>>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>>>>  19:         41          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>>>>  23:       5440          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>>>>  40:        940          0   PCI-MSI-edge      eth0
>>>>  41:         21          0   PCI-MSI-edge      xhci_hcd
>>>>  42:          0          0   PCI-MSI-edge      xhci_hcd
>>>>  43:          0          0   PCI-MSI-edge      xhci_hcd
>>>> NMI:          0          0   Non-maskable interrupts
>>>> LOC:      29559      25129   Local timer interrupts
>>>> SPU:          0          0   Spurious interrupts
>>>> PMI:          0          0   Performance monitoring interrupts
>>>> IWI:          0          0   IRQ work interrupts
>>>> RTR:          0          0   APIC ICR read retries
>>>> RES:         20          0   Rescheduling interrupts
>>>> CAL:          0          8   Function call interrupts
>>>> TLB:          9          5   TLB shootdowns
>>>> ERR:         74
>>>> MIS:          0
>>>
>>> Unless you are short on CPU resources: isolcpus=1. At least bind all
>>> Linux IRQs to one CPU. That's independent of any potential low-level
>>> optimizations.
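
For reference, binding Linux IRQs to one CPU as suggested above is usually
done by writing a hexadecimal CPU bitmask to /proc/irq/<n>/smp_affinity. The
helpers below are a minimal sketch with hypothetical names. They assume
fewer than 32 CPUs (larger masks are written as comma-separated 32-bit
groups, which this sketch does not handle), and the write needs root and may
be refused for some IRQs.

```python
def affinity_mask(cpus):
    """Hex bitmask string for a set of CPU numbers (bit i set = CPU i)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def bind_irq(irq, cpus, proc_root="/proc"):
    """Write the mask for one IRQ and return the path that was written.

    proc_root is a parameter only so that the function can be tried
    against a scratch directory instead of the real /proc.
    """
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(affinity_mask(cpus))
    return path
```

With the two-core listing above, bind_irq(23, [0]) would keep the
ehci_hcd/uhci_hcd line on CPU0, leaving CPU1 to the real-time load.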
>>
>> The advantage of the masking at LAPIC using elevated priority I propose
>> is that for most APICs, the IO-APIC will forward the interrupts to the
>> cpus not currently running with elevated priority (that is what the
>> dest_LowestPrio constant means). Dynamically.
> 
> And the advantage of isolcpus is that it avoids any kind of disturbances
> due to dynamics, thus provides the best latency.

And the worst scalability. But I agree that on machines where the cache is
not shared between cores (which I believe is not the case for the atom),
not sending the irq to the same core every time is detrimental to
non-real-time performance.

> 
> Also, I'm not sure what effort will be required to handle cases where
> Linux or some RT driver decides to keep an IRQ masked for a longer
> period. That would block everything below that level and is surely not
> what we want.

It is only the masking done by the I-pipe (aka
desc->ipipe_ack/desc->ipipe_end) which uses this method. The real masking
is still done at IO-APIC level.
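
To make the priority-based masking concrete, here is a toy model of how the
LAPIC task priority register filters vectors. It is only a sketch under
stated assumptions, not the I-pipe implementation: the function names and
the LINUX_CLASS_MAX split are hypothetical, and in-service interrupts are
ignored. The one hardware rule it relies on is that the priority class of a
vector is its upper four bits, and that a vector is only dispatched when its
class is strictly above the class held in TPR bits 7:4.

```python
# Toy model of LAPIC masking through the task priority register (TPR).
# Hypothetical split: Linux IRQs use classes up to LINUX_CLASS_MAX,
# real-time IRQs use the classes above it.
LINUX_CLASS_MAX = 0xB


def priority_class(vector):
    """Priority class of an interrupt vector: its upper 4 bits."""
    return (vector >> 4) & 0xF


def tpr_for_class(cls):
    """TPR value that holds back every vector whose class is <= cls."""
    return (cls & 0xF) << 4


def is_deliverable(vector, tpr):
    """True if the vector's class is strictly above TPR[7:4].

    Simplification: the in-service register is not modelled.
    """
    return priority_class(vector) > ((tpr >> 4) & 0xF)
```

For example, with tpr = tpr_for_class(LINUX_CLASS_MAX), a low-class vector
such as 0x51 is held back while a high-class vector such as 0xE1 still gets
through. Writing 0 to the TPR lets both through again, without touching the
IO-APIC.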

-- 
                                            Gilles.

_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai
