On 09/17/2012 04:35 PM, Jan Kiszka wrote:

> On 2012-09-17 16:02, Gilles Chanteperdrix wrote:
>> On 09/17/2012 03:54 PM, Jan Kiszka wrote:
>>> On 2012-09-17 15:46, Gilles Chanteperdrix wrote:
>>>> On 09/17/2012 02:27 PM, Jan Kiszka wrote:
>>>>> On 2012-09-17 14:15, Henri Roosen wrote:
>>>>>> On Mon, Sep 17, 2012 at 1:14 PM, Gilles Chanteperdrix
>>>>>> <[email protected]> wrote:
>>>>>>> On 09/17/2012 12:39 PM, Henri Roosen wrote:
>>>>>>>
>>>>>>>> On Mon, Sep 17, 2012 at 12:00 PM, Gilles Chanteperdrix
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On 09/17/2012 11:42 AM, Jan Kiszka wrote:
>>>>>>>>>> On 2012-09-17 11:29, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 09/17/2012 11:07 AM, Jan Kiszka wrote:
>>>>>>>>>>>> On 2012-09-17 10:32, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>> On 09/17/2012 10:18 AM, Jan Kiszka wrote:
>>>>>>>>>>>>>> On 2012-09-17 10:07, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>>>> On 09/17/2012 09:43 AM, Jan Kiszka wrote:
>>>>>>>>>>>>>>>> On 2012-09-17 08:30, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> looking at x86 latencies, I found that what was taking long 
>>>>>>>>>>>>>>>>> on my atom
>>>>>>>>>>>>>>>>> was masking the fasteoi interrupts at IO-APIC level. So, I
>>>>>>>>>>>>>>>>> experimented with an idea: masking at LAPIC level instead of
>>>>>>>>>>>>>>>>> IO-APIC, by using
>>>>>>>>>>>>>>>>> the "task
>>>>>>>>>>>>>>>>> priority" register. This seems to improve latencies on my 
>>>>>>>>>>>>>>>>> atom:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://sisyphus.hd.free.fr/~gilles/core-3.4-latencies/atom.png
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This implies splitting the LAPIC vectors in a high priority 
>>>>>>>>>>>>>>>>> and low
>>>>>>>>>>>>>>>>> priority sets, the final implementation would use 
>>>>>>>>>>>>>>>>> ipipe_enable_irqdesc
>>>>>>>>>>>>>>>>> to detect a high priority domain, and change the vector at 
>>>>>>>>>>>>>>>>> that time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This also improves the latencies on my old PIII with a VIA 
>>>>>>>>>>>>>>>>> chipset, but
>>>>>>>>>>>>>>>>> it generates spurious interrupts (I do not know if it really
>>>>>>>>>>>>>>>>> matters, as handling a spurious interrupt is still faster than
>>>>>>>>>>>>>>>>> masking an
>>>>>>>>>>>>>>>>> IO-APIC interrupt), the spurious interrupts in that case are a
>>>>>>>>>>>>>>>>> documented behaviour of the LAPIC.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there any interest in pursuing this idea, or are x86 with 
>>>>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>> IO-APIC the exception more than the rule, or having to split 
>>>>>>>>>>>>>>>>> the vector
>>>>>>>>>>>>>>>>> space appears too great a restriction?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Line-based interrupts are legacy, of decreasing relevance for 
>>>>>>>>>>>>>>>> PCI
>>>>>>>>>>>>>>>> devices - likely what we are primarily interested in here - 
>>>>>>>>>>>>>>>> due to MSI.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even if I enable MSI, the kernel still uses these irqs for the
>>>>>>>>>>>>>>> peripherals integrated into the chipset, such as the USB HCI, or 
>>>>>>>>>>>>>>> ATA
>>>>>>>>>>>>>>> driver (IOW, non PCI devices).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Those are all PCI as well. And modern chipsets include variants 
>>>>>>>>>>>>>> of them
>>>>>>>>>>>>>> with MSI(-X) support.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> atom login: root
>>>>>>>>>>>>>>> # cat /proc/interrupts
>>>>>>>>>>>>>>>            CPU0       CPU1
>>>>>>>>>>>>>>>   0:         41          0   IO-APIC-edge      timer
>>>>>>>>>>>>>>>   4:         39          0   IO-APIC-edge      serial
>>>>>>>>>>>>>>>   9:          0          0   IO-APIC-fasteoi   acpi
>>>>>>>>>>>>>>>  14:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>>>>>>  15:          0          0   IO-APIC-edge      ata_piix
>>>>>>>>>>>>>>>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>>>>>>>>>>>>>>>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>>>>>>>>>>>>>>>  19:          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>>>>>>>>>>>>>>>  23:       6598          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>>>>>>>>>>>>>>>  43:       2704          0   PCI-MSI-edge      eth0
>>>>>>>>>>>>>>>  44:        249          0   PCI-MSI-edge      snd_hda_intel
>>>>>>>>>>>>>>> NMI:          0          0   Non-maskable interrupts
>>>>>>>>>>>>>>> LOC:        661        644   Local timer interrupts
>>>>>>>>>>>>>>> SPU:          0          0   Spurious interrupts
>>>>>>>>>>>>>>> PMI:          0          0   Performance monitoring interrupts
>>>>>>>>>>>>>>> IWI:          0          0   IRQ work interrupts
>>>>>>>>>>>>>>> RTR:          0          0   APIC ICR read retries
>>>>>>>>>>>>>>> RES:       1582       2225   Rescheduling interrupts
>>>>>>>>>>>>>>> CAL:         26         48   Function call interrupts
>>>>>>>>>>>>>>> TLB:         10         19   TLB shootdowns
>>>>>>>>>>>>>>> ERR:          0
>>>>>>>>>>>>>>> MIS:          0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do not think peripherals integrated into chipsets can really be
>>>>>>>>>>>>>>> considered "legacy". And they tend to be used in the field...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The good news is that, even on your low-end atom, you can avoid 
>>>>>>>>>>>>>> those
>>>>>>>>>>>>>> latencies by CPU assignment, i.e. isolating the Linux IRQ load 
>>>>>>>>>>>>>> on one
>>>>>>>>>>>>>> core and the RT on the other. That's getting easier and easier 
>>>>>>>>>>>>>> due to
>>>>>>>>>>>>>> the inflation of cores.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What if you want to use RTUSB for instance?
>>>>>>>>>>>>
>>>>>>>>>>>> Then I will likely not worry about a few micros of additional 
>>>>>>>>>>>> latency
>>>>>>>>>>>> due to IO-APIC accesses.
>>>>>>>>>>>
>>>>>>>>>>> On my atom, taking an IO-APIC fasteoi interrupt, acking and masking 
>>>>>>>>>>> it,
>>>>>>>>>>> takes 10us in UP, and 20us in SMP (with the tracer on).
>>>>>>>>>>
>>>>>>>>>> ...and on more appropriate chipsets? I bet the Atom is (once again) 
>>>>>>>>>> off
>>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> I do not know, would you care to share your traces with us? I only run
>>>>>>>>> Xenomai on atom (which I am not sure does not qualify as "modern", new
>>>>>>>>> atoms still seem to be produced), geode (ok, this one is definitely
>>>>>>>>> dead, but there seem to be people still running xenomai on them), and
>>>>>>>>> an old pentium III with an old VIA686 chipset, where masking the
>>>>>>>>> IO-APIC is even slower than acking the i8259.
>>>>>>>>>
>>>>>>>>> Anyway, the IO-APIC register accesses do not look designed for speed:
>>>>>>>>> they use an indirect scheme that seems designed more to save space in
>>>>>>>>> the processor mapping, and to be configured once and for all when
>>>>>>>>> enabling/disabling an interrupt, not at each and every interrupt.
>>>>>>>>>
>>>>>>>>> The point is: people may want to use Xenomai on atoms. We do not 
>>>>>>>>> really
>>>>>>>>> know on what kind of x86 people run xenomai, knowing that would help 
>>>>>>>>> us
>>>>>>>>> directing our efforts.
>>>>>>>>
>>>>>>>> We are currently investigating whether we can use Atoms for our
>>>>>>>> future products. We have to stick to the x86 architecture and our
>>>>>>>> products should work without big cooling fans. Currently running tests
>>>>>>>> on Atom D2700 (which I know is EOL, but for research purposes should
>>>>>>>> give us a good indication).
>>>>>>>>
>>>>>>>> A 20us latency gain is a lot and would be very welcome in our system!
>>>>>>>
>>>>>>>
>>>>>>> If you enable CONFIG_MSI, do you still see some IO-APIC-fasteoi in
>>>>>>> /proc/interrupts?
>>>>>>>
>>>>>>
>>>>>> The kernel config has no CONFIG_MSI, but instead:
>>>>>> CONFIG_ARCH_SUPPORTS_MSI=y
>>>>>> CONFIG_PCI_MSI=y
>>>>>>
>>>>>> There is still IO-APIC-fasteoi in /proc/interrupts:
>>>>>>
>>>>>> # cat /proc/interrupts
>>>>>>            CPU0       CPU1
>>>>>>   0:        250          0   IO-APIC-edge      timer
>>>>>>   4:         71          0   IO-APIC-edge      serial
>>>>>>   7:         29          0   IO-APIC-edge
>>>>>>   8:          0          0   IO-APIC-edge      rtc0
>>>>>>   9:          0          0   IO-APIC-fasteoi   acpi
>>>>>>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>>>>>>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>>>>>>  19:         41          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
>>>>>>  23:       5440          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>>>>>>  40:        940          0   PCI-MSI-edge      eth0
>>>>>>  41:         21          0   PCI-MSI-edge      xhci_hcd
>>>>>>  42:          0          0   PCI-MSI-edge      xhci_hcd
>>>>>>  43:          0          0   PCI-MSI-edge      xhci_hcd
>>>>>> NMI:          0          0   Non-maskable interrupts
>>>>>> LOC:      29559      25129   Local timer interrupts
>>>>>> SPU:          0          0   Spurious interrupts
>>>>>> PMI:          0          0   Performance monitoring interrupts
>>>>>> IWI:          0          0   IRQ work interrupts
>>>>>> RTR:          0          0   APIC ICR read retries
>>>>>> RES:         20          0   Rescheduling interrupts
>>>>>> CAL:          0          8   Function call interrupts
>>>>>> TLB:          9          5   TLB shootdowns
>>>>>> ERR:         74
>>>>>> MIS:          0
>>>>>
>>>>> Unless you are short on CPU resources: isolcpus=1. At least bind all
>>>>> Linux IRQs to one CPU. That's independent of any potential low-level
>>>>> optimizations.
>>>>
>>>> The advantage of masking at LAPIC level using the elevated priority I
>>>> propose is that for most APICs, the IO-APIC will forward interrupts to
>>>> the cpus not currently running with elevated priority (that is what the
>>>> dest_LowestPrio constant means). Dynamically.
>>>
>>> And the advantage of isolcpus is that it avoids any kind of disturbances
>>> due to dynamics, thus provides the best latency.
>>
>> And the worst scalability.
> 
> There is no free lunch.
> 
>> But I agree that on machines where the cache is
>> not shared between cores (which I believe is not the case on atom), the
>> fact that the irq is not sent to the same core every time is detrimental
>> to non-real-time performance.
>>
>>>
>>> Also, I'm not sure what efforts will be required to handle cases where
>>> Linux or some RT driver decides to keep an IRQ masked for a longer
>>> period. That would block everything below that level and is surely not
>>> what we want.
>>
>> It is only the masking done by the I-pipe (aka
>> desc->ipipe_ack/desc->ipipe_end) which uses this method. The real
>> masking is still done at IO-APIC level.
> 
> Then you need to prevent the (mis-)use of XN_ISR_NOENABLE.


ipipe_end is a nop when called from the primary domain, yes, but this is
not very different from edge irqs. Also, fasteoi irqs become a bit like
MSI: in the same way as we cannot mask MSIs from the primary domain, we
should not mask IO-APIC fasteoi irqs, because the cost is too
prohibitive. If we can live with MSIs without masking them in primary
mode, I guess we can do the same with fasteoi irqs.

> 
> And when is the end executed for Linux IRQs? Do you want to migrate from
> TPR-based based masking to standard IO-APIC masking when switching to
> the Linux domain?


The end for Linux irqs is executed in handle_fasteoi_irq, as usual,
through the ->irq_release callback. This irq_release callback restores
the LAPIC priority to the low priority state. The current implementation
of the irq_hold and irq_release callbacks is:

/*
 * Writing 0x70 to the TPR masks, at LAPIC level, every vector whose
 * priority class (upper nibble) is 7 or below, i.e. the low priority
 * set; writing 0 unmasks them again.
 */
void ipipe_mute_pic(void)
{
#if 0
        int *mutedp = __this_cpu_ptr(&__ipipe_pic_muted);
        int muted = *mutedp;
        *mutedp = muted | 1;
        if (muted == 0)
                apic_write(APIC_TASKPRI, 0x70);
#else
        apic_write(APIC_TASKPRI, 0x70);
        __this_cpu_write(__ipipe_pic_muted, 1);
#endif
}

void ipipe_unmute_pic(void)
{
#if 0
        int *mutedp = __this_cpu_ptr(&__ipipe_pic_muted);
        int muted = *mutedp & ~1;
        *mutedp = muted;
        if (muted == 0)
                apic_write(APIC_TASKPRI, 0);
#else
        apic_write(APIC_TASKPRI, 0);
        __this_cpu_write(__ipipe_pic_muted, 0);
#endif
}

static void hold_ioapic_irq(struct irq_data *data)
{
#if 0
        unsigned cpu = ipipe_processor_id();
        int *mutedp = &per_cpu(__ipipe_pic_muted, cpu);
        int muted = *mutedp;
        *mutedp = muted | 2;
        if (muted == 0)
                apic_write(APIC_TASKPRI, 0x70);
#else
        if (__this_cpu_read(__ipipe_pic_muted) == 0)
                apic_write(APIC_TASKPRI, 0x70);
#endif
        ack_apic_level(data);
}

static void release_ioapic_irq(struct irq_data *data)
{
        unsigned long flags = hard_local_irq_save();
#if 0
        unsigned cpu = ipipe_processor_id();
        int *mutedp = &per_cpu(__ipipe_pic_muted, cpu);
        int muted = *mutedp & ~2;
        *mutedp = muted;
        if (muted == 0)
                apic_write(APIC_TASKPRI, 0);
#else
        if (__this_cpu_read(__ipipe_pic_muted) == 0)
                apic_write(APIC_TASKPRI, 0);
#endif
        hard_local_irq_restore(flags);
}

Both implementations work, but the one in #if 0 seems to have a higher
overhead, though it is more correct (though not completely: what we
would want is to restore the TPR only once we have handled all the
pending linux irqs, to avoid retriggering some of them).

So, that is the point: we NEVER touch the IO-APIC for masking, except
when an irq really has to stay masked.

> But that will not avoid the IO-APIC access latency,
> just reshuffle it.


The IO-APIC access latency only happens when someone masks an IO-APIC
irq, which should be rare (but I agree, maybe with network drivers we
have an issue here).

What I did was a quick test to see whether we gain something, and it
seems we do; I do not claim to have covered all the details.

-- 
                                                                Gilles.

_______________________________________________
Xenomai mailing list
[email protected]
http://www.xenomai.org/mailman/listinfo/xenomai
