On 02/06/2013 08:22 PM, Jan Kiszka wrote:
> On 2013-02-06 19:40, Gilles Chanteperdrix wrote:
>> On 02/06/2013 07:35 PM, Jan Kiszka wrote:
>>
>>> On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
>>>> On 02/06/2013 07:26 PM, Jan Kiszka wrote:
>>>>
>>>>> On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
>>>>>> On 02/06/2013 06:47 PM, Jan Kiszka wrote:
>>>>>>
>>>>>>> On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
>>>>>>>> On 02/06/2013 06:40 PM, Jan Kiszka wrote:
>>>>>>>>
>>>>>>>>> On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
>>>>>>>>>> On 02/06/2013 06:33 PM, Jan Kiszka wrote:
>>>>>>>>>>
>>>>>>>>>>> On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
>>>>>>>>>>>> On 02/06/2013 06:03 PM, Jan Kiszka wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>
>>>>>>>>>>>>> do you remember if this core-3.4 change was a performance
>>>>>>>>>>>>> optimization
>>>>>>>>>>>>> or a necessary fix? Also, I'm not yet understanding why we need
>>>>>>>>>>>>> all the
>>>>>>>>>>>>> #ifdefs except for the first one which forces fpu.preload to 0.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> It is a performance optimization: without it, we systematically
>>>>>>>>>>>> hit the maximum latency whenever the timer ticks during a context
>>>>>>>>>>>> switch that restores the FPU. Note that if you change that, you
>>>>>>>>>>>> will probably break -forge.
>>>>>>>>>>>
>>>>>>>>>>> According to the Intel folks who introduced eagerfpu, xsave, or at
>>>>>>>>>>> least xsaveopt (which I haven't implemented yet), is now faster than
>>>>>>>>>>> serializing clts/stts. On the other hand, the worst case is a full
>>>>>>>>>>> SSE + AVX restore while the target RT task does not depend on the
>>>>>>>>>>> FPU.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Without xsave, we never restore the FPU if the RT task never used
>>>>>>>>>> it. Does this change with xsave?
>>>>>>>>>
>>>>>>>>> This would change with eagerfpu which depends on xsave. The kernel
>>>>>>>>> sticks with lazy switching in the absence of xsaveopt.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure you understand what I mean, so I am going to
>>>>>>>> reformulate. Without xsave, Linux uses lazy FPU restore, and Xenomai
>>>>>>>> uses eager FPU restore. But Xenomai's eager FPU restore is a nop if
>>>>>>>> the RT task has never used the FPU since its inception (and none of
>>>>>>>> the parents from which it was cloned ever used the FPU either). Does
>>>>>>>> Linux eager switching mean the same thing?
>>>>>>>
>>>>>>> eagerfpu means: always call xsaveopt/xrstor; they optimize the case
>>>>>>> where the FPU was unused by the source/destination. And no fiddling
>>>>>>> with TS anymore, at any time.
>>>>>>
>>>>>>
>>>>>> I still do not understand this sentence then: "the worst case is a full
>>>>>> SSE + AVX restore while the target RT task is not depending on the FPU."
>>>>>> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
>>>>>> restore SSE and AVX context?
>>>>>
>>>>> Switching between two tasks that both use the full state space defines
>>>>> the maximum latency of the FPU save/restore step. We cannot interrupt
>>>>> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>>>>>
>>>>> What we can do, though, is to ensure that we have at least a preemption
>>>>> point between both. Do we have such a thing so far, a chance to handle a
>>>>> Xenomai IRQ between the FPU save for Linux task A and the FPU restore
>>>>> for the following task B? If not, the discussion is moot and we are just
>>>>> shifting probabilities of the very same worst case.
>>>>
>>>>
>>>> We can implement unlocked context switch support on x86 as we do on
>>>> other platforms. I tried that on Atom, actually, and it did not really
>>>> improve latencies. You did not answer my question, though: why would
>>>> xsave/xrstor do anything if the RT thread has not used the FPU (and none
>>>> of its parents have used the FPU)?
>>>
>>> We first of all would have to wait for the unrelated switch between
>>> those two Linux tasks before we could handle the IRQ and switch to the
>>> FPU-free RT task. __switch_to is atomic, also for Linux->Linux, no?
>>
>>
>> Only the *IP and *SP switch needs to be atomic; the whole __switch_to can
>> be split into several atomic sections, which is what I tested on Atom. But
>> as I said, it did not lead to any latency improvement.
>
> Ok, so back to the patch about which this discussion started: It
> enforced that Linux only saves the FPU state on switches, never directly
> restores it but enforces lazy restoring, right? To ensure that
> save+restore for Linux tasks is always interruptible in the middle.
> However, that sounds pretty expensive when applying FPU/SSE/etc. load on
> Linux.
On the contrary: the overhead is the cost of the fault (with the
user/kernel and kernel/user switches), so the larger the FPU context,
the smaller the overhead in proportion.
>
> Instead of always doing stts for the new task, we could do the restore
> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
> should allow the eager model for Linux as well without making
> save+restore of Linux-Linux switches atomic.
That could be done, but it is probably simpler to implement unlocked
context switches and split __switch_to into several atomic sections.
Anyway, any change in this area will probably break the work done for
kthreads on -forge, so can't we postpone this?
--
Gilles.
_______________________________________________
Xenomai mailing list
[email protected]
http://www.xenomai.org/mailman/listinfo/xenomai