Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>> ...
>>>>> fast-tsc-to-ns-v2.patch
>>>>>
>>>>>     [Rebased, improved rounding of least significant digit]
>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>> Instead, I'm now addressing the ugly interval printing via
>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>> nanos. -v3 incorporating this has just been uploaded.
>>> Hi,
>>>
>>> I had a look at the fast-tsc-to-ns implementation, here is how I would
>>> rewrite it:
>>>
>>> static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>                                      const unsigned d_in,
>>>                                      unsigned *m_out,
>>>                                      unsigned *s_out)
>>> {
>>> 	unsigned long long mult;
>>>
>>> 	/* Pick the largest shift with which the scaled multiplier
>>> 	   m_in * 2^s_out / d_in still fits in 31 bits. */
>>> 	*s_out = 31;
>>> 	while (1) {
>>>             mult = ((unsigned long long)m_in) << *s_out;
>>>             do_div(mult, d_in);
>>>             if (mult <= INT_MAX)
>>>                     break;
>>>             (*s_out)--;
>>>     }
>>>     *m_out = (unsigned)mult;
>>> }
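
To spell out the runtime side for the archives: the (m_out, s_out) pair
approximates m_in/d_in, so converting a tsc count to nanoseconds boils
down to one multiply plus one shift. A stand-alone sketch with a made-up
500 MHz clock (the numbers are illustration only, nothing from the
patch):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	/* tsc-to-ns ratio of a hypothetical 500 MHz clock:
	   ns = tsc * 1000000000 / 500000000. */
	const unsigned m_in = 1000000000, d_in = 500000000;
	unsigned long long mult, tsc, ns;
	unsigned s;

	/* Same search as xnarch_init_llmulshft(), open-coded. */
	for (s = 31; ; s--) {
		mult = ((unsigned long long)m_in << s) / d_in;
		if (mult <= INT_MAX)
			break;
	}

	/* Runtime conversion: one multiply, one shift. The product
	   can exceed 64 bits for large tsc values, which is why
	   gilles_llmulshft() below goes through a 96-bit
	   intermediate. */
	tsc = 123456789ULL;
	ns = (tsc * mult) >> s;

	printf("mult=%llu shift=%u -> %llu ns (exact %llu)\n",
	       mult, s, ns, tsc * m_in / d_in);
	return 0;
}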
>>>
>>> /* Non x86: shift the 96-bit quantity h:m:l right by s (1 <= s <= 31)
>>>    and return the low 64 bits of the result. */
>>> #define __rthal_u96shift(h, m, l, s) ({		\
>>> 	unsigned _l = (l);			\
>>> 	unsigned _m = (m);			\
>>> 	unsigned _s = (s);			\
>>> 	_l >>= _s;				\
>>> 	_l |= (_m << (32 - _s));		\
>>> 	_m >>= _s;				\
>>> 	_m |= ((h) << (32 - _s));		\
>>> 	__rthal_u64fromu32(_m, _l);		\
>>> })
>>>
>>> /* x86: same 96-bit right shift; shrdl shifts bits from the second
>>>    operand into the top of the first. */
>>> #define __rthal_u96shift(h, m, l, s) ({		\
>>> 	unsigned _l = (l);			\
>>> 	unsigned _m = (m);			\
>>> 	unsigned _s = (s);			\
>>> 	/* _l = (_l >> _s) | (_m << (32 - _s)) */	\
>>> 	asm ("shrdl\t%%cl,%1,%0"		\
>>> 	     : "+r,?m"(_l)			\
>>> 	     : "r,r"(_m), "c,c"(_s));		\
>>> 	/* _m = (_m >> _s) | ((h) << (32 - _s)) */	\
>>> 	asm ("shrdl\t%%cl,%1,%0"		\
>>> 	     : "+r,?m"(_m)			\
>>> 	     : "r,r"(h), "c,c"(_s));		\
>>> 	__rthal_u64fromu32(_m, _l);		\
>>> })
>>>
>>> static inline long long rthal_llmi(int i, int j)
>>> {
>>>        /* Signed fast 32x32->64 multiplication */
>>>     return (long long) i * j;
>>> }
>>>
>>> static inline long long gilles_llmulshft(const long long op,
>>> 					 const unsigned m,
>>> 					 const unsigned s)
>>> {
>>> 	unsigned oph, opl, tlh, tll, thh, thl;
>>> 	unsigned long long th, tl;
>>>
>>> 	/* 64x32->96 multiply, split into two 32x32->64 partial
>>> 	   products. */
>>> 	__rthal_u64tou32(op, oph, opl);
>>> 	tl = rthal_ullmul(opl, m);
>>> 	__rthal_u64tou32(tl, tlh, tll);
>>> 	th = rthal_llmi(oph, m);
>>> 	th += tlh;
>>> 	__rthal_u64tou32(th, thh, thl);
>>>
>>> 	return __rthal_u96shift(thh, thl, tll, s);
>>> }
>>>
>>>
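
For anyone wanting to try Gilles' snippet stand-alone: it leans on a
few rthal helpers from the arch headers. Plausible user-land stand-ins
(my reconstruction, not the tree's actual definitions; do_div() is the
usual kernel 64/32 division helper) would be:

#define __rthal_u64tou32(ull, h, l)			\
	do {						\
		(h) = (unsigned)((ull) >> 32);		\
		(l) = (unsigned)(ull);			\
	} while (0)

#define __rthal_u64fromu32(h, l)			\
	(((unsigned long long)(h) << 32) | (unsigned)(l))

/* Unsigned fast 32x32->64 multiplication. */
static inline unsigned long long rthal_ullmul(unsigned m0, unsigned m1)
{
	return (unsigned long long)m0 * m1;
}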
>>
>> Thanks for your suggestion.
>>
>> While your generic version produces comparable code, the x86 variant
>> is about twice as large as the full-assembly version. And code size
>> translates into I-cache footprint, which may have latency costs.
>>
>> [gcc 4.1, i386]
>> -O2 -mregparm=3 -fomit-frame-pointer:
>>     63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -Os -mregparm=3 -fomit-frame-pointer:
>>     63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -O2:
>>     63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -Os:
>>     63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> I'm not arguing we should turn each and every piece of Xenomai arch
>> code into pure assembly. But in this case it already happened, it's
>> less scattered source-wise, and it's more compact object-wise. So I
>> would prefer to keep it as is.
> 
> I would say the advantages of having a C version outweigh those of
> the full assembly version. C is really easier to understand and
> debug.

Personally, I prefer the clear (and commented) assembly over the nested
macros and inlines.

> 
> The differences between the two versions are some register moves, which
> cost almost nothing, especially since each operation in the assembly

Cycle-wise, you are right. But what bites us more in the worst case are
memory accesses, specifically uncached ones. In my experience, code
size matters more.

> version depends on the result of the previous operation, which means
> lots of pipeline stalls; the register moves will just feed the
> pipeline. I do not think they really matter. Look at the assembly
> produced for gilles_llmulshft on ARM, a low-end architecture where
> each instruction really costs:
> gilles_llmulshft:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         stmfd   sp!, {r4, r5, r6, r7}
>         umull   r6, r7, r0, r2
>         mov     r4, r7
>         mov     r5, #0
>         smlal   r4, r5, r2, r1
>         rsb     ip, r3, #32
>         mov     r2, r4, lsr r3
>         orr     r1, r2, r5, asl ip
>         mov     r2, r2, asl ip
>         orr     r0, r2, r6, lsr r3
>         @ lr needed for prologue
>         ldmfd   sp!, {r4, r5, r6, r7}
>         mov     pc, lr
> 
> pretty minimal, no?

OK, your version can perfectly well go into the ARM arch. But i386 is
different: fewer registers, thus easily a lot of variable shuffling...

> 
> The full assembly version has another big drawback: it is a big
> block that the optimizer cannot split, whereas with a C version, the
> optimizer can decide to interleave the surrounding code. So a C
> version will inline better.

We are not inlining that service anymore, at least not for its primary
usage, tsc-to-ns conversion. Inlining costs object size and thus
increases latency (although it saves us a few cycles).

> 
> There is one thing I do not like about llmulshft (any
> implementation): its rounding policy towards minus infinity.
> llmulshft(-1, 2/3) returns -1, whereas llimd would return 0.

See other postings: rounding of the last digit doesn't matter with
scaled math; it's already inaccurate by nature. That's also why we have
it only one-way.
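
For the record, the difference Gilles points at is easy to reproduce in
plain C; a minimal sketch (assuming gcc's arithmetic right shift for
signed values, which ISO C leaves implementation-defined):

#include <stdio.h>

int main(void)
{
	/* Approximate the ratio 2/3 as a (mult, shift) pair. */
	const unsigned shift = 31;
	const long long mult = (2LL << shift) / 3;	/* ~0.667 * 2^31 */
	long long op = -1;

	/* llmulshft-style: multiply, then arithmetic right shift.
	   The shift rounds towards minus infinity -> -1. */
	long long fast = (op * mult) >> shift;

	/* llimd-style: multiply, then divide. C division truncates
	   towards zero -> 0. */
	long long exact = op * 2 / 3;

	printf("llmulshft-like: %lld, llimd-like: %lld\n", fast, exact);
	return 0;
}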

Jan
