Having thought about this some more, I think your suggestion to have
rtapi_clocks_to_ns (and possibly rtapi_ns_to_clocks) makes sense.
Encouraging use of delta times mitigates any rollovers that may be
inherent in the ns<->clock conversions.

Computing nanosecond time from tsc suffers a discontinuity at least when
rdtsc() wraps, but now I think that the rtai implementation may have a
discontinuity much more frequently--every time (int)rdtsc() wraps.

The comment on llimd says that it
    /* Returns (long long)ll = (int)ll*(int)(mult)/(int)div. */
so the discontinuity actually happens when the TSC crosses a 2^31
(2^32?) boundary, not only when the 64-bit quantity wraps around back to
0.  
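To make the truncation concrete, here is a small sketch.  It is not the
RTAI llimd code; scale_truncated() is a hypothetical stand-in that only
mimics what the comment describes (casting to a 32-bit int before
scaling), and the 1000000/2400000 ratio assumes a 2.4 GHz CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: scales only the low 32 bits of the counter,
 * as the llimd comment suggests.  NOT the actual RTAI implementation.
 * The uint32_t -> int32_t conversion is implementation-defined in C;
 * gcc wraps it modulo 2^32. */
static int64_t scale_truncated(int64_t clocks, int32_t mult, int32_t divisor)
{
    int32_t low = (int32_t)(uint32_t)clocks;
    return (int64_t)low * mult / divisor;
}

/* Reference: the same scaling applied to the full 64-bit value.
 * Only safe for inputs small enough that clocks * mult fits in 64 bits. */
static int64_t scale_full(int64_t clocks, int32_t mult, int32_t divisor)
{
    return clocks * mult / divisor;
}

/* Crossing 2^31 is harmless for the full-width version but makes the
 * truncated version jump backwards. */
static void demo_truncation_jump(void)
{
    const int32_t mult = 1000000, divisor = 2400000;
    const int64_t before = INT64_C(0x7fffffff);
    const int64_t after  = INT64_C(0x80000000);

    assert(scale_full(after, mult, divisor) >= scale_full(before, mult, divisor));
    assert(scale_truncated(after, mult, divisor) < scale_truncated(before, mult, divisor));
}
```

With a signed 32-bit truncation the result goes negative at 2^31 and
falls back to 0 at 2^32, so either boundary produces a jump.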

Better would be a routine that takes u64 a, u32 b, and u8 s and
calculates the lower 64 bits of the arbitrary-precision
    (a * b) >> s
gcc is able to generate efficient code for this on x86 (two integer multiplies,
about 21 cycles per invocation in a tight loop on a core2 CPU).  This
algorithm should have a discontinuity only at full TSC rollover, not at
32-bit rollovers.  It's also faster by a factor of 10 or so than the
rtai implementation of today.

The code:
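//----------------------------------------------------------------------
#include <stdint.h>  /* uint64_t, uint32_t, uint8_t, UINT32_C */
#include <math.h>    /* round(); needed only by get_scale_factor in a userspace build */
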
static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b)
    __attribute__((always_inline));
static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b) {
    /* gcc is able to do this with a single 32x32 -> 64 multiply on x86 */
    return ((uint64_t)a) * b;
}

/**
 * Compute the lower 64 bits of '(a * b) >> s', s<=32
 * the temporary (a*b) is 96 bits, not truncated to 64 bits
 */
static inline uint64_t ullms(uint64_t a, uint32_t b, uint8_t s)
{
    uint32_t hi = (a>>32), lo = a & UINT32_C(0xffffffff);
    uint64_t mul_hi = mul_32x32_64(hi, b), mul_lo = mul_32x32_64(lo, b);
    return (mul_hi << (32-s)) + (mul_lo >> s);
}

/**
 * b = get_scale_factor(num, denom, &s):
 * Compute 'b' and 's' so that ullms(a,b,s) is approximately
 * (a * num / denom)
 *
 * When using the same num and denom repeatedly, this is much more
 * efficient than the implementation that actually performs the
 * division.  (In 2011 on x86, a single integer division is still about
 * 10x the time of a single multiplication)
 *
 * However, get_scale_factor itself is not particularly efficient (this
 * implementation uses fp arithmetic), so it should only be used to
 * compute b and s for "constant" num / denom pairs
 */
uint32_t get_scale_factor(uint32_t num, uint32_t denom, uint8_t *scale) {
    double d = (double) num / denom;
    uint8_t s = 0;
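    /* Scale d up toward 32 significant bits, but stop at s == 32:
     * ullms() shifts by (32-s), which is undefined for s > 32
     * (this would otherwise happen whenever num/denom < 0.5). */
    while(d < 2147483647 && s < 32) { d *= 2; s++; }
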
    *scale = s;
    return (uint32_t)(round(d));
}
//----------------------------------------------------------------------
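
One way to gain confidence in the split multiply is to compare it
against a full-width reference.  The sketch below assumes a compiler
that provides unsigned __int128 (a gcc/clang extension on 64-bit
targets); ullms_sketch() is a copy of ullms above, and ullms_ref() and
check_ullms() are names I made up for the test.

```c
#include <assert.h>
#include <stdint.h>

/* Copy of ullms from above: lower 64 bits of (a * b) >> s, for s <= 32. */
static uint64_t ullms_sketch(uint64_t a, uint32_t b, uint8_t s)
{
    uint32_t hi = (uint32_t)(a >> 32), lo = (uint32_t)(a & UINT32_C(0xffffffff));
    uint64_t mul_hi = (uint64_t)hi * b, mul_lo = (uint64_t)lo * b;
    return (mul_hi << (32 - s)) + (mul_lo >> s);
}

/* Reference using a 128-bit intermediate (gcc/clang extension). */
static uint64_t ullms_ref(uint64_t a, uint32_t b, uint8_t s)
{
    return (uint64_t)(((unsigned __int128)a * b) >> s);
}

/* Assert that both versions agree for one input; requires s <= 32. */
static void check_ullms(uint64_t a, uint32_t b, uint8_t s)
{
    assert(s <= 32);
    assert(ullms_sketch(a, b, s) == ullms_ref(a, b, s));
}
```

The two agree exactly because hi*b*2^32 is a multiple of 2^s whenever
s <= 32, so shifting the two partial products separately loses nothing.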

Then the rtapi code would look like this:

//----------------------------------------------------------------------
// globals to rtapi.ko
uint32_t tsc2ns_factor, ns2tsc_factor;
uint8_t tsc2ns_shift, ns2tsc_shift;

// somewhere in setup code {
    tsc2ns_factor = get_scale_factor(1000000, cpu_khz, &tsc2ns_shift);
    ns2tsc_factor = get_scale_factor(cpu_khz, 1000000, &ns2tsc_shift);
// }

uint64_t rtapi_clocks_to_ns(uint64_t clocks) {
    return ullms(clocks, tsc2ns_factor, tsc2ns_shift);
}

uint64_t rtapi_ns_to_clocks(uint64_t ns) {
    return ullms(ns, ns2tsc_factor, ns2tsc_shift);
}
//----------------------------------------------------------------------
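
As an illustration of the delta-time usage, a caller might look
something like the sketch below.  The factor and shift are hand-computed
for a hypothetical 2.4 GHz CPU (cpu_khz == 2400000) rather than taken
from get_scale_factor, and elapsed_ns() is a made-up helper name, not
part of rtapi.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a hypothetical 2.4 GHz CPU:
 * round((1000000.0 / 2400000.0) * 2^32) == 1789569707, shift 32. */
static const uint32_t demo_tsc2ns_factor = UINT32_C(1789569707);
static const uint8_t  demo_tsc2ns_shift  = 32;

/* Copy of ullms from above: lower 64 bits of (a * b) >> s, for s <= 32. */
static uint64_t ullms_sketch(uint64_t a, uint32_t b, uint8_t s)
{
    uint32_t hi = (uint32_t)(a >> 32), lo = (uint32_t)(a & UINT32_C(0xffffffff));
    uint64_t mul_hi = (uint64_t)hi * b, mul_lo = (uint64_t)lo * b;
    return (mul_hi << (32 - s)) + (mul_lo >> s);
}

/* Convert the difference of two TSC readings to nanoseconds.
 * Unsigned subtraction is modulo 2^64, so the delta stays correct even
 * if the counter wrapped once between the two readings. */
static uint64_t elapsed_ns(uint64_t tsc_start, uint64_t tsc_end)
{
    return ullms_sketch(tsc_end - tsc_start, demo_tsc2ns_factor, demo_tsc2ns_shift);
}

static void demo_elapsed(void)
{
    /* 2400 clocks at 2.4 GHz is 1000 ns, even across a 64-bit wrap. */
    assert(elapsed_ns(UINT64_MAX - 1199, 1200) == 1000);
    /* 2.4e9 clocks is one second. */
    assert(elapsed_ns(0, UINT64_C(2400000000)) == UINT64_C(1000000000));
}
```

Converting the delta instead of each absolute reading keeps the inputs
small, so any rollover in the conversion itself never comes into play.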

Jeff
