On Tue, 15 Mar 2011, Jung-uk Kim wrote:

On Monday 14 March 2011 10:31 pm, Bruce Evans wrote:
On Mon, 14 Mar 2011, Jung-uk Kim wrote:
Log:
 When TSC is unavailable, broken or disabled and the current
timecounter has better quality than i8254 timer, use it for
DELAY(9).

You cannot use a random timecounter for DELAY().  DELAY() is
carefully written to not use any locks, so that it doesn't deadlock
when called in the "any" context from a bad console driver like
syscons when syscons is called from ddb.  (Bad console drivers may
have their own locks which deadlock in ddb, but that is another
problem.)  Even the i8254 timer, which is the only timer that
should be used by DELAY(), normally use locks that used to cause
deadlock, but DELAY() was unbroken to bypass the locks when it is
called from ddb.

I disagree.  I think we should keep away from i8254 as much as
possible.


It is adequate for DELAY(), and is the only timer that is available on
all x86.  You need large complications in DELAY() to make it work as
well as the old code using the i8254, for no gain until there is an
x86 without an i8254.

Actually, i8254 is the only timer that requires locking in
the first place because of its braindead design.  All other x86
timers do not require any locking at all.  We even sacrificed upper
32 bits of HPET in favor of i386. :-(

Did we?  Timecounters use only 32 bits as an optimization.  Scaling of
64-bit values is slow even on 64-bit machines since it needs 128-bit
intermediate values for the multiplication/shift method to work.

Cpufreq and other calibration code should use a normal timecounter
via nanouptime() in all cases and never go near low level
timecounters or DELAY().  There are (hopefully minor) problems
getting the nanotime() initialized before it used.  The dummy
timecounter is just what you don't want to use.

That's exactly the problem.

Your fix won't make much difference then.  DELAY() will start with a
dummy timecounter and probably also a zero tsc_freq, so it will use
the i8254 for various things, probably including initial calibration
of the TSC/cpufreq.  Later it may see a better timecounter and use
that to get a better calibration of the TSC/cpufreq.  But its caller
can just as easily check for a better timecounter and not use DELAY()
if it can use a timecounter, without messing with DELAY()'s internals
and having to guard against reentrancy from ddb.  delay_timecounter()
already has all the code for this, except the check for a better
timecounter is external.

Even in this patch, it isn't clear that the low level timecounters
are initialized before they are used.

The timecounter is always initialized first, then set.

It is still unclear whether they are initialized enough before they
are used since their setting is neither locked nor atomic.  Consider
the TSC initialization.  This is init_TSC() from startrtclock() and
init_TSC_tc() from cpu_initclocks().  The first should make the low-
level timecounter useable, but this is unclear.  The second attaches
the low-level timecounter to the high-level timecounter data structures.
There is no locking for this except possibly Giant.  tsc_freq is
initialized early, so the tests of it will succeed long before the TSC
can become the high-level timecounter.  It is probably usable at a low
level, but this is unclear.  tsc_freq is 64 bits, and the accesses to
it are non-atomic, so especially on 32-bit machines there are races
accessing it.  In fact, I can now see a reproducible bug in DELAY():
- find an x86 that can run at 4GHz
- run it in 386 mode
- start it so that tsc_freq is 0x10000000ULL
- use the machdep.tsc_freq sysctl to tune tsc_freq to 0xFFFFFFFF
- trace through this sysctl using gdb, and using a low quality console
  driver like syscons which calls DELAY().
- the 2 words in tsc_freq will normally be written from the low word
  up.  It will be seen to have value 0x1FFFFFFFFULL after only the first
  word is written.  Not too bad -- just of by a factor of 2.
- I can't quite see how to make it have a really broken value.  At first
  I tried and example with it starting with a value if 0xFFFFFFFF and
  changing it to 0x100000000ULL.  Then after changing only its low word,
  it is 0 so DELAY() will not use the TSC.
Related non-reproducible bugs are also easy to see:
- on one CPU, spin calling the sysctl to toggly tsc_freq between 4G-1 and
  4G
- on another CPU, spin doing something that calls DELAY().  Since there
  is no locking or atomic accesses for the writes to tsc_freq in the sysctl
  or for the reads of it in DELAY(), not to mention elsewhwere, tsc_freq
  can read as zero when it is being changed between 2 nonzero values.  Not
  too bad if it reads as zero -- then it won't be used.  But uses of it
  following it being read as nonzero may find it with changed values that
  are very bad, for example zero.
These bugs are in the old code in DELAY() that uses the TSC, and perhaps
in all code of the form "if (tsc_freq != 0) use(<tsc>)".  Fixes for them
should probably use a generation count.

init_TSC_tc() finishes with tc_init(&tsc_timecounter) which does most of
the work for attaching to the higher-level timecounter structures.
tc_init() has no locking or atomic operations, which I think gives it
races in some contexts, but not in ones involving single stepping it
using ddb, since then there are no other CPUs to race with.  The
races are the potentially-unordered writes of all the variables set
by tc_init().

Modified: head/sys/x86/isa/clock.c
=================================================================
============= --- head/sys/x86/isa/clock.c      Mon Mar 14 19:31:43
2011    (r219645) +++ head/sys/x86/isa/clock.c  Mon Mar 14 22:05:59
2011    (r219646) @@ -245,6 +245,42 @@ getit(void)
        return ((high << 8) | low);
}

+static __inline void
+delay_tsc(int n)
+{
+       uint64_t start, end, now;
+
+       sched_pin();
+       start = rdtsc();
+       end = start + (tsc_freq * n) / 1000000;
+       do {
+               cpu_spinwait();
+               now = rdtsc();
+       } while (now < end || (now > start && end < start));
+       sched_unpin();
+}

You cannot call random scheduling code from DELAY(), since the
scheduling code is not designed to be called in the "any" context.
As it happens, sched_pin() and sched_unpin() are safe, and were
already used in the TSC case.

Actually, I really like to get rid of this code.  It isn't safe for
many reasons, e.g., its frequency may change while in the loop and
end result is unpredictable if TSC is not invariant.  We may limit
its use for invariant case, though.

Me too.  It can also be used of the TSC is the timecounter, since it it
is good enough for a timecounter than it is good enough for this.  But
it is a lot of work for no benefit to make all cases work.

+static __inline void
+delay_timecounter(struct timecounter *tc, int n)
+{
+       uint64_t end, now;
+       u_int last, mask, u;
+
+       mask = tc->tc_counter_mask;
+       last = tc->tc_get_timecount(tc) & mask;
+       end = tc->tc_frequency * n / 1000000;

This depends on the delicate timecounter locking to be safe.  I
think it it is safe except possibly in tc_get_timecount(), for the
same reasons that nanouptime() can run safely with no visible
locking.  tc must be the current timecounter/timehands, which is
guaranteed to not be in an in-between state or become so while we
are using it, despite there being no visible locking.  Actually
there are some minor races:
- in nanouptime(), the timecounter generation is checked to avoid
using a new generation if necessary, but updates of the generation
count and the fileds that it protects are not properly atomic.
- here, the check of the generation is missing.  So this works
without races when called from ddb (since ddb stops other CPUs),
but has minor races in general.

Window of the races is very narrow here and it only happens on i386
where reading uint64_t is not atomic if my understanding is correct.

I can easily make race windows very wide by single stepping them, and
usually do for testing code like this :-).

It cannot be any worse than tsc_freq anyway. ;-)

Yes, but it extends old mistakes.

Optimizing DELAY() was bogus.  Now the loop for this is
duplicatied.

"Bogus" optimization is not exactly duplicated here if you meant "(now
< end || (now > start && end < start))" is bogus.  Its bogusness
comes from the fact that TSC is 64-bit.  However, timecounter is
effectively 32-bit because of its mask.  Now overflow is carefully
accounted for and it is (almost) correct.

I meant not using the i8254.  Even DELAY()'s API limits it to 1 uS
resolution.  Who cares if the hardware read extends this to 5 uS?

Duplication can be avoided by replacing the rdstc() by conditional
code like (using_tsc ? rdtsc() : (tc->tc_get_timecount(tc) & mask)
in the loop.  We have a 1 uS resolution and a pause in the loop,
so we don't care that the loop takes a little longer due to the
conditional in it.

My userland code for this does a full sysctl to read an emulated
timecounter for this.  Its loop is:

%       binuptime(&start_bt);
%       start_tsc = rdtsc();
%       do {
%               binuptime(&sample_bt);
%               tsc = rdtsc();
%               binuptime(&bt);
%               bintime_sub(&bt, &sample_bt);
%               if (bintimecmp(&bt, &best_bt, <)) {
%                       best_bt = bt;
%                       best_sample_bt = sample_bt;
%                       best_tsc = tsc;
%               }
%               bt = sample_bt;
%               bintime_sub(&bt, &start_bt);
%       } while (++nsamples < min_nsamples ||
%           bintimecmp(&best_bt, min_error_btp, >));

binuptime() is emulated so that it has the same API as the kernel.  It
is actually clock_gettime(CLOCK_MONOTONIC, &ts) followed by error handling
and conversion to bintime.  The loop is missing the equivalent of
sched_pin().  I don't know exactly how to do that in userland.  Is
there a single syscall that does it?  I used cpuset(1).

Bruce
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Reply via email to