On Sat, 18 Jul 2015, Mark Johnston wrote:

Log:
 Pass the lock object to lockstat_nsecs() and return immediately if
 LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and
 we don't want to recurse if lockstat probes are enabled.

It is an error to call timecounter code from a low-level place like the
mutex implementation.  This workaround depends on all locks in timecounter
handlers being LO_NOPROFILE, and that breaks profiling of these locks.
The breakage is largest if the locks are used for more than timecounters.
E.g., the one for i386's obsolescent i8254 timer has to be used for all
accesses to the i8254.  This lock is configured as MTX_SPIN | MTX_NOPROFILE
and is perhaps the main one fixed by this commit.

KTR handles this problem badly in a different way.  It abuses
get_cyclecount().  The abuse is parametrized by defining KTR_TIME as
get_cyclecount() and using KTR_TIME.  The definition of KTR_TIME is
ifdefed but not in a usable way.  This is of course undocumented, and
KTR_TIME is not a normal kernel option so it is hard to define in
config files.
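For reference, the pattern in question looks roughly like this in kern_ktr.c (paraphrased, not an exact quote):

```c
/*
 * The default timestamp source is get_cyclecount(), overridable only
 * by defining KTR_TIME before this point -- which no normal kernel
 * option does.
 */
#ifndef KTR_TIME
#define	KTR_TIME	get_cyclecount()
#endif
```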

KTR needs to be about as careful about this as lockstat since it is
called from low level mutex code for debugging.  I think there is nothing
like LO_NOPROFILE to turn it off.  Of course you can turn off the mutex
code's calls to it using MTX_QUIET or MTX_NOWITNESS, but this breaks much
more than lock profiling.  (I rarely use any lock debugging, but when
I did I found it useful to use it without WITNESS_SKIPSPIN and with
MTX_QUIET and MTX_NOWITNESS turned off for the locks being debugged.
Some console drivers use spinlocks with MTX_QUIET or MTX_NOWITNESS to
avoid various problems, and this breaks witnessing of them even for
non-console uses.  Unbreaking their witnessing uncovered further bugs.)

Abusing get_cyclecount() in KTR doesn't even fix the problem with
timecounters, since some implementations of get_cyclecount() (some
arm, some i386) use the timecounter.


Other bugs with using get_cyclecount():
- even if it returns cycles counted by a clock, the frequency of this
  clock is unknown/unobtainable, so there is no way to convert to
  human-readable units
- some arches return a shifted binuptime() that wraps very quickly
  (256 seconds for arm with ARM_ARCH < 6).  Such counters cannot be
  monotonic.
- some arches (arm again?) used to use binuptime() with swizzling more
  complicated than shifting.  Their counters were further from being
  monotonic.
- arm is now the only arch that uses binuptime() at a low level.
  arm64 uses a stub that returns 1.  ARM_ARCH < 6 does the above.  i386
  returns cpu_ticks().  This is the correct method.  It uses the TSC
  if available, else the raw value of the current hardware timecounter,
  adjusted to avoid wrap.
    (This is not actually a correct method, since its implementation is
    no good.  The wrap adjustment is done with no locking.  Also,
    switching the timecounter hardware breaks the get_cyclecount()
    use even more than the thread resource usage use.  The latter
    has incomplete recalibration.)
- KTR of course has no support for converting to human-readable units.
  ktrdump prints raw timestamps and their differences in %16ju format.
  kern_ktr.c seems to print them only in a ddb command.  This prints
  the raw value in the much worse format %10.10lld.  This uses the
  long long abomination and has a sign error (ktr_timestamp has type
  uint64_t).
- ktr's "timestamps" are thus at best a cookie that increases strictly
  monotonically
- get_cyclecount() is always rdtsc() on modern x86.  rdtsc() is not
  a serializing instruction.  I recently learned how unserial it can
  be -- hundreds of cycles on freefall.  The hundred-cycle case probably
  doesn't happen much in KTR's use.  It happens in the test program
  when the program spins waiting for something and executes rdtsc()
  after that, or so it thinks.  The CPU actually executes rdtsc()
  speculatively hundreds of cycles earlier while spinning.  Back to
  back rdtsc()s seem to be executed in order on freefall, and the
  speculation seems to extend to only the first one.  So there is
  a good chance that get_cyclecount() called on the same CPU gives
  a strictly increasing value.  But across CPUs, the hundred-cycle
  differences are the usual case if there is no synchronization of
  the threads (even when the TSCs on the different CPUs are perfectly
  synchronized at the hardware level).
- get_cyclecount() doesn't bother with synchronization if it uses the
  TSC (or equivalent).  This is a feature.  It is supposed to be
  efficient even if this requires it to return garbage.  But this makes
  it unusable for almost everything.  Certainly for timestamps.
- get_cyclecount()'s documentation has rotted to match its misuse.  It
  is now documented as returning a value that is monotonically
  increasing inside each CPU.  But it never obviously did that on x86
  (rdtsc isn't a serializing instruction, and the implementation doesn't
  even have a compiler barrier), and it doesn't do that on arm < 6
  (wrap occurs after 256 seconds).  Its documentation does a good job
  of describing its non-monotonicity across CPUs.  cpu_ticks() also
  doesn't require any synchronization across CPUs.  This is less clear
  since it has null documentation.

  KTR shouldn't try to synchronize threads to get timestamps that are
  monotonic across CPUs relative to some shared (non-relativistic) clock,
  although that would be useful for debugging synchronization: the
  synchronization would be slow (100-200 cycles per sync on freefall)
  and would break debugging of synchronization bugs by adding
  synchronization.

  Timecounters give timestamps that are almost comparable across CPUs,
  even when they use the TSC timecounter.  They do this by reading a
  shared hardware timecounter, perhaps even with a lock that serializes
  everything, or by reading something like a shared TSC that is not
  fully serialized.  For the TSC, low-level code gives serialization
  for each CPU.  The TSCs are hopefully synced at the hardware level,
  so that when you read them at almost the same physical time on separate
  CPUs you get almost the same value.  The physical time might be affected
  relativistically, but CPUs are not yet quite fast enough for that to
  be a large effect.  The largest effect is probably from pipeline
  delays.

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head