Re: Weird clock behaviour with current (amd64) kernel

2022-08-14 Thread Michael van Elst
On Mon, Aug 15, 2022 at 01:09:55AM +0700, Robert Elz wrote:
>   | N.B. It would be nice if there were a (MI) boot option that could be used
>   | to influence HZ without recompiling. Booting into ddb and patching hz
>   | and a few derived variables isn't that comfortable.
> 
> It would be nice...
> 
> I doubt patching is feasible, there are quite a few places in the code
> which use HZ rather than hz (quite a lot in the DRM code), there also
> appear to be some ports that make assumptions about what HZ is going to
> be - which can be checked at compile time, but which patching would destroy.

Our code should be safe; the old DRM code used:

./drm/dist/bsd-core/drmP.h:#define DRM_HZ   hz

and DRM2 uses:

./drm2/include/asm/param.h:#define  HZ  hz

but obviously this needs to be verified, and there can be weak
assumptions about possible HZ values.

But I doubt such issues would be difficult to find and fix.


Providing a common boot parameter across all ports and port masters,
on the other hand, starts with the question of what color it should
be painted.

Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Weird clock behaviour with current (amd64) kernel

2022-08-14 Thread Robert Elz
Date:Sun, 14 Aug 2022 15:52:35 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | N.B. It would be nice if there were a (MI) boot option that could be used
  | to influence HZ without recompiling. Booting into ddb and patching hz
  | and a few derived variables isn't that comfortable.

It would be nice...

I doubt patching is feasible, there are quite a few places in the code
which use HZ rather than hz (quite a lot in the DRM code), there also
appear to be some ports that make assumptions about what HZ is going to
be - which can be checked at compile time, but which patching would destroy.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-14 Thread Michael van Elst
k...@munnari.oz.au (Robert Elz) writes:

>Pity this won't help with PR 43997 - but I conclude from your response about
>that, that if the host running qemu had HZ set significantly higher than 100
>then qemu (hosting a kernel with HZ==100) would probably work just fine?

Yes.

FreeBSD has the following comment (in sys/conf/NOTES):

# The granularity of operation is controlled by the kernel option HZ (default
# frequency of 1000 Hz or a period 1ms between calls). Virtual machine guests
# use a value of 100.

N.B. It would be nice if there were a (MI) boot option that could be used
to influence HZ without recompiling. Booting into ddb and patching hz
and a few derived variables isn't that comfortable.



Re: Weird clock behaviour with current (amd64) kernel

2022-08-14 Thread Robert Elz
Date:Sun, 14 Aug 2022 00:28:38 +0200
From:Joerg Sonnenberger 
Message-ID:  

  | Does the interrupt rate match HZ?

Yes.   I rebooted the system I tested yesterday (which behaved just the
same as reported then .. which I guess was technically the early hours
of this morning, but to me, that was yesterday).

I ran the same tests (actually more of them - this time I also tested
switching to ACPI-Fast and i8254, the next two highest q= values in the
kern.timecounter.choice output after TSC and hpet0).

This time I watched interrupt rates while all of that was happening, and
the cpu0 timer interrupt rate stuck on a nice constant 100/sec for all of
this (didn't even see a 99 or 101).

The results of the tests were unchanged while TSC and hpet0 were being used
(with TSC, time ran about 2.6 times slower than real time, and sleep 1
paused for 1 of those long seconds; with hpet0 (after TSC) time ran at the
same rate as any other clock in the room, but sleep 1 would take 2 or 3
seconds (pattern of 3 3 2 3 3 2 3 3 2 ... indicating it was probably really
taking ~2.66 seconds, but I didn't run the test long enough to calculate to
2 decimal places).

Changing to ACPI-Fast or i8254 changed nothing, time kept running at
correct clock speed, and sleep kept on being slow.

Date: Sat, 13 Aug 2022 21:32:15 - (UTC)
From: mlel...@serpens.de (Michael van Elst)
Message-id: 

  | >In your case, you say it takes ~6 minutes between attachment and
  | >calibration and your hpet runs at 19.2MHz.

Well, to be strictly accurate, you concluded the former (I don't know
enough about the process to specify attachment and calibration) and
extracted the latter from the dmesg fragment I sent - but yes, that all
sounds correct.

  | >This is enough for HPET_MCOUNT_LO to overflow.

  | This patch adds a separate delay of ~0.1 seconds to calibrate the timers.
  | This should avoid any overflow.

I applied that patch (to dev/ic/hpet.c) and also updated x86/x86/cpu.c
to version 1.204, which apart from the RCSID stuff, looks to be identical
to the patched version I ran in the previous test (better to have fewer
patched files, so that's good).

This one worked fine (still boring sitting waiting for all that PCI CONFIG
DUMP data to scroll past) - an extract from the dmesg for this one
(selected manually, so it is more likely that I may have missed something,
but also more likely to contain data that doesn't say "TSC"):

[ 1.00] NetBSD 9.99.99 (JACARANDA:1.1-20220811) #15: Sun Aug 14 16:23:30 +07 2022
[ 1.00] k...@jacaranda.noi.kre.to:/usr/obj/testing/kernels/amd64/JACARANDA_PCI_DUMP
[...]
[ 1.00] timecounter: Timecounters tick every 10.000 msec 
[...]
[ 1.00] timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
[...]
[ 1.03] cpu0: lapic_per_second set to 38400000
[ 1.03] cpu0: TSC freq CPUID 3417600000 Hz
[ 1.03] cpu0: TSC freq from CPUID 3417600000 Hz
[ 1.03] cpu0: [re]calibrating local timer
[ 1.03] cpu0: apic clock running at 38 MHz
[...]
[ 1.03] timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
[ 1.027976] acpi0: ACPI-Fast 24-bit timer
[ 1.027976] hpet0 at acpi0: high precision event timer (mem 0xfed00000-0xfed00400)
[ 1.027976] timecounter: Timecounter "hpet0" frequency 19200000 Hz quality 2000
[...]
[ 1.064455] attimer1 at acpi0 (TIMR, PNP0100): io 0x40-0x43,0x50-0x53 irq 0
[...]
[... pci config dump starts happening here ]
[...]
[ 1.064455] cpu0: TSC freq from HPET 3417601000 Hz
[ 1.064455] cpu0: CPU base freq 3400000000 Hz
[ 1.064455] cpu0: CPU max freq 5500000000 Hz
[ 1.064455] cpu0: TSC freq CPUID 3417600000 Hz
[ 1.064455] cpu0: TSC freq calibrated 3417601000 Hz
[ 1.064455] cpu0: [re]calibrating local timer 
[ 1.064455] cpu0: apic clock running at 38 MHz
[ 1.064455] timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
[ 1.064455] UVM: using package allocation scheme, 1 package(s) per bucket
[ 2.073341] cpu1: TSC skew=4 drift=0 
[ 2.073341] cpu1: running
[ 2.091459] cpu2: TSC skew=0 drift=2 
[ 2.091459] cpu2: running
[ 2.161459] cpu3: TSC skew=16 drift=67
[...]
[... omitting most of the cpu TSC skew/drift debug output ]
[...]
[ 3.561459] cpu22: TSC skew=-34 drift=3
[ 3.681459] cpu23: TSC skew=-30 drift=2 
[ 3.681459] cpu23: running
[ 3.751459] timecounter: Timecounter "TSC" frequency 3417601000 Hz quality 3000

That's it for relevant info from there (that I could see anyway - OK, I know
the UVM package allocation scheme line isn't relevant, but it was just 1 line
amongst a block of lines that were relevant!)

The "while sleep 1; do date; done" test worked just fine ie: behaved as it
would on a kernel without the pci config dump.

The cpu timer interrupt rate remained a nice constant 100 for this as well
(this time, I didn't see the point in switching to some [...]

Re: Weird clock behaviour with current (amd64) kernel

2022-08-14 Thread Michael van Elst
s...@stix.id.au (Paul Ripke) writes:

>This is likely somewhat similar to what I reported here:
>http://mail-index.netbsd.org/current-users/2019/07/29/msg036293.html

>tl;dr: weird clock behaviour on GCE micro instances. This at least
>provides a nice easy testbed.

| ACPI-Safe: ntp syncs fine, clock maintains time, but "sleep 5" sleeps about 20s.

That means the system does not maintain 100 clock interrupts per second,
but the emulated ACPI timer is stable.

You could check whether a kernel built with HZ=10 keeps proper time.



Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Paul Ripke
This is likely somewhat similar to what I reported here:
http://mail-index.netbsd.org/current-users/2019/07/29/msg036293.html

tl;dr: weird clock behaviour on GCE micro instances. This at least
provides a nice easy testbed.

-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Michael van Elst
On Sun, Aug 14, 2022 at 09:00:20AM +0700, Robert Elz wrote:
> Date:Sun, 14 Aug 2022 00:28:38 +0200
> From:Joerg Sonnenberger 
> Message-ID:  
> 
>   | I'm more wondering about the LAPIC frequency here. That one is normally
>   | used to drive the clockintr and if that frequency is off, interrupt rate
>   | would be off too. Does the interrupt rate match HZ?
> 
> That's a very good question, I never thought to check that, and should have.
> I will do later today, when I also test Michael's latest patch for HPET
> overflow.
> 
> Thanks both.
> 
> Do you (either of you, or anyone else) consider that what is happening
> here might be related to PR 43997?  If it is, then this might not be
> quite so unimportant as I had been considering it.


PR 43997 is more of a bug in qemu than anything else. You cannot emulate
a 100Hz interrupt when your clock granularity for sleep is 10ms. The best
you can do is catch up on interrupts when you are too late, but that has
other problems. Qemu doesn't catch up, and so the emulated interrupt
effectively runs at 50Hz.

Linux (tickless kernel) has a clock granularity of ideally zero (in reality
limited by clock resolution and CPU speed), so you don't see such a problem
there.

You can still get a discrepancy between sleep time and wall clock
time as these clocks run independently (starting with the fact that
you sleep for "at least" some interval), but that's a different problem.



Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Robert Elz
Date:Sun, 14 Aug 2022 00:28:38 +0200
From:Joerg Sonnenberger 
Message-ID:  

  | I'm more wondering about the LAPIC frequency here. That one is normally
  | used to drive the clockintr and if that frequency is off, interrupt rate
  | would be off too. Does the interrupt rate match HZ?

That's a very good question, I never thought to check that, and should have.
I will do later today, when I also test Michael's latest patch for HPET
overflow.

Thanks both.

Do you (either of you, or anyone else) consider that what is happening
here might be related to PR 43997?  If it is, then this might not be
quite so unimportant as I had been considering it.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Joerg Sonnenberger
On Sun, Aug 14, 2022 at 02:38:07AM +0700, Robert Elz wrote:
> To avoid delays in a message turnaround, this is what sysctl says is
> available (this output is from a normal boot, not the PCIDUMP one).
> 
> kern.timecounter.choice = TSC(q=3000, f=3417601000 Hz) clockinterrupt(q=0, f=100 Hz)
> lapic(q=-100, f=38400000 Hz) hpet0(q=2000, f=19200000 Hz) ACPI-Fast(q=1000, f=3579545 Hz)
> i8254(q=100, f=1193182 Hz) dummy(q=-1000000, f=1000000 Hz)

I'm more wondering about the LAPIC frequency here. That one is normally
used to drive the clockintr and if that frequency is off, interrupt rate
would be off too. Does the interrupt rate match HZ?

Joerg


Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Michael van Elst
mlel...@serpens.de (Michael van Elst) writes:

>In your case, you say it takes ~6 minutes between attachment and
>calibration and your hpet runs at 19.2MHz.

>This is enough for HPET_MCOUNT_LO to overflow.



This patch adds a separate delay of ~0.1 seconds to calibrate
the timers. This should avoid any overflow.


Index: sys/dev/ic/hpet.c
===
RCS file: /cvsroot/src/sys/dev/ic/hpet.c,v
retrieving revision 1.17
diff -p -u -r1.17 hpet.c
--- sys/dev/ic/hpet.c   16 May 2020 23:06:40 -  1.17
+++ sys/dev/ic/hpet.c   13 Aug 2022 21:24:58 -
@@ -54,8 +54,6 @@ static u_int  hpet_get_timecount(struct t
 static boolhpet_resume(device_t, const pmf_qual_t *);
 
 static struct hpet_softc *hpet0 __read_mostly;
-static uint32_t hpet_attach_val;
-static uint64_t hpet_attach_tsc;
 
 int
 hpet_detach(device_t dv, int flags)
@@ -147,14 +145,6 @@ hpet_attach_subr(device_t dv)
eval = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
val = eval - sval;
sc->sc_adj = (int64_t)val * sc->sc_period / 1000;
-
-   /* Store attach-time values for computing TSC frequency later. */
-   if (cpu_hascounter() && sc == hpet0) {
-   (void)cpu_counter();
-   val = bus_space_read_4(sc->sc_memt, sc->sc_memh, 
HPET_MCOUNT_LO);
-   hpet_attach_tsc = cpu_counter();
-   hpet_attach_val = val;
-   }
 }
 
 static u_int
@@ -214,33 +204,37 @@ uint64_t
 hpet_tsc_freq(void)
 {
struct hpet_softc *sc;
-   uint64_t td, val, freq;
-   uint32_t hd;
+   uint64_t td0, td, val, freq;
+   uint32_t hd0, hd;
int s;
 
if (hpet0 == NULL || !cpu_hascounter())
return 0;
 
-   /* Slow down if we got here from attach in under 0.1s. */
sc = hpet0;
-   hd = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
-   hd -= hpet_attach_val;
-   if (hd < (uint64_t)100000 * 1000000000 / sc->sc_period)
-   hpet_delay(100000);
+
+   s = splhigh();
+   (void)cpu_counter();
+   (void)bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
+   hd0 = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
+   td0 = cpu_counter();
+   splx(s);
+
+   hpet_delay(100000);
 
/*
 * Determine TSC freq by comparing how far the TSC and HPET have
-* advanced since attach time.  Take the cost of reading HPET
-* register into account and round result to the nearest 1000.
+* advanced and round result to the nearest 1000.
 */
s = splhigh();
(void)cpu_counter();
+   (void)bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
hd = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
td = cpu_counter();
splx(s);
-   hd -= hpet_attach_val;
-   val = ((uint64_t)hd * sc->sc_period - sc->sc_adj) / 100000000;
-   freq = (td - hpet_attach_tsc) * 10000000 / val;
+
+   val = (uint64_t)(hd - hd0) * sc->sc_period / 100000000;
+   freq = (td - td0) * 10000000 / val;
return rounddown(freq + 500, 1000);
 }
 




Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Michael van Elst
On Sun, Aug 14, 2022 at 02:38:07AM +0700, Robert Elz wrote:
> Date:Sat, 13 Aug 2022 17:41:05 +0200
> From:Michael van Elst 
> Message-ID:  
> 
>   | If you boot the kernel in debug mode (netbsd -x),
> 
> I did.
> 
>   | you may see output like:
> 
> which was:
> 
> [ 1.03] cpu0: TSC freq CPUID 3417600000 Hz
> [ 1.03] cpu0: TSC freq from CPUID 3417600000 Hz
> [ 1.064451] xhci0: hcc2=0x1fd
> [ 1.064451] xhci3: hcc2=0xfd
> [ 1.064451] cpu0: TSC freq from HPET 9007294000 Hz
> [ 1.064451] cpu0: TSC freq CPUID 3417600000 Hz
> [ 1.064451] cpu0: TSC freq calibrated 9007294000 Hz


So it's the HPET calibration that goes wrong.


The calibration works like:

Fetch hpet and tsc at attach time.

(void)cpu_counter();
val = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
hpet_attach_tsc = cpu_counter();
hpet_attach_val = val;


When calibrating, make sure that hpet has counted for at
least 0.1 seconds:

hd = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
hd -= hpet_attach_val;
if (hd < (uint64_t)100000 * 1000000000 / sc->sc_period)
hpet_delay(100000);


Fetch hpet and tsc again

s = splhigh();
(void)cpu_counter();
hd = bus_space_read_4(sc->sc_memt, sc->sc_memh, HPET_MCOUNT_LO);
td = cpu_counter();
splx(s);


Compute tsc frequency from hpet frequency.

hd -= hpet_attach_val;
val = ((uint64_t)hd * sc->sc_period - sc->sc_adj) / 100000000;
freq = (td - hpet_attach_tsc) * 10000000 / val;
return rounddown(freq + 500, 1000);


In your case, you say it takes ~6 minutes between attachment and
calibration and your hpet runs at 19.2MHz.

This is enough for HPET_MCOUNT_LO to overflow.


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."



Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Robert Elz
Date:Sat, 13 Aug 2022 17:41:05 +0200
From:Michael van Elst 
Message-ID:  

  | If you boot the kernel in debug mode (netbsd -x),

I did.

  | you may see output like:

which was:

[ 1.03] cpu0: TSC freq CPUID 3417600000 Hz
[ 1.03] cpu0: TSC freq from CPUID 3417600000 Hz
[ 1.064451] xhci0: hcc2=0x1fd
[ 1.064451] xhci3: hcc2=0xfd
[ 1.064451] cpu0: TSC freq from HPET 9007294000 Hz
[ 1.064451] cpu0: TSC freq CPUID 3417600000 Hz
[ 1.064451] cpu0: TSC freq calibrated 9007294000 Hz
[ 1.972808] cpu1: TSC skew=4 drift=0
[ 1.993656] cpu2: TSC skew=0 drift=0
[ 2.060514] cpu3: TSC skew=81 drift=-16
[ 2.187906] cpu4: TSC skew=-61 drift=4
[ 2.213693] cpu5: TSC skew=0 drift=0
[ 2.289145] cpu6: TSC skew=0 drift=0
[ 2.415347] cpu7: TSC skew=0 drift=0
[ 2.441701] cpu8: TSC skew=-65 drift=4
[ 2.557519] cpu9: TSC skew=0 drift=0
[ 2.583039] cpu10: TSC skew=-2 drift=-2
[ 2.705257] cpu11: TSC skew=68 drift=-127
[ 2.731613] cpu12: TSC skew=0 drift=0
[ 2.799767] cpu13: TSC skew=0 drift=0
[ 2.875167] cpu14: TSC skew=0 drift=0
[ 2.944878] cpu15: TSC skew=-51 drift=49
[ 3.074237] cpu16: TSC skew=-28 drift=2
[ 3.100594] cpu17: TSC skew=-27 drift=-2
[ 3.219004] cpu18: TSC skew=-27 drift=-1
[ 3.244362] cpu19: TSC skew=-31 drift=6
[ 3.365815] cpu20: TSC skew=-9 drift=0
[ 3.391221] cpu21: TSC skew=-5 drift=0
[ 3.512663] cpu22: TSC skew=-33 drift=4
[ 3.538003] cpu23: TSC skew=-3 drift=-1
[ 3.654063] timecounter: Timecounter "TSC" frequency 9007294000 Hz quality 3000

(ignore the xhci lines, they just matched the grep pattern "TSC").

A normal boot (no PCIDUMP) ends with:

[ 1.678054] timecounter: Timecounter "TSC" frequency 3417601000 Hz quality 3000

jacaranda$ bc
scale=3
9007294000/3417600000
2.635

which is almost exactly the slowdown I was seeing in user mode once
it booted.   Note that aside from time running slowly, everything else
seems to work fine.   After switching to hpet0 (after having booted
as above) time advances at the proper rate again, but internal delays
(as in that needed for "sleep 1" to sleep for a second) keep running
at the slow rate.

In the dmesg, everything that happens at timestamp 1.064451 is while
the PCI CONFIG DUMP code is running - that lasts about 6 minutes, maybe a
bit more with -x (I didn't time it this time), during which the timestamps
don't advance.   The timestamps that happen after that are running on
slow time (a 1 second increment is really about 2.6 seconds).

  | "from delay" is the first calibration against the i8254 timer.

I don't have any "from delay" output.

  | The patch should improve the accuracy of the "from delay" value.

In that case it is no surprise that nothing changed much.  I am
guessing that that line would appear without -x as well, the other
ones do - I don't see a "from delay" line in any dmesg output I have
saved (unless that was something new you just added, and does only
appear with -x - in which case it simply wasn't invoked on my system).

  | It's also the only place that could have been influenced by e.g. console
  | output.

Clearly there's some other place, the only difference between the kernel
that has this problem, and one that doesn't, is the PCIDUMP option being
enabled:

jacaranda$ (command cd ~/src/sys-conf; diff JACARANDA JACARANDA_PCI_DUMP)
207c207
< #options  PCI_CONFIG_DUMP # verbosely dump PCI config space
---
> options   PCI_CONFIG_DUMP # verbosely dump PCI config space
jacaranda$ 


  | If you have a working HPET,

I do, that one works fine in general.

  | the second calibration should be better.
  | Here it always returns exactly the same number.

See above.

To avoid delays in a message turnaround, this is what sysctl says is
available (this output is from a normal boot, not the PCIDUMP one).

kern.timecounter.choice = TSC(q=3000, f=3417601000 Hz) clockinterrupt(q=0, f=100 Hz)
lapic(q=-100, f=38400000 Hz) hpet0(q=2000, f=19200000 Hz) ACPI-Fast(q=1000, f=3579545 Hz)
i8254(q=100, f=1193182 Hz) dummy(q=-1000000, f=1000000 Hz)


As I have said before, this is most probably an exceedingly unimportant
issue, not worth spending a lot of time on.   That is, unless perhaps it
is exposing some other bug that may possibly reveal itself in some other
situations.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Robert Elz
Date:Sun, 7 Aug 2022 09:17:52 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  


  | Does this help ?
  |
  | Index: sys/arch/x86/x86/cpu.c

Finally got a chance to test that patch properly - sorry for the delay.

The result:  "not much" if anything at all.

Running that system (after booting with PCIDUMP enabled) and using TSC
as the clock source, a

while sleep 1; do date; done

loop prints the date (which increases a second at a time) about every
2.6 seconds (previously it was about 3 - but that was a less precise
measurement).

After switching to hpet0 (no reboot) the same loop runs at about the
same speed, but the date (time really) printed is advancing in accordance
with clock time - successive times reported differ by 2 or 3 seconds.

I have the dmesg from the boot (I used -x for this one) if that might
help (or some parts of it perhaps - the file is just a little under 1.5MB).

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-13 Thread Michael van Elst
On Sat, Aug 13, 2022 at 10:15:30PM +0700, Robert Elz wrote:
> 
> The result:  "not much" if anything at all.

If you boot the kernel in debug mode (netbsd -x), you may see
output like:


[ 1.03] cpu0: TSC freq from delay 2521276800 Hz

maybe also something like:
[ ] cpu0: TSC freq from CPUID XX Hz

[ 1.057594] cpu0: TSC freq from HPET 2491906000 Hz
[ 1.957885] cpu1: TSC skew=8 drift=0
[ 2.014612] cpu2: TSC skew=34 drift=4
[ 2.181611] cpu3: TSC skew=34 drift=4
[ 2.291306] timecounter: Timecounter "TSC" frequency 2491906000 Hz quality 3000


"from delay" is the first calibration against the i8254 timer.
"from CPUID" is a value that the CPU reports.
"from HPET" is a second calibration against the HPET timer.


The patch should improve the accuracy of the "from delay" value. It's
also the only place that could have been influenced by e.g. console
output.

If you have a working HPET, the second calibration should be better.
Here it always returns exactly the same number.


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Weird clock behaviour with current (amd64) kernel

2022-08-07 Thread Robert Elz
Date:Sun, 7 Aug 2022 09:17:52 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | Does this help ?

I'll try it and let you know.   Thanks.  It might take a few days.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-07 Thread Michael van Elst
k...@munnari.oz.au (Robert Elz) writes:

>Date:Thu, 4 Aug 2022 12:49:35 - (UTC)
>From:mlel...@serpens.de (Michael van Elst)
>Message-ID:  

>  | The measurement runs with enabled interrupts. If you have lots of interrupts
>  | or interrupts that take some time, the measurement is biased.
>  |
>  | Console output can do this.

>That is what I suspected.   A normal boot dmesg on this system is about
>50KB.   With PCI_CONFIG_DUMP it is about 1MB (just a bit over).  That's
>a lot of work for wscons scrolling (via the BIOS the whole time - the
>dump all happens before the console switches to graphics mode) a fairly
>large screen.


Does this help ?

Index: sys/arch/x86/x86/cpu.c
===
RCS file: /cvsroot/src/sys/arch/x86/x86/cpu.c,v
retrieving revision 1.203
diff -p -u -r1.203 cpu.c
--- sys/arch/x86/x86/cpu.c  1 Apr 2022 19:57:22 -   1.203
+++ sys/arch/x86/x86/cpu.c  7 Aug 2022 09:17:12 -
@@ -1336,9 +1336,16 @@ cpu_get_tsc_freq(struct cpu_info *ci)
 */
if (ci->ci_data.cpu_cc_freq == 0)
freq = freq_from_cpuid = cpu_tsc_freq_cpuid(ci);
+   if (freq != 0)
+   aprint_debug_dev(ci->ci_dev, "TSC freq "
+   "from CPUID %" PRIu64 " Hz\n", freq);
 #if NHPET > 0
-   if (freq == 0)
+   if (freq == 0) {
freq = hpet_tsc_freq();
+   if (freq != 0)
+   aprint_debug_dev(ci->ci_dev, "TSC freq "
+   "from HPET %" PRIu64 " Hz\n", freq);
+   }
 #endif
if (freq == 0) {
/*
@@ -1348,20 +1355,33 @@ cpu_get_tsc_freq(struct cpu_info *ci)
 */
overhead = 0;
for (int i = 0; i <= 8; i++) {
+   const int s = splhigh();
t0 = cpu_counter();
delay_func(0);
t1 = cpu_counter();
+   splx(s);
if (i > 0) {
overhead += (t1 - t0);
}
}
overhead >>= 3;
 
-   /* Now do the calibration. */
-   t0 = cpu_counter();
-   delay_func(100000);
-   t1 = cpu_counter();
-   freq = (t1 - t0 - overhead) * 10;
+   /*
+* Now do the calibration.
+*/
+   freq = 0;
+   for (int i = 0; i < 1000; i++) {
+   const int s = splhigh();
+   t0 = cpu_counter();
+   delay_func(100);
+   t1 = cpu_counter();
+   splx(s);
+   freq += t1 - t0 - overhead;
+   }
+   freq = freq * 10;
+
+   aprint_debug_dev(ci->ci_dev, "TSC freq "
+   "from delay %" PRIu64 " Hz\n", freq);
}
if (ci->ci_data.cpu_cc_freq != 0) {
freq_from_cpuid = cpu_tsc_freq_cpuid(ci);




Re: Weird clock behaviour with current (amd64) kernel

2022-08-04 Thread Robert Elz
Date:Thu, 4 Aug 2022 12:49:35 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | The measurement runs with enabled interrupts. If you have lots of interrupts
  | or interrupts that take some time, the measurement is biased.
  |
  | Console output can do this.

That is what I suspected.   A normal boot dmesg on this system is about
50KB.   With PCI_CONFIG_DUMP it is about 1MB (just a bit over).  That's
a lot of work for wscons scrolling (via the BIOS the whole time - the
dump all happens before the console switches to graphics mode) a fairly
large screen.

The effect is that when I do this, the TSC seems to be running about 3
times faster than it actually does, and consequently, timekeeping is
horrid (as long as it remains the clock source).  Switching away from it
causes time to flow at the correct rate - but things like "sleep 1" still
take 3 seconds to complete.

This is not important - PCI_CONFIG_DUMP isn't something that it is rational
to leave turned on in an everyday kernel.   I'll probably stop even bothering
to build that one now.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-04 Thread Michael van Elst
k...@munnari.oz.au (Robert Elz) writes:

>The issue only occurs when I boot a kernel with options PCI_CONFIG_DUMP
>enabled, which is (one way or the other) almost certainly responsible for
>the issue - though the problem you mention may be involved in the kernel's
>failure to detect that the TSC measurement is being messed up by that option.

The measurement runs with enabled interrupts. If you have lots of interrupts
or interrupts that take some time, the measurement is biased.

Console output can do this.



Re: Weird clock behaviour with current (amd64) kernel

2022-08-04 Thread Robert Elz
Date:Wed, 3 Aug 2022 22:30:34 +
From:David Holland 
Message-ID:  

  | so if your previous kernel was fairly old and you
  | still have older boot logs lying around you might check them for
  | such notices.

This is an entirely new system, the oldest kernel it has ever run is 9.99.97
(and that fairly briefly - I had it prepared to install before the hardware
arrived, once it was running I updated the sources and self-built a 9.99.98
(HEAD at the time) system).  It is currently on 9.99.99 from this month
(kernel, userland is still mostly 9.99.98 - though I boot HEAD occasionally
to test X updates).

The issue only occurs when I boot a kernel with options PCI_CONFIG_DUMP
enabled, which is (one way or the other) almost certainly responsible for
the issue - though the problem you mention may be involved in the kernel's
failure to detect that the TSC measurement is being messed up by that option.

I don't really even run with that option any more - it turned out to be
useless (or perhaps useful) for the reason I had it enabled ... I have an
add-in PCIe SATA card that the BIOS sometimes manages to hide, and I wanted
to see if I could see what was happening.   That ended up showing me that
as far as NetBSD is concerned, (when in that state) the card simply does not
exist, the slot it is connected to seems empty.

I don't understand how that happens - I would have thought that NetBSD (with
all the options enabled to make this happen - which I think I have) should be
doing PCI config cycles, and everything connected to the bus should be
responding - and the BIOS should not be involved in any of that at all.
A standard PCI bus, as I remember it, has no hidden pins to cause a device
to not respond to a config request, and a plug in card only has the bus as
its interface to the system.  That's certainly what happened when I used to
write code to do that kind of thing (not on an x86 architecture system) long
long long ago.

But that was before PCIe - while at the time I considered PCI to be the one and
only technology to have emerged from the PC world (hardware and software) that
was well enough designed to stand the test of time, I am not sure that PCIe
has continued down the same path.  It looks to me to be targeted much more
at the PC world (only), and so perhaps the BIOS is given config options, hidden
from everyone else, which do allow devices to simply vanish - perhaps by being
able to tell the bus slot not to pass on (or through, whatever) CONFIG requests.

On the other hand, and while my experience with these on this system is
very limited, I have never seen Windows or Linux fail to detect the SATA
card (though it may be that they never booted while the BIOS was trying to hide it).

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-08-04 Thread Michael van Elst
dholland-curr...@netbsd.org (David Holland) writes:

>On Thu, Jul 14, 2022 at 08:59:25PM +0700, Robert Elz wrote:
> > I just booted a kernel that I built (from up to date at the time)
> > HEAD sources about 24 hours ago.
> > 
> > Everything seemed to be working fine - until I noticed that all of
> > my clocks (there are several, gkrellm, window manager, a dclock,
> > and an xtu) were all wildly wrong (as in, were moving time forwards
> > incredibly slowly).

>Probably not related, but see PR 56322. I have a machine where the TSC
>is apparently bad, and somewhere in -current a bit more than a year
>ago we stopped detecting that during boot, with negative consequences.


The bad TSC is probably a known erratum that we don't (yet) check for.



Re: Weird clock behaviour with current (amd64) kernel

2022-08-03 Thread David Holland
On Thu, Jul 14, 2022 at 08:59:25PM +0700, Robert Elz wrote:
 > I just booted a kernel that I built (from up to date at the time)
 > HEAD sources about 24 hours ago.
 > 
 > Everything seemed to be working fine - until I noticed that all of
 > my clocks (there are several, gkrellm, window manager, a dclock,
 > and an xtu) were all wildly wrong (as in, were moving time forwards
 > incredibly slowly).

Probably not related, but see PR 56322. I have a machine where the TSC
is apparently bad, and somewhere in -current a bit more than a year
ago we stopped detecting that during boot, with negative consequences.

I used to get boot messages of the form
   autoconfiguration error: ERROR: 2607 cycle TSC drift observed

and these stopped, so if your previous kernel was fairly old and you
still have older boot logs lying around you might check them for
such notices.

-- 
David A. Holland
dholl...@netbsd.org


Re: Weird clock behaviour with current (amd64) kernel

2022-07-20 Thread Robert Elz
Date:Wed, 20 Jul 2022 21:18:33 + (UTC)
From:RVP 
Message-ID:  

  | Yes indeed. src/sys/arch/x86/x86/genfb_machdep.c:1.17 _has_ fixed this
  | little issue quite satisfactorily. Thank you, Michael :)

Oh, yes, Sorry - meant to send a message like that a couple of days ago,
but it just never happened...

Thanks, my all-cyan boot messages are back.

Not sure if this is worth a pullup to -9 or not.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-07-20 Thread RVP

On Sat, 16 Jul 2022, Robert Elz wrote:


> But it appears as if mlelstv@ might have fixed the colour issue now anyway.



Yes indeed. src/sys/arch/x86/x86/genfb_machdep.c:1.17 _has_ fixed this
little issue quite satisfactorily. Thank you, Michael :)

-RVP



Re: Weird clock behaviour with current (amd64) kernel

2022-07-18 Thread Masanobu SAITOH
Hi,

On 2022/07/15 21:38, Robert Elz wrote:
> Date:Fri, 15 Jul 2022 13:32:55 +0900
> From:Masanobu SAITOH 
> Message-ID:  <87553319-950b-7dad-ac64-29d2c25d1...@execsw.org>
> 
>   | Could you show me the full dmesg with verbose output?
> 
> I could, but it turns out to be about a megabyte, so unless you need
> to see all that noise, perhaps just this will help:
> 
> jacaranda$ grep -i TSC ~/dmesg*cold
> [ 1.03] cpu0: Use lfence to serialize rdtsc
> [ 1.03] cpu0: TSC freq CPUID 341760 Hz
> [ 1.031552] cpu0: TSC freq CPUID 341760 Hz
> [ 1.031552] cpu0: TSC freq calibrated 10483939000 Hz
[ 1.457151] timecounter: Timecounter "TSC" frequency 10483939000 Hz quality 3000
> 
> When that was running, I did a
>   while sleep 1; do date; done
> loop, and watched it compared with time from my phone.
> The date lines showed nicely incrementing seconds, but each
> "sleep 1" lasted for (about) 3 seconds.
> 
> I switched from TSC to hpet0 and repeated the loop, the sleeps
> still slept for 3 seconds, but now time went up in steps of 3.
> (ie: time of day keeping became accurate, internal relative times
> remained broken).
> 
> It is difficult to believe that the ratio 10483939000/341760 (== 3.067)
> is not related here.

Yes, it should be related to the problem.

> A different boot produced this instead
> 
> [ 1.03] cpu0: TSC freq CPUID 341760 Hz
> [ 1.031706] cpu0: TSC freq CPUID 341760 Hz
> [ 1.031706] cpu0: TSC freq calibrated 10491257000 Hz
[ 1.345506] timecounter: Timecounter "TSC" frequency 10491257000 Hz quality 3000
> 
> A similar calibrated value, but not identical.   That kind of looks like
> the PCI_CONFIG_DUMP output/scrolling/something related is interfering with
> the calibration ...

Another possibility is that the LAPIC_ICR_TIMER register is
not set correctly in lapic_calibrate_timer(). The function is called
twice and it refers "lapic_per_second" global variable. The code is
little tricky.

> all the messages with dmesg timestamps of 1.03xxx are
> being produced while all of that is happening.   The 1.000 message is before
> the config dump starts, the 1.3455 or 1.457... messages appear after the
> config dump has ended.
> 
>   | i.e. add -v option to the boot command in /boot.cfg.
> 
> I can't do exactly that, as I cannot find a boot.cfg file anywhere that
> gets used (this is another issue I'm having, which I was going to ask
> about sometime) - I have been modifying the banner= strings on every one
> I can find (turns out I have a lot of them... none being used) so I know
> when one is picked.   None of the boot.cfg files I can find has the ascii
> art flag in the banner - the menu printed by efiboot does have that.
> 
> So, instead I simply hand typed a boot command, with -v at the end.   I'm
> not sure that worked (that is, the system certainly booted, but I don't
> know that the -v flag worked).

It worked. "TSC freq CPUID" is printed in cpu_tsc_freq_cpuid().

aprint_verbose_dev(ci->ci_dev, "TSC freq CPUID %" PRIu64 " Hz\n", freq);

aprint_verbose*() is printed when -v is set.

> If you'd like to see the complete dmesg I can make that available (either
> one from a cold boot, or the subsequent one, from a reboot, where the
> previous boot's dmesg is still in the buffer - that file is about 2MB).
> 
> For comparison, when I boot generic (built from the same kernel sources,
> but no pci config dump of course) what I see is:
> 
> [ 1.03] cpu0: Use lfence to serialize rdtsc
> [ 1.03] cpu0: TSC freq CPUID 341760 Hz
> [ 1.059545] cpu0: TSC freq CPUID 341760 Hz
> [ 1.059545] cpu0: TSC freq calibrated 3417601000 Hz
[ 2.106529] timecounter: Timecounter "TSC" frequency 3417601000 Hz quality 3000
> 
> That there is no info included above which is not also here is why I
> suspect the -v might not have worked .. of course, this is all from a
> simple grep (using -i, which is why that rdtsc line is included).
> Of course, if the additional info you are looking for doesn't contain "tsc"
> (or TSC) then I wouldn't have found it, in which case either I give you
> the whole dmesg,

I'd like to see the debug(-x) output, too.
I'll mail to you privately.


> or you supply a different grep pattern (if I can find one
> line, I can easily extract any relevant looking surrounding lines).
> 
> kre
> 

-- 
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: Weird clock behaviour with current (amd64) kernel

2022-07-16 Thread Robert Elz
Date:Sat, 16 Jul 2022 00:20:41 + (UTC)
From:RVP 
Message-ID:  

  | On Fri, 15 Jul 2022, Robert Elz wrote:
  | > If that is all it is, it is barely worth fixing ... though this
  | > must have happened sometime in the 9.99.9[78] series (sometime
  | > after early last Dec).

  | Farther back than that I think: 9.2_STABLE does the same thing.

Just as likely the relevant change has been pulled up to -9 to get
that result, I haven't booted a -9 with a local kernel recently to
know (but doing that, for other reasons, is possibly going to happen).

But until the graphics update last Dec, I was running HEAD on my
laptop, and updating its kernel frequently, and its cyan console was
100% cyan (no yellow/brown/...) anywhere in sight.   I stopped after
those changes, as my (quite old by then) X server dumped core with the
new drivers, and that wasn't useful...   I believe that issue was fixed,
but before I got around to testing it, the laptop decided to suffer heat
exhaustion (or similar) and really doesn't want to run for more than a
couple of minutes (and right now, is lacking a boot device anyway, I
pulled it).   So I am fairly sure that up until then, all was good
(not using EFI booting though, which now seems to be a difference, it
used EFI partitioning, but using biosboot, with EFI booting, NetBSD seemed
to not find half the hardware ... it does have a fairly early EFI
prom implementation though).

But it appears as if mlelstv@ might have fixed the colour issue now anyway.
(Thanks Michael).   Not that it really was much of an issue.  I will
verify later (maybe tomorrow sometime) - a system with that change
is built already.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-07-16 Thread Robert Elz
Date:Sat, 16 Jul 2022 06:09:26 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | For the green color it doesn't matter if the order is BGR or RGB.
  | For cyan, the wrong order gives "brown" which is a dark yellow.

I tossed up what to call the colour - certainly not bright yellow,
I considered orange (but that would bring to mind products from
Valencia, and that's not it) or polished copper (too artsy for me)
so I just decided to stick with the rgbcym (plus bw) palette,
and yellow was closest...   brown would normally bring to mind
something darker, but otherwise it works too.

This is not important, I mentioned it only on the off chance that
it might be a useful clue to someone looking for more serious
problems.   It obviously isn't, so I think we can forget it now.

The clock issue is more interesting however, something is clearly
not being handled quite right there.

kre


Re: Weird clock behaviour with current (amd64) kernel

2022-07-16 Thread Michael van Elst
r...@sdf.org (RVP) writes:

>Unsurprisingly, EFI also has a colour-index similar to VGA (see:
>/usr/src/sys/external/bsd/gnu-efi/dist/inc/eficon.h). I tried fixing the
>indexes like this, but, it doesn't work for some (autoconfig?) reason. Can
>only look into this after I come back from my road-trip.

That color index is used by text mode, but booting from EFI uses
a graphics framebuffer (nowadays mostly 24bit or 32bit per pixel).


But all this color shift is unrelated to the color indexes; it comes
from how the framebuffer pixels are organized.


The early console code has no information about byte order in
the framebuffer. rasops then initializes e.g. for 32bit pixels:

if (ri->ri_rnum == 0) {
ri->ri_rnum = ri->ri_gnum = ri->ri_bnum = 8;

ri->ri_rpos = 0;
ri->ri_gpos = 8;
ri->ri_bpos = 16;
}

which is 0x00BBGGRR.

When genfb actually attaches it carries information about
the byte ordering and rasops gets initialized with the
right values.

For the green color it doesn't matter if the order is BGR or RGB.
For cyan, the wrong order gives "brown" which is a dark yellow.



Re: Weird clock behaviour with current (amd64) kernel

2022-07-15 Thread RVP

On Fri, 15 Jul 2022, Robert Elz wrote:


> If that is all it is, it is barely worth fixing ... though this
> must have happened sometime in the 9.99.9[78] series (sometime
> after early last Dec).



Farther back than that I think: 9.2_STABLE does the same thing.

On Fri, 15 Jul 2022, Michael van Elst wrote:


> Whatever driver you use either doesn't translate correctly or badly
> assumes some hardware configuration (e.g. color palette) when booting.



Unsurprisingly, EFI also has a colour-index similar to VGA (see:
/usr/src/sys/external/bsd/gnu-efi/dist/inc/eficon.h). I tried fixing the
indexes like this, but, it doesn't work for some (autoconfig?) reason. Can
only look into this after I come back from my road-trip.

---
diff -urN a/src/sys/arch/x86/x86/genfb_machdep.c b/src/sys/arch/x86/x86/genfb_machdep.c
--- a/src/sys/arch/x86/x86/genfb_machdep.c  2021-01-28 01:57:31.0 +
+++ b/src/sys/arch/x86/x86/genfb_machdep.c  2022-07-15 23:29:07.334415944 +
@@ -163,24 +163,60 @@
return 1;
 }

+/*
+ * translate WS(=ANSI) color codes to EFI/PC ones
+ */
+#define bg(x) ((x)<<4U)
+static const unsigned char fgansitopc[] = {
+   WSCOL_BLACK, WSCOL_BLUE, WSCOL_GREEN, WSCOL_CYAN, WSCOL_RED,
+   WSCOL_MAGENTA, WSCOL_BROWN, WSCOL_WHITE,
+   WSCOL_LIGHT_GREY, WSCOL_LIGHT_BLUE, WSCOL_LIGHT_GREEN, WSCOL_LIGHT_CYAN,
+   WSCOL_LIGHT_RED, WSCOL_LIGHT_MAGENTA, WSCOL_LIGHT_BROWN, WSCOL_LIGHT_WHITE
+}, bgansitopc[] = {
+   bg(WSCOL_BLACK), bg(WSCOL_BLUE), bg(WSCOL_GREEN), bg(WSCOL_CYAN),
+   bg(WSCOL_RED), bg(WSCOL_MAGENTA), bg(WSCOL_BROWN), bg(WSCOL_LIGHT_GREY)
+};
+#undef bg
+
+static void
+x86_genfb_pc_colour(long *attr)
+{
+   uint32_t fg, bg, flag;
+   long aa;
+
+   rasops_unpack_attr(*attr, &fg, &bg, NULL);
+   if (__predict_false(fg >= sizeof(fgansitopc) || bg >= sizeof(bgansitopc))) {
+   aprint_normal("x86_genfb_pc_colour: out of range\n");
+   return;
+   }
+   flag = *attr & 0xU;
+   aa = fgansitopc[fg] << 24 | bgansitopc[bg] << 16 | flag;
+   aprint_normal("x86_genfb_pc_colour: old = 0x%lX, new = 0x%lX\n", *attr, aa);
+   *attr = aa;
+}
+
 int
 x86_genfb_cnattach(void)
 {
static int ncalls = 0;
struct rasops_info *ri = _genfb_console_screen.scr_ri;
-   long defattr;
+   long defattr = 0;

/* XXX jmcneill
 *  Defer console initialization until UVM is initialized
 */
+   aprint_normal("x86_genfb_cnattach()\n");
++ncalls;
if (ncalls < 3)
return -1;

-   if (!x86_genfb_init())
+   if (!x86_genfb_init()) {
+   aprint_normal("x86_genfb_init ERROR\n");
return 0;
+   }

ri->ri_ops.allocattr(ri, 0, 0, 0, &defattr);
+   x86_genfb_pc_colour(&defattr);
wsdisplay_preattach(_genfb_stdscreen, ri, 0, 0, defattr);

return 1;
---

-RVP


Re: Weird clock behaviour with current (amd64) kernel

2022-07-15 Thread Robert Elz
Date:Fri, 15 Jul 2022 13:32:55 +0900
From:Masanobu SAITOH 
Message-ID:  <87553319-950b-7dad-ac64-29d2c25d1...@execsw.org>

  | Could you show me the full dmesg with verbose output?

I could, but it turns out to be about a megabyte, so unless you need
to see all that noise, perhaps just this will help:

jacaranda$ grep -i TSC ~/dmesg*cold
[ 1.03] cpu0: Use lfence to serialize rdtsc
[ 1.03] cpu0: TSC freq CPUID 341760 Hz
[ 1.031552] cpu0: TSC freq CPUID 341760 Hz
[ 1.031552] cpu0: TSC freq calibrated 10483939000 Hz
[ 1.457151] timecounter: Timecounter "TSC" frequency 10483939000 Hz quality 3000

When that was running, I did a
while sleep 1; do date; done
loop, and watched it compared with time from my phone.
The date lines showed nicely incrementing seconds, but each
"sleep 1" lasted for (about) 3 seconds.

I switched from TSC to hpet0 and repeated the loop, the sleeps
still slept for 3 seconds, but now time went up in steps of 3.
(ie: time of day keeping became accurate, internal relative times
remained broken).

It is difficult to believe that the ratio 10483939000/341760 (== 3.067)
is not related here.
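The arithmetic behind that suspicion can be checked directly. This is only a back-of-envelope sketch; the two frequencies are the calibrated values quoted in this message, and taking the GENERIC boot's value as the true TSC rate is an assumption:

```python
bad_calibration = 10_483_939_000   # Hz, from the PCI_CONFIG_DUMP boot
true_rate       = 3_417_601_000    # Hz, the GENERIC boot's calibrated value

# To time out a nominal 1-second sleep the kernel waits for
# bad_calibration ticks, but the hardware only delivers true_rate
# ticks per real second, so the sleep stretches by their ratio:
stretch = bad_calibration / true_rate
print(f"'sleep 1' lasts about {stretch:.3f} real seconds")   # ~3.068
```

Which matches the observed behaviour: each "sleep 1" lasting about 3 seconds, while time-of-day (once on hpet0) advances in steps of 3.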

A different boot produced this instead

[ 1.03] cpu0: TSC freq CPUID 341760 Hz
[ 1.031706] cpu0: TSC freq CPUID 341760 Hz
[ 1.031706] cpu0: TSC freq calibrated 10491257000 Hz
[ 1.345506] timecounter: Timecounter "TSC" frequency 10491257000 Hz quality 3000

A similar calibrated value, but not identical.   That kind of looks like
the PCI_CONFIG_DUMP output/scrolling/something related is interfering with
the calibration ... all the messages with dmesg timestamps of 1.03xxx are
being produced while all of that is happening.   The 1.000 message is before
the config dump starts, the 1.3455 or 1.457... messages appear after the
config dump has ended.

  | i.e. add -v option to the boot command in /boot.cfg.

I can't do exactly that, as I cannot find a boot.cfg file anywhere that
gets used (this is another issue I'm having, which I was going to ask
about sometime) - I have been modifying the banner= strings on every one
I can find (turns out I have a lot of them... none being used) so I know
when one is picked.   None of the boot.cfg files I can find has the ascii
art flag in the banner - the menu printed by efiboot does have that.

So, instead I simply hand typed a boot command, with -v at the end.   I'm
not sure that worked (that is, the system certainly booted, but I don't
know that the -v flag worked).

If you'd like to see the complete dmesg I can make that available (either
one from a cold boot, or the subsequent one, from a reboot, where the
previous boot's dmesg is still in the buffer - that file is about 2MB).

For comparison, when I boot generic (built from the same kernel sources,
but no pci config dump of course) what I see is:

[ 1.03] cpu0: Use lfence to serialize rdtsc
[ 1.03] cpu0: TSC freq CPUID 341760 Hz
[ 1.059545] cpu0: TSC freq CPUID 341760 Hz
[ 1.059545] cpu0: TSC freq calibrated 3417601000 Hz
[ 2.106529] timecounter: Timecounter "TSC" frequency 3417601000 Hz quality 3000

That there is no info included above which is not also here is why I
suspect the -v might not have worked .. of course, this is all from a
simple grep (using -i, which is why that rdtsc line is included).
Of course, if the additional info you are looking for doesn't contain "tsc"
(or TSC) then I wouldn't have found it, in which case either I give you
the whole dmesg, or you supply a different grep pattern (if I can find one
line, I can easily extract any relevant looking surrounding lines).

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-07-15 Thread Robert Elz
Date:Fri, 15 Jul 2022 05:12:10 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | Does the color shift also happen after a cold boot ?

Yes, it is yellow, rather than cyan, when the system boots,
cold or warm (or if booting warm after linux had borrowed the
system for a few minutes).

For the cold boot, I made it quite cold (not literally, no ice
involved) - shutdown -p, then switched off the power supply, and
removed the power cord, and left it like that about a minute, before
supplying power and booting.

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-07-15 Thread Robert Elz
Date:Fri, 15 Jul 2022 05:12:10 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | Does the color shift also happen after a cold boot ?

Not sure, I rarely do that, and don't think I ever have for a kernel
intended to have cyan text.   I will try it when I next boot in a bit.

kre

ps: wscons seems to be correct, it is whatever is before that which
gets the colours wrong - the early boot messages (eg: the copyright
message output, and everything around and following that until the
graphics switch is made).




Re: Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread Robert Elz
Date:Fri, 15 Jul 2022 13:32:55 +0900
From:Masanobu SAITOH 
Message-ID:  <87553319-950b-7dad-ac64-29d2c25d1...@execsw.org>

  | Could you show me the full dmesg with verbose output?

Sure, it will take me a bit to get to a state where I can reboot again.

You do realize that it will be a very big file (my next kernel, which
will come from today's sources - release currently building - has the
message buffer size bumped to 4MB - that still might not be enough).

kre



Re: Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread Michael van Elst
k...@munnari.oz.au (Robert Elz) writes:

>  | Heh. It's not just Cyan/Yellow; Red and Blue are swapped too, because:
>  |
>  | /usr/src/sys/dev/wscons/wsdisplayvar.h and
>  | /usr/src/sys/dev/ic/pcdisplay.h have different values for those colour=
>s.

>If that is all it is, it is barely worth fixing ... though this
>must have happened sometime in the 9.99.9[78] series (sometime
>after early last Dec) - up to then I was building and running
>custom cyan text console systems (I kept building after that but
>didn't boot them... or not on real hardware) I have been mostly
>running GENERIC (green, which seems unaffected, or everyone would
>be noticing) since then, until very recently...  The yellow was
>just a quirk I didn't bother mentioning until I had another reason
>to send a (semi-related) message.


wsdisplayvar.h has ANSI color codes.
pcdisplay.h has VGA color codes.

wscons drivers need to interpret ANSI color codes in their allocattr
function. The VGA driver translates between ANSI and VGA codes
(dev/ic/vga.c, see the fgansitopc/bgansitopc tables). 24bit
framebuffers using rasops use a colormap (dev/rasops/rasops.c, see
the rasops_cmap).

Whatever driver you use either doesn't translate correctly or badly
assumes some hardware configuration (e.g. color palette) when booting.
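Concretely, the mismatch looks like this. The orderings below follow the standard ANSI and VGA palettes; the kernel's real translation tables are the fgansitopc/bgansitopc arrays mentioned above, and the bit-swap helper here is only an illustrative reimplementation:

```python
# ANSI (wsdisplayvar.h ordering) vs VGA (pcdisplay.h ordering): the two
# palettes differ only in bits 0 and 2 of the index, which is why red
# and blue trade places and cyan comes out brown.

ANSI = ["black", "red", "green", "brown", "blue", "magenta", "cyan", "white"]
VGA  = ["black", "blue", "green", "cyan", "red", "magenta", "brown", "white"]

def ansi_to_vga(idx):
    """Translate an ANSI colour index to VGA by swapping bits 0 and 2."""
    b0, b2 = idx & 1, (idx >> 2) & 1
    return (idx & ~0b101) | (b0 << 2) | b2

# Untranslated: the kernel writes ANSI index 6 (cyan), but the hardware
# palette entry 6 is brown -- hence the "yellow" boot messages.
assert VGA[ANSI.index("cyan")] == "brown"
assert VGA[ANSI.index("red")] == "blue"     # and red/blue swap, as noted

# With the translation applied, every colour lands where it should:
for i, name in enumerate(ANSI):
    assert VGA[ansi_to_vga(i)] == name
```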

Does the color shift also happen after a cold boot ?



Re: Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread Robert Elz
Date:Thu, 14 Jul 2022 23:17:36 + (UTC)
From:RVP 
Message-ID:  <259960d6-9f5d-56c9-6f15-46766387f...@sdf.org>

  | Works OK for me with an updated-just-now customized (ie. just a lot of stuff
  | removed) GENERIC.

My kernel is similar (though still has too much unnecessary stuff, but
that just takes a bit more time to remove).

I am guessing (no more than that) that this might be related to
the pci config dump, which writes tens of thousands of lines to
the console, takes 10 or 15 mins (haven't timed it, but it seems
like hours sitting watching it... it isn't, just seems like it)
during all of which the timestamps added don't advance at all,
10 mins (or however long it takes) later the timestamp is still
1.03 seconds (it starts at 1).   That time advanced from
1 shows time measurement started, that it did not continue shows
something stalled.  After the pci dump is done, the timestamps
start advancing again as autoconfig continues, with the missing
minutes just missing.   Since I wasn't looking for timekeeping
issues, I was not watching to see if those timestamps were
keeping accurate time or not.

  | Heh. It's not just Cyan/Yellow; Red and Blue are swapped too, because:
  |
  | /usr/src/sys/dev/wscons/wsdisplayvar.h and
  | /usr/src/sys/dev/ic/pcdisplay.h have different values for those colours.

If that is all it is, it is barely worth fixing ... though this
must have happened sometime in the 9.99.9[78] series (sometime
after early last Dec) - up to then I was building and running
custom cyan text console systems (I kept building after that but
didn't boot them... or not on real hardware) I have been mostly
running GENERIC (green, which seems unaffected, or everyone would
be noticing) since then, until very recently...  The yellow was
just a quirk I didn't bother mentioning until I had another reason
to send a (semi-related) message.

kre


Re: Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread Masanobu SAITOH



On 2022/07/14 22:59, Robert Elz wrote:
> Hi,
> 
> I just booted a kernel that I built (from up to date at the time)
> HEAD sources about 24 hours ago.
> 
> Everything seemed to be working fine - until I noticed that all of
> my clocks (there are several, gkrellm, window manager, a dclock,
> and an xtu) were all wildly wrong (as in, were moving time forwards
> incredibly slowly).
> 
> You can see the results of that if you compare the Date header, and
> Received header from my local system (jacaranda) with the Received
> header munnari adds (there should be no more than a second or two
> between those - in this message there will be much more).
> 
> When I noticed this, I changed the clock source from TSC to hpet0,
> and since then, the system appears to be advancing time at about
> the right rate - but unlike its normal smooth motion, the second hand
> in xtu (the only one of the clocks that has seconds) looks very
> jerky, and ...
> 
> jacaranda$ while sleep 1; do date; done
> Thu Jul 14 20:43:49 +07 2022
> Thu Jul 14 20:43:52 +07 2022
> Thu Jul 14 20:43:55 +07 2022
> Thu Jul 14 20:43:58 +07 2022
> Thu Jul 14 20:44:01 +07 2022
> 
> that would be much like what it looks like.

Could you show me the full dmesg with verbose output?
i.e. add -v option to the boot command in /boot.cfg.
It shows some message related to the TSC stuff.

Thanks in advance.


> This is a fairly normal kernel, the most notable features of its config
> are:
> 
> options PCIVERBOSE  # verbose PCI device autoconfig messages
> options PCI_CONFIG_DUMP # verbosely dump PCI config space
> options SCSIVERBOSE # human readable SCSI error messages
> options HDAUDIOVERBOSE  # human readable HDAUDIO device names
> 
> options ACPI_SCANPCI# find PCI roots using ACPI
> options MPBIOS  # configure CPUs and APICs using 
> MPBIOS
> options MPBIOS_SCANPCI  # MPBIOS configures PCI roots
> options PCI_INTR_FIXUP  # fixup PCI interrupt routing via ACPI
> options PCI_BUS_FIXUP   # fixup PCI bus numbering
> options PCI_ADDR_FIXUP  # fixup PCI I/O addresses
> options ACPI_ACTIVATE_DEV   # If set, activate inactive devices
> options VGA_POST# in-kernel support for VGA POST
> 
> options MSGBUFSIZE=1049600
> 
> (still not big enough to hold all of what PCI_CONFIG_DUMP produces, I do have
> the dmesg.boot file from it if anyone cares).
> 
> Anyone have any ideas?   Note that the CPU is an Alder Lake - has both
> performance and economy cores which run at different rates (which NetBSD
> knows nothing about, yet, but that's OK) - core speed is always subject
> to variation anyway, so that should not matter (and has not on previous
> kernels).
> 
> My previous kernel did not have most of those options, and managed time well
> enough (I have never really trusted TSC, but it was at least close enough
> that NTP could keep it in line - this is beyond NTP's abilities).
> 
> While I am here, an unrelated matter, one other config option:
> 
> options WS_KERNEL_FG=WSCOL_CYAN
> 
> No way is that related to the time ... I've used that in kernels for decades.
> I mention it now, as it might just provide some assistance to those working
> on the graphics drivers.
> 
> When the system first boots, all the console messages appear in yellow, not
> the normal green, and not cyan.   After the system switches the console to
> graphics mode (it is an nvidia GT930 - running X on wsfb) the messages all
> switch to cyan.  This harms nothing, it's just a bit weird, and I thought
> it might provide a clue to some possible setup errors?
> 
> kre

-- 
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread RVP

On Thu, 14 Jul 2022, Robert Elz wrote:


> Anyone have any ideas?   Note that the CPU is an Alder Lake - has both
> performance and economy cores which run at different rates (which NetBSD
> knows nothing about, yet, but that's OK) - core speed is always subject
> to variation anyway, so that should not matter (and has not on previous
> kernels).



Works OK for me with an updated-just-now customized (ie. just a lot of stuff
removed) GENERIC.


> While I am here, an unrelated matter, one other config option:

> options WS_KERNEL_FG=WSCOL_CYAN

> [...]
> When the system first boots, all the console messages appear in yellow, not
> the normal green, and not cyan.   After the system switches the console to
> graphics mode (it is an nvidia GT930 - running X on wsfb) the messages all
> switch to cyan.  This harms nothing, it's just a bit weird, and I thought
> it might provide a clue to some possible setup errors?



Heh. It's not just Cyan/Yellow; Red and Blue are swapped too, because:

/usr/src/sys/dev/wscons/wsdisplayvar.h and
/usr/src/sys/dev/ic/pcdisplay.h have different values for those colours.

The VGA console fixes the mapping in vga_allocattr()/vga_raster_allocattr(),
but I don't see where the framebuffer consoles are doing that...

-RVP


Weird clock behaviour with current (amd64) kernel

2022-07-14 Thread Robert Elz
Hi,

I just booted a kernel that I built (from up to date at the time)
HEAD sources about 24 hours ago.

Everything seemed to be working fine - until I noticed that all of
my clocks (there are several, gkrellm, window manager, a dclock,
and an xtu) were all wildly wrong (as in, were moving time forwards
incredibly slowly).

You can see the results of that if you compare the Date header, and
Received header from my local system (jacaranda) with the Received
header munnari adds (there should be no more than a second or two
between those - in this message there will be much more).

When I noticed this, I changed the clock source from TSC to hpet0,
and since then, the system appears to be advancing time at about
the right rate - but unlike its normal smooth motion, the second hand
in xtu (the only one of the clocks that has seconds) looks very
jerky, and ...

jacaranda$ while sleep 1; do date; done
Thu Jul 14 20:43:49 +07 2022
Thu Jul 14 20:43:52 +07 2022
Thu Jul 14 20:43:55 +07 2022
Thu Jul 14 20:43:58 +07 2022
Thu Jul 14 20:44:01 +07 2022

that would be much like what it looks like.

This is a fairly normal kernel, the most notable features of its config
are:

options PCIVERBOSE  # verbose PCI device autoconfig messages
options PCI_CONFIG_DUMP # verbosely dump PCI config space
options SCSIVERBOSE # human readable SCSI error messages
options HDAUDIOVERBOSE  # human readable HDAUDIO device names

options ACPI_SCANPCI# find PCI roots using ACPI
options MPBIOS  # configure CPUs and APICs using MPBIOS
options MPBIOS_SCANPCI  # MPBIOS configures PCI roots
options PCI_INTR_FIXUP  # fixup PCI interrupt routing via ACPI
options PCI_BUS_FIXUP   # fixup PCI bus numbering
options PCI_ADDR_FIXUP  # fixup PCI I/O addresses
options ACPI_ACTIVATE_DEV   # If set, activate inactive devices
options VGA_POST# in-kernel support for VGA POST

options MSGBUFSIZE=1049600

(still not big enough to hold all of what PCI_CONFIG_DUMP produces, I do have
the dmesg.boot file from it if anyone cares).

Anyone have any ideas?   Note that the CPU is an Alder Lake - has both
performance and economy cores which run at different rates (which NetBSD
knows nothing about, yet, but that's OK) - core speed is always subject
to variation anyway, so that should not matter (and has not on previous
kernels).

My previous kernel did not have most of those options, and managed time well
enough (I have never really trusted TSC, but it was at least close enough
that NTP could keep it in line - this is beyond NTP's abilities).

While I am here, an unrelated matter, one other config option:

options WS_KERNEL_FG=WSCOL_CYAN

No way is that related to the time ... I've used that in kernels for decades.
I mention it now, as it might just provide some assistance to those working
on the graphics drivers.

When the system first boots, all the console messages appear in yellow, not
the normal green, and not cyan.   After the system switches the console to
graphics mode (it is an nvidia GT930 - running X on wsfb) the messages all
switch to cyan.  This harms nothing, it's just a bit weird, and I thought
it might provide a clue to some possible setup errors?

kre