Re: crash on booting GENERIC.MP since upgrade to Jan 18 snapshot

2022-01-31 Thread Jonathan Gray
On Mon, Jan 31, 2022 at 12:14:32PM -0300, Martin Pieuchot wrote:
> On 31/01/22(Mon) 00:54, Thomas Frohwein wrote:
> > On Sat, 29 Jan 2022 12:15:10 -0300
> > Martin Pieuchot  wrote:
> > 
> > > On 28/01/22(Fri) 23:03, Thomas Frohwein wrote:
> > > > On Sat, 29 Jan 2022 15:19:20 +1100
> > > > Jonathan Gray  wrote:
> > > >   
> > > > > does this diff to revert uvm_fault.c rev 1.124 change anything?  
> > > > 
> > > > Unfortunately no. Same pmap error as in the original bug report occurs
> > > > with a kernel with this diff.  
> > > 
> > > Could you submit a new bug report?  Could you manage to include ps and the
> > > trace of all the CPUs when the pmap corruption occurs?
> > 
> > See below
> > 
> > > 
> > > Do you have some steps to reproduce the corruption?  Which program is
> > > currently running?  Is it multi-threaded?  What is the simplest scenario
> > > to trigger the corruption?
> > 
> > It's during boot of the MP kernel. The only scenario I can provide is
> > booting this machine with an MP kernel from January 18 or newer. If I
> > boot SP kernel, or build an MP kernel with jsg@'s diff that adds
> > `pool_debug = 2`, the panic does _not_ occur.
> 
> This indicates that some race is present and that it is not triggered
> when more context switches occur.
> 
> > Here is some new (hand-typed from a picture) output from when I boot a
> > freshly downloaded snapshot MP kernel from January 30th (note this is an
> > 8-core/16-thread CPU; I have _not_ enabled hyperthreading).  I attached
> > the dmesg from booting bsd.sp, too.
> 
> Thanks, so most CPUs have already reached the idle loop and are not yet
> running anything.
> 
> Nobody is holding the KERNEL_LOCK(); the faulting process obviously
> isn't, and I can't tell which process it is.
> 
> Note that the corruption occurred on CPU2.  We don't know where it
> occurred the previous time.  This is interesting to watch in order to
> understand between which CPUs the race is occurring.
> 
> > ... (boot, see dmesg in original bugs@ submission)
> > wsdisplay0: screen 1-5 added (std, vt100 emulation)
> > iwm0: hw rev 0x200, fw ver 36.ca7b901d.0, address [...]
> > va 7f7fb000 ppa ff000
>  
> That's the faulting address, right?  This is the same as in the first
> report.  It seems to be inside the level 1 page table range, isn't it?
> What does that mean?

It is exactly one page below the amd64 VM_MAXUSER_ADDRESS of 7f7fc000.
A rounding error or an off-by-one somewhere?
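
For illustration only (not taken from the report): the relationship above
can be written as a check against the last user page.  The helper name
below is hypothetical; trunc_page(), PAGE_SIZE and VM_MAXUSER_ADDRESS come
from the usual amd64 kernel headers, and such a check could stand in for
the hard-coded address in the db_enter() suggestion further down.

	#include <sys/param.h>		/* PAGE_SIZE, trunc_page() */
	#include <machine/vmparam.h>	/* VM_MAXUSER_ADDRESS */

	/*
	 * Sketch: true when va lies in the last user page, i.e. exactly
	 * one page below VM_MAXUSER_ADDRESS.
	 */
	static inline int
	is_last_user_page(vaddr_t va)
	{
		return (trunc_page(va) == VM_MAXUSER_ADDRESS - PAGE_SIZE);
	}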

> 
> I don't understand which process is triggering the fault.  Maybe
> somebody (jsg@?) could craft a diff to figure out whether this same
> address faults, and which thread/context faults it, in SP and/or with
> pool_debug = 2.

I'm not sure what you mean here.  It doesn't trigger with pool_debug=2.

It would be interesting to try not starting xenodm on boot; that should
avoid the drm mmap path during boot.

> 
> Something like:
> 
> 	if (va == 0x7f7fb000)
> 		db_enter();
> 
> > panic: pmap_get_ptp: unmanaged user PTP
> > Stopped at db_enter+0x10: popq   %rbp
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > * 28644   1   0   0  0   2K swapper
> > db_enter() at db_enter+0x10
> > panic(81f3dd1f) at panic+0xbf
> > pmap_get_ptp(fd888e52ee58,7f7fb000) at pmap_get_ptp+0x303
> > pmap_enter(fd888e52ee58,7f7fb000,13d151000,3,22) at pmap_enter+0x188
> > uvm_fault_lower(8000156852a0,8000156852d8,800015685220,0) at 
> > uvm_fault_lower+0x63d
> > uvm_fault(fd888e52fdd0,7f7fb000,0,2) at uvm_fault+0x1b3
> > kpageflttrap(800015685420,7f7fbff5) at kpageflttrap+0x12c
> > kerntrap(800015685420) at kerntrap+0x91
> > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> > copyout() at copyout+0x53
> > end trace frame: 0x0, count: 5
> > https://www.openbsd.org/ [...]
> > ddb{2}> show panic
> > *cpu2: pmap_get_ptp: unmanaged user PTP
> > ddb{2}> mach ddbcpu 0
> > Stopped at  x86_ipi_db+0x12:  leave
> > x86_ipi_db(822acff0) at x86_ipi_db+0x12
> > x86_ipi_handler() at x86_ipi_handler+0x80
> > Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> > acpicpu_idle() at acpicpu_idle+0x203
> > sched_idle(f822acff0) at sched_idle+0x280
> > end trace frame: 0x0, count: 10
> > ddb{0}> mach ddbcpu 1
> > Stopped at  x86_ipi_db+0x12:  leave
> > x86_ipi_db(800015363ff0) at x86_ipi_db+0x12
> > x86_ipi_handler() at x86_ipi_handler+0x80
> > Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> > acpicpu_idle() at acpicpu_idle+0x203
> > sched_idle(800015363ff0) at sched_idle+0x280
> > end trace frame: 0x0, count: 10
> > ddb{1}> mach ddbcpu 3
> > Stopped at  x86_ipi_db+0x12:  leave
> > x86_ipi_db(800015375ff0) at x86_ipi_db+0x12
> > x86_ipi_handler() at x86_ipi_handler+0x80
> > Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> > acpicpu_idle() at acpicpu_idle+0x203
> > sched_idle(800015375ff0) at sched_idle+0x280
> > end trace frame: 0x0, count: 10
> > ddb{3}> mach ddbcpu 4
> > Stopped at  

Re: Interrupts hover above 40% when idle on Dell Latitude E7450

2022-01-31 Thread Scott Cheloha
On Sun, Jan 30, 2022 at 10:41:34AM -0500, Ryan Kavanagh wrote:
> On Sun, Jan 30, 2022 at 12:39:02AM -0600, Scott Cheloha wrote:
> > > btrace -e 'profile:hz:100 { @[kstack] = count(); }' > /tmp/btrace.out
> > > 
> > > for ten seconds and ran the output through
> > > 
> > > https://github.com/brendangregg/FlameGraph/raw/master/stackcollapse-bpftrace.pl
> > > https://github.com/brendangregg/FlameGraph/raw/master/flamegraph.pl
> > > 
> > > The output of stackcollapse-bpftrace.pl and flamegraph.pl are attached
> > > as btrace.collapsed and btrace.svg.
> > 
> > The flamegraph suggests that you spent 10% of that time servicing
> > ichiic(4) interrupts from idle.
> > 
> > That could be a fluke though.
> 
> In case it was a fluke, I've regenerated the flamegraph on 7.0
> GENERIC.MP#293 amd64 using 10 seconds of output on an idle machine.
> Please see attached.
> 
> > What does the main systat view look like in the interrupt column?
> > 
> > $ systat 1
> 
> Again on #293:
> 
> Interrupts (range after idling for a few seconds)
>  247 total     (235-260)
>  200 clock     (200-200)
>   21 ipi       (16-23)
>    1 acpi0     (0-1)
>    6 inteldrm  (5-7)
>      azalia1   (0-0)
>   11 iwm0      (10-16)
>      ehci0     (0-0)
>    1 ahci0     (0-1)
>    1 ichiic0   (0-1)
>    6 pckbc0    (0-0)
>      pckbc0    (0-0)

Based on these numbers and the similar-looking flamegraph I'd say
you're spending a relatively large amount of time handling ichiic(4)
interrupts.

I don't know anything about that device, but my guess is that it is
slow, given how much time you're spending in x86_bus_space_io_read_1()
and its _write_1() counterpart.

Someone else is going to have to weigh in on what might be the cause
and solution.

Thank you for providing the traces.



Re: crash on booting GENERIC.MP since upgrade to Jan 18 snapshot

2022-01-31 Thread Martin Pieuchot
On 31/01/22(Mon) 19:18, Jonathan Gray wrote:
> On Mon, Jan 31, 2022 at 12:54:53AM -0700, Thomas Frohwein wrote:
> > On Sat, 29 Jan 2022 12:15:10 -0300
> > Martin Pieuchot  wrote:
> > 
> > > On 28/01/22(Fri) 23:03, Thomas Frohwein wrote:
> > > > On Sat, 29 Jan 2022 15:19:20 +1100
> > > > Jonathan Gray  wrote:
> > > >   
> > > > > does this diff to revert uvm_fault.c rev 1.124 change anything?  
> > > > 
> > > > Unfortunately no. Same pmap error as in the original bug report occurs
> > > > with a kernel with this diff.  
> > > 
> > > Could you submit a new bug report?  Could you manage to include ps and the
> > > trace of all the CPUs when the pmap corruption occurs?
> > 
> > See below
> > 
> > > 
> > > Do you have some steps to reproduce the corruption?  Which program is
> > > currently running?  Is it multi-threaded?  What is the simplest scenario
> > > to trigger the corruption?
> > 
> > It's during boot of the MP kernel. The only scenario I can provide is
> > booting this machine with an MP kernel from January 18 or newer. If I
> > boot SP kernel, or build an MP kernel with jsg@'s diff that adds
> > `pool_debug = 2`, the panic does _not_ occur.
> 
> That pool_debug change also avoids what Paul de Weerd sees on a
> Dell XPS 13 9305 with an i7-1165G7, as does running SP.
> 
> panic: pool_do_get: idrpl: page empty
> Stopped at  db_enter+0x10:  popq   %rbp
>    TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
>
> *293226   4683  0 0x14000  0x2000K drmwq  
> 
How can this error happen?  Does that mean there's a corruption in the
pool?  Is some synchronisation incorrect or some lock missing?

David, you know the pool subsystem better than we do; do you have any
insight?  Thanks!

> db_enter() at db_enter+0x10
> panic(81f08e21) at panic+0xbf
> pool_do_get(823b3710,1,80001d9a11e4) at pool_do_get+0x2f6
> pool_get(823b3710,1) at pool_get+0x96
> idr_alloc(803cc2e0,80fba500,1,0,5) at idr_alloc+0x78
> __drm_mode_object_add(803cc078,80fba500,,1,8102dda0)
>  at __drm_mode_object_add+0xa6
> drm_property_create_blob(803cc078,80,8119ef80) at 
> drm_property_create_blob+0xa7
> drm_property_replace_global_blob(803cc078,80e9c950,80,8119ef80,80e9c828,8095a180)
>  at drm_property_replace_global_blob+0x84
> drm_connector_update_edid_property(80e9c800,8119ef80) at 
> drm_connector_update_edid_property+0x118
> intel_connector_update_modes(80e9c800,8119ef80) at 
> intel_connector_update_modes+0x15
> intel_dp_get_modes(80e9c800) at intel_dp_get_modes+0x33
> drm_helper_probe_single_connector_modes(80e9c800,f00,870) at 
> drm_helper_probe_single_connector_modes+0x353
> drm_client_modeset_probe(80edda00,f00,870) at 
> drm_client_modeset_probe+0x281
> drm_fb_helper_hotplug_event(80edda00) at 
> drm_fb_helper_hotplug_event+0xd3
> end trace frame: 0x80001d9a1800, count: 0
> 
> some tiger lake machines don't see either problem
> for example thinkpad x1 nano, framework laptop
> 
> > 
> > Here is some new (hand-typed from a picture) output from when I boot a
> > freshly downloaded snapshot MP kernel from January 30th (note this is an
> > 8-core/16-thread CPU; I have _not_ enabled hyperthreading).  I attached
> > the dmesg from booting bsd.sp, too.
> > 
> > ... (boot, see dmesg in original bugs@ submission)
> > wsdisplay0: screen 1-5 added (std, vt100 emulation)
> > iwm0: hw rev 0x200, fw ver 36.ca7b901d.0, address [...]
> > va 7f7fb000 ppa ff000
> > panic: pmap_get_ptp: unmanaged user PTP
> > Stopped at db_enter+0x10: popq   %rbp
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > * 28644   1   0   0  0   2K swapper
> > db_enter() at db_enter+0x10
> > panic(81f3dd1f) at panic+0xbf
> > pmap_get_ptp(fd888e52ee58,7f7fb000) at pmap_get_ptp+0x303
> > pmap_enter(fd888e52ee58,7f7fb000,13d151000,3,22) at pmap_enter+0x188
> > uvm_fault_lower(8000156852a0,8000156852d8,800015685220,0) at 
> > uvm_fault_lower+0x63d
> > uvm_fault(fd888e52fdd0,7f7fb000,0,2) at uvm_fault+0x1b3
> > kpageflttrap(800015685420,7f7fbff5) at kpageflttrap+0x12c
> > kerntrap(800015685420) at kerntrap+0x91
> > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> > copyout() at copyout+0x53
> > end trace frame: 0x0, count: 5
> > https://www.openbsd.org/ [...]
> > ddb{2}> show panic
> > *cpu2: pmap_get_ptp: unmanaged user PTP
> > ddb{2}> mach ddbcpu 0
> > Stopped at  x86_ipi_db+0x12:  leave
> > x86_ipi_db(822acff0) at x86_ipi_db+0x12
> > x86_ipi_handler() at x86_ipi_handler+0x80
> > Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> > acpicpu_idle() at acpicpu_idle+0x203
> > sched_idle(f822acff0) at sched_idle+0x280
> > end trace frame: 0x0, count: 10
> > ddb{0}> mach ddbcpu 1
> > Stopped at  

Re: crash on booting GENERIC.MP since upgrade to Jan 18 snapshot

2022-01-31 Thread Martin Pieuchot
On 31/01/22(Mon) 00:54, Thomas Frohwein wrote:
> On Sat, 29 Jan 2022 12:15:10 -0300
> Martin Pieuchot  wrote:
> 
> > On 28/01/22(Fri) 23:03, Thomas Frohwein wrote:
> > > On Sat, 29 Jan 2022 15:19:20 +1100
> > > Jonathan Gray  wrote:
> > >   
> > > > does this diff to revert uvm_fault.c rev 1.124 change anything?  
> > > 
> > > Unfortunately no. Same pmap error as in the original bug report occurs
> > > with a kernel with this diff.  
> > 
> > Could you submit a new bug report?  Could you manage to include ps and the
> > trace of all the CPUs when the pmap corruption occurs?
> 
> See below
> 
> > 
> > Do you have some steps to reproduce the corruption?  Which program is
> > currently running?  Is it multi-threaded?  What is the simplest scenario
> > to trigger the corruption?
> 
> It's during boot of the MP kernel. The only scenario I can provide is
> booting this machine with an MP kernel from January 18 or newer. If I
> boot SP kernel, or build an MP kernel with jsg@'s diff that adds
> `pool_debug = 2`, the panic does _not_ occur.

This indicates that some race is present and that it is not triggered
when more context switches occur.

> Here is some new (hand-typed from a picture) output from when I boot a
> freshly downloaded snapshot MP kernel from January 30th (note this is an
> 8-core/16-thread CPU; I have _not_ enabled hyperthreading).  I attached
> the dmesg from booting bsd.sp, too.

Thanks, so most CPUs have already reached the idle loop and are not yet
running anything.

Nobody is holding the KERNEL_LOCK(); the faulting process obviously
isn't, and I can't tell which process it is.

Note that the corruption occurred on CPU2.  We don't know where it
occurred the previous time.  This is interesting to watch in order to
understand between which CPUs the race is occurring.

> ... (boot, see dmesg in original bugs@ submission)
> wsdisplay0: screen 1-5 added (std, vt100 emulation)
> iwm0: hw rev 0x200, fw ver 36.ca7b901d.0, address [...]
> va 7f7fb000 ppa ff000
 
That's the faulting address, right?  This is the same as in the first
report.  It seems to be inside the level 1 page table range, isn't it?
What does that mean?

I don't understand which process is triggering the fault.  Maybe
somebody (jsg@?) could craft a diff to figure out whether this same
address faults, and which thread/context faults it, in SP and/or with
pool_debug = 2.

Something like:

	if (va == 0x7f7fb000)
		db_enter();
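
A slightly fleshed-out sketch of the same idea, purely illustrative and
untested: placement at the top of pmap_enter() in
sys/arch/amd64/amd64/pmap.c is an assumption (anywhere on the fault path
that runs before the panic would do), and the printf is only there to
record which thread/context faults the address before dropping into ddb.

	/*
	 * Sketch only, not a tested diff.  Print the faulting context,
	 * then stop in ddb so the other CPUs can be inspected before
	 * the corruption turns into the pmap_get_ptp panic.
	 */
	if (va == 0x7f7fb000) {
		printf("va %lx: %s pid %d tid %d on cpu%d\n", va,
		    curproc->p_p->ps_comm, curproc->p_p->ps_pid,
		    curproc->p_tid, (int)cpu_number());
		db_enter();
	}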

> panic: pmap_get_ptp: unmanaged user PTP
> Stopped at db_enter+0x10: popq   %rbp
> TID   PID UID PRFLAGS PFLAGS CPU COMMAND
> * 28644   1   0   0  0   2K swapper
> db_enter() at db_enter+0x10
> panic(81f3dd1f) at panic+0xbf
> pmap_get_ptp(fd888e52ee58,7f7fb000) at pmap_get_ptp+0x303
> pmap_enter(fd888e52ee58,7f7fb000,13d151000,3,22) at pmap_enter+0x188
> uvm_fault_lower(8000156852a0,8000156852d8,800015685220,0) at 
> uvm_fault_lower+0x63d
> uvm_fault(fd888e52fdd0,7f7fb000,0,2) at uvm_fault+0x1b3
> kpageflttrap(800015685420,7f7fbff5) at kpageflttrap+0x12c
> kerntrap(800015685420) at kerntrap+0x91
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> copyout() at copyout+0x53
> end trace frame: 0x0, count: 5
> https://www.openbsd.org/ [...]
> ddb{2}> show panic
> *cpu2: pmap_get_ptp: unmanaged user PTP
> ddb{2}> mach ddbcpu 0
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(822acff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(f822acff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{0}> mach ddbcpu 1
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(800015363ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(800015363ff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{1}> mach ddbcpu 3
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(800015375ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(800015375ff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{3}> mach ddbcpu 4
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(80001537eff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(80001537eff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{4}> mach ddbcpu 5
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(800015387ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(800015387ff0) at 

Re: crash on booting GENERIC.MP since upgrade to Jan 18 snapshot

2022-01-31 Thread Jonathan Gray
On Mon, Jan 31, 2022 at 12:54:53AM -0700, Thomas Frohwein wrote:
> On Sat, 29 Jan 2022 12:15:10 -0300
> Martin Pieuchot  wrote:
> 
> > On 28/01/22(Fri) 23:03, Thomas Frohwein wrote:
> > > On Sat, 29 Jan 2022 15:19:20 +1100
> > > Jonathan Gray  wrote:
> > >   
> > > > does this diff to revert uvm_fault.c rev 1.124 change anything?  
> > > 
> > > Unfortunately no. Same pmap error as in the original bug report occurs
> > > with a kernel with this diff.  
> > 
> > Could you submit a new bug report?  Could you manage to include ps and the
> > trace of all the CPUs when the pmap corruption occurs?
> 
> See below
> 
> > 
> > Do you have some steps to reproduce the corruption?  Which program is
> > currently running?  Is it multi-threaded?  What is the simplest scenario
> > to trigger the corruption?
> 
> It's during boot of the MP kernel. The only scenario I can provide is
> booting this machine with an MP kernel from January 18 or newer. If I
> boot SP kernel, or build an MP kernel with jsg@'s diff that adds
> `pool_debug = 2`, the panic does _not_ occur.
> 
> Here is some new (hand-typed from a picture) output from when I boot a
> freshly downloaded snapshot MP kernel from January 30th (note this is an
> 8-core/16-thread CPU; I have _not_ enabled hyperthreading).  I attached
> the dmesg from booting bsd.sp, too.
> 
> ... (boot, see dmesg in original bugs@ submission)
> wsdisplay0: screen 1-5 added (std, vt100 emulation)
> iwm0: hw rev 0x200, fw ver 36.ca7b901d.0, address [...]
> va 7f7fb000 ppa ff000
> panic: pmap_get_ptp: unmanaged user PTP
> Stopped at db_enter+0x10: popq   %rbp
> TID   PID UID PRFLAGS PFLAGS CPU COMMAND
> * 28644   1   0   0  0   2K swapper
> db_enter() at db_enter+0x10
> panic(81f3dd1f) at panic+0xbf
> pmap_get_ptp(fd888e52ee58,7f7fb000) at pmap_get_ptp+0x303
> pmap_enter(fd888e52ee58,7f7fb000,13d151000,3,22) at pmap_enter+0x188
> uvm_fault_lower(8000156852a0,8000156852d8,800015685220,0) at 
> uvm_fault_lower+0x63d
> uvm_fault(fd888e52fdd0,7f7fb000,0,2) at uvm_fault+0x1b3
> kpageflttrap(800015685420,7f7fbff5) at kpageflttrap+0x12c
> kerntrap(800015685420) at kerntrap+0x91
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> copyout() at copyout+0x53
> end trace frame: 0x0, count: 5

does this diff to provide stolen memory data help?

Index: sys/dev/pci/drm/i915/i915_drv.c
===
RCS file: /cvs/src/sys/dev/pci/drm/i915/i915_drv.c,v
retrieving revision 1.135
diff -u -p -r1.135 i915_drv.c
--- sys/dev/pci/drm/i915/i915_drv.c 19 Jan 2022 02:20:06 -  1.135
+++ sys/dev/pci/drm/i915/i915_drv.c 31 Jan 2022 11:20:04 -
@@ -2350,6 +2350,7 @@ inteldrm_match(struct device *parent, vo
 }
 
 int drm_gem_init(struct drm_device *);
+void intel_init_stolen_res(struct inteldrm_softc *);
 
 void
 inteldrm_attach(struct device *parent, struct device *self, void *aux)
@@ -2469,6 +2470,7 @@ inteldrm_attach(struct device *parent, s
return;
}
dev->pdev->irq = -1;
+   intel_init_stolen_res(dev_priv);
 
config_mountroot(self, inteldrm_attachhook);
 }
Index: sys/dev/pci/drm/i915/intel_stolen.c
===
RCS file: /cvs/src/sys/dev/pci/drm/i915/intel_stolen.c,v
retrieving revision 1.2
diff -u -p -r1.2 intel_stolen.c
--- sys/dev/pci/drm/i915/intel_stolen.c 14 Jan 2022 06:53:11 -  1.2
+++ sys/dev/pci/drm/i915/intel_stolen.c 31 Jan 2022 11:25:37 -
@@ -163,7 +163,7 @@ intel_init_stolen_res(struct inteldrm_so
 
if (GRAPHICS_VER(dev_priv) >= 3 && GRAPHICS_VER(dev_priv) < 11)
stolen_base  = gen3_stolen_base(dev_priv);
-   else if (GRAPHICS_VER(dev_priv) == 11)
+   else if (GRAPHICS_VER(dev_priv) == 11 || GRAPHICS_VER(dev_priv) == 12)
stolen_base = gen11_stolen_base(dev_priv);
 
if (IS_I830(dev_priv) || IS_I845G(dev_priv))
@@ -177,7 +177,7 @@ intel_init_stolen_res(struct inteldrm_so
stolen_size = gen6_stolen_size(dev_priv);
else if (GRAPHICS_VER(dev_priv) == 8)
stolen_size = gen8_stolen_size(dev_priv);
-   else if (GRAPHICS_VER(dev_priv) >= 9 && GRAPHICS_VER(dev_priv) < 12)
+   else if (GRAPHICS_VER(dev_priv) >= 9 && GRAPHICS_VER(dev_priv) <= 12)
stolen_size = gen9_stolen_size(dev_priv);
 
if (stolen_base == 0 || stolen_size == 0)
Index: sys/dev/pci/drm/i915/gt/intel_ggtt.c
===
RCS file: /cvs/src/sys/dev/pci/drm/i915/gt/intel_ggtt.c,v
retrieving revision 1.4
diff -u -p -r1.4 intel_ggtt.c
--- sys/dev/pci/drm/i915/gt/intel_ggtt.c26 Jan 2022 01:46:12 -  
1.4
+++ sys/dev/pci/drm/i915/gt/intel_ggtt.c31 Jan 2022 11:33:05 -
@@ -1320,10 +1320,10 @@ static int ggtt_probe_hw(struct i915_ggt
}
 
/* GMADR is the 

Re: crash on booting GENERIC.MP since upgrade to Jan 18 snapshot

2022-01-31 Thread Jonathan Gray
On Mon, Jan 31, 2022 at 12:54:53AM -0700, Thomas Frohwein wrote:
> On Sat, 29 Jan 2022 12:15:10 -0300
> Martin Pieuchot  wrote:
> 
> > On 28/01/22(Fri) 23:03, Thomas Frohwein wrote:
> > > On Sat, 29 Jan 2022 15:19:20 +1100
> > > Jonathan Gray  wrote:
> > >   
> > > > does this diff to revert uvm_fault.c rev 1.124 change anything?  
> > > 
> > > Unfortunately no. Same pmap error as in the original bug report occurs
> > > with a kernel with this diff.  
> > 
> > Could you submit a new bug report?  Could you manage to include ps and the
> > trace of all the CPUs when the pmap corruption occurs?
> 
> See below
> 
> > 
> > Do you have some steps to reproduce the corruption?  Which program is
> > currently running?  Is it multi-threaded?  What is the simplest scenario
> > to trigger the corruption?
> 
> It's during boot of the MP kernel. The only scenario I can provide is
> booting this machine with an MP kernel from January 18 or newer. If I
> boot SP kernel, or build an MP kernel with jsg@'s diff that adds
> `pool_debug = 2`, the panic does _not_ occur.

That pool_debug change also avoids what Paul de Weerd sees on a
Dell XPS 13 9305 with an i7-1165G7, as does running SP.

panic: pool_do_get: idrpl: page empty
Stopped at  db_enter+0x10:  popq   %rbp
   TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
 
*293226   4683  0 0x14000  0x2000K drmwq
  
db_enter() at db_enter+0x10
panic(81f08e21) at panic+0xbf
pool_do_get(823b3710,1,80001d9a11e4) at pool_do_get+0x2f6
pool_get(823b3710,1) at pool_get+0x96
idr_alloc(803cc2e0,80fba500,1,0,5) at idr_alloc+0x78
__drm_mode_object_add(803cc078,80fba500,,1,8102dda0)
 at __drm_mode_object_add+0xa6
drm_property_create_blob(803cc078,80,8119ef80) at 
drm_property_create_blob+0xa7
drm_property_replace_global_blob(803cc078,80e9c950,80,8119ef80,80e9c828,8095a180)
 at drm_property_replace_global_blob+0x84
drm_connector_update_edid_property(80e9c800,8119ef80) at 
drm_connector_update_edid_property+0x118
intel_connector_update_modes(80e9c800,8119ef80) at 
intel_connector_update_modes+0x15
intel_dp_get_modes(80e9c800) at intel_dp_get_modes+0x33
drm_helper_probe_single_connector_modes(80e9c800,f00,870) at 
drm_helper_probe_single_connector_modes+0x353
drm_client_modeset_probe(80edda00,f00,870) at 
drm_client_modeset_probe+0x281
drm_fb_helper_hotplug_event(80edda00) at 
drm_fb_helper_hotplug_event+0xd3
end trace frame: 0x80001d9a1800, count: 0

some tiger lake machines don't see either problem
for example thinkpad x1 nano, framework laptop

> 
> Here is some new (hand-typed from a picture) output from when I boot a
> freshly downloaded snapshot MP kernel from January 30th (note this is an
> 8-core/16-thread CPU; I have _not_ enabled hyperthreading).  I attached
> the dmesg from booting bsd.sp, too.
> 
> ... (boot, see dmesg in original bugs@ submission)
> wsdisplay0: screen 1-5 added (std, vt100 emulation)
> iwm0: hw rev 0x200, fw ver 36.ca7b901d.0, address [...]
> va 7f7fb000 ppa ff000
> panic: pmap_get_ptp: unmanaged user PTP
> Stopped at db_enter+0x10: popq   %rbp
> TID   PID UID PRFLAGS PFLAGS CPU COMMAND
> * 28644   1   0   0  0   2K swapper
> db_enter() at db_enter+0x10
> panic(81f3dd1f) at panic+0xbf
> pmap_get_ptp(fd888e52ee58,7f7fb000) at pmap_get_ptp+0x303
> pmap_enter(fd888e52ee58,7f7fb000,13d151000,3,22) at pmap_enter+0x188
> uvm_fault_lower(8000156852a0,8000156852d8,800015685220,0) at 
> uvm_fault_lower+0x63d
> uvm_fault(fd888e52fdd0,7f7fb000,0,2) at uvm_fault+0x1b3
> kpageflttrap(800015685420,7f7fbff5) at kpageflttrap+0x12c
> kerntrap(800015685420) at kerntrap+0x91
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> copyout() at copyout+0x53
> end trace frame: 0x0, count: 5
> https://www.openbsd.org/ [...]
> ddb{2}> show panic
> *cpu2: pmap_get_ptp: unmanaged user PTP
> ddb{2}> mach ddbcpu 0
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(822acff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(f822acff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{0}> mach ddbcpu 1
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(800015363ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x203
> sched_idle(800015363ff0) at sched_idle+0x280
> end trace frame: 0x0, count: 10
> ddb{1}> mach ddbcpu 3
> Stopped at  x86_ipi_db+0x12:  leave
> x86_ipi_db(800015375ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
>