Re: [Intel-gfx] Regression in linux-next

2023-10-13 Thread Borah, Chaitanya Kumar
Hello Rafael,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 11, 2023 10:19 PM
> To: Wysocki, Rafael J 
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> ; Saarinen, Jani 
> Subject: RE: Regression in linux-next
> 
> Hello Rafael,
> 
> > -----Original Message-----
> > From: Wysocki, Rafael J 
> > Sent: Wednesday, October 11, 2023 9:44 PM
> > To: Borah, Chaitanya Kumar 
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > ; Saarinen, Jani
> > 
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> >
> > On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > > Hello Rafael,
> > >
> > >> -----Original Message-----
> > >> From: Wysocki, Rafael J 
> > >> Sent: Tuesday, October 10, 2023 12:54 AM
> > >> To: Borah, Chaitanya Kumar 
> > >> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > >> ; Saarinen, Jani
> > >> 
> > >> Subject: Re: Regression in linux-next
> > >>
> > >> Hi,
> > >>
> > >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael
> > >>>
> >  Thanks for the report, I think that this is a lockdep assertion 
> >  failing.
> >  If that is correct, it should be straightforward to fix.
> >  I'll take care of this early next week.
> >  Thanks!
> > >>> Thank you for your response.  Please let us know when a fix is 
> > >>> available.
> > >> It should be fixed in linux-next from today, by this commit:
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b4027ce7714f309e96b804b7fb088a40d708
> > >>
> > >> Thanks!
> > > Thanks a lot for the fix. This seems to have fixed the issue on most of the
> > > machines, but we are still seeing a similar problem on a few of them.
> >
> > Thanks for reporting this!
> >
> >
> > > This has a different call stack but seems to be from the same
> > > thermal subsystem. Full logs in [1]
> > >
> > > <4>[4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > > <4>[4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > > <4>[4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > > <4>[4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > > <4>[4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > > <4>[4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > > <4>[4.392069] RSP: 0018:c9000156bda8 EFLAGS: 00010246
> > > <4>[4.392073] RAX:  RBX: 888103828ae8 RCX: 0001
> > > <4>[4.392075] RDX: 8000 RSI: 823de5ab RDI: 823fdfba
> > > <4>[4.392078] RBP: 888103a88800 R08: 888103828ae8 R09: 0001
> > > <4>[4.392080] R10: 0001 R11: 88811494d3c0 R12: 888103a88818
> > > <4>[4.392082] R13: 8881108bfa00 R14: 888103794408 R15: 0001
> > > <4>[4.392084] FS:  7f1f0d6d28c0() GS:88822e68() knlGS:
> > > <4>[4.392087] CS:  0010 DS:  ES:  CR0: 80050033
> > > <4>[4.392089] CR2: 55857c50b750 CR3: 000111efa005 CR4: 003706f0
> > > <4>[4.392091] DR0:  DR1:  DR2:
> > > <4>[4.392093] DR3:  DR6: fffe0ff0 DR7: 0400
> > > <4>[4.392095] Call Trace:
> > > <4>[4.392097]  
> > > <4>[4.392100]  ? __warn+0x7f/0x170
> > > <4>[4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[4.392109]  ? report_bug+0x1f8/0x200
> > > <4>[4.392116]  ? handle_bug+0x3c/0x70
> > > <4>[4.392119]  ? exc_invalid_op+0x18/0x70
> > > <4>[4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > > <4>[4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > > <4>[4.392141]  trip_point_show+0x18/0x40
> > > <4>[4.392145]  dev_attr_show+0x15/0x60
> > > <4>[4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > > <4>[4.392154]  seq_read_iter+0x111/0x450
> > > <4>[4.392158]  ? check_object+0x133/0x320
> > > <4>[4.392164]  vfs_read+0x20d/0x300
> > > <4>[4.392175]  ksys_read+0x64/0xe0
> > > <4>[4.392180]  do_syscall_64+0x3c/0x90
> > > <4>[4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > <4>[4.392187] RIP: 0033:0x7f1f0e193392
> > >
> > > Can you please check what could be the reason for this issue?
> >
> > Well, one more unuseful lockdep assertion h

Re: [Intel-gfx] Regression in linux-next

2023-10-11 Thread Borah, Chaitanya Kumar
Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J 
> Sent: Wednesday, October 11, 2023 9:44 PM
> To: Borah, Chaitanya Kumar 
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> ; Saarinen, Jani 
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael,
> >
> >> -----Original Message-----
> >> From: Wysocki, Rafael J 
> >> Sent: Tuesday, October 10, 2023 12:54 AM
> >> To: Borah, Chaitanya Kumar 
> >> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> >> ; Saarinen, Jani
> >> 
> >> Subject: Re: Regression in linux-next
> >>
> >> Hi,
> >>
> >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> >>> Hello Rafael
> >>>
>  Thanks for the report, I think that this is a lockdep assertion failing.
>  If that is correct, it should be straightforward to fix.
>  I'll take care of this early next week.
>  Thanks!
> >>> Thank you for your response.  Please let us know when a fix is available.
> >> It should be fixed in linux-next from today, by this commit:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b4027ce7714f309e96b804b7fb088a40d708
> >>
> >> Thanks!
> > Thanks a lot for the fix. This seems to have fixed the issue on most of the
> > machines, but we are still seeing a similar problem on a few of them.
> 
> Thanks for reporting this!
> 
> 
> > This has a different call stack but seems to be from the same thermal
> > subsystem. Full logs in [1]
> >
> > <4>[4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > <4>[4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > <4>[4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > <4>[4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > <4>[4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > <4>[4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > <4>[4.392069] RSP: 0018:c9000156bda8 EFLAGS: 00010246
> > <4>[4.392073] RAX:  RBX: 888103828ae8 RCX: 0001
> > <4>[4.392075] RDX: 8000 RSI: 823de5ab RDI: 823fdfba
> > <4>[4.392078] RBP: 888103a88800 R08: 888103828ae8 R09: 0001
> > <4>[4.392080] R10: 0001 R11: 88811494d3c0 R12: 888103a88818
> > <4>[4.392082] R13: 8881108bfa00 R14: 888103794408 R15: 0001
> > <4>[4.392084] FS:  7f1f0d6d28c0() GS:88822e68() knlGS:
> > <4>[4.392087] CS:  0010 DS:  ES:  CR0: 80050033
> > <4>[4.392089] CR2: 55857c50b750 CR3: 000111efa005 CR4: 003706f0
> > <4>[4.392091] DR0:  DR1:  DR2:
> > <4>[4.392093] DR3:  DR6: fffe0ff0 DR7: 0400
> > <4>[4.392095] Call Trace:
> > <4>[4.392097]  
> > <4>[4.392100]  ? __warn+0x7f/0x170
> > <4>[4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[4.392109]  ? report_bug+0x1f8/0x200
> > <4>[4.392116]  ? handle_bug+0x3c/0x70
> > <4>[4.392119]  ? exc_invalid_op+0x18/0x70
> > <4>[4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > <4>[4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > <4>[4.392141]  trip_point_show+0x18/0x40
> > <4>[4.392145]  dev_attr_show+0x15/0x60
> > <4>[4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > <4>[4.392154]  seq_read_iter+0x111/0x450
> > <4>[4.392158]  ? check_object+0x133/0x320
> > <4>[4.392164]  vfs_read+0x20d/0x300
> > <4>[4.392175]  ksys_read+0x64/0xe0
> > <4>[4.392180]  do_syscall_64+0x3c/0x90
> > <4>[4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[4.392187] RIP: 0033:0x7f1f0e193392
> >
> > Can you please check what could be the reason for this issue?
> 
> Well, one more unuseful lockdep assertion has been added recently to the
> thermal core, sorry about that.
> 
> This commit
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> 
> that will be merged into linux-next tomorrow if all goes well, should address
> this.

Thank you for the fix. We will wait for it to get merged in linux-next.

Regards

Chaitanya

> 
> Thanks!
> 
> 
> > [1]
> > https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/b
> > o

Re: [Intel-gfx] Regression in linux-next

2023-10-11 Thread Wysocki, Rafael J

Hi,

On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:

Hello Rafael,


-----Original Message-----
From: Wysocki, Rafael J 
Sent: Tuesday, October 10, 2023 12:54 AM
To: Borah, Chaitanya Kumar 
Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
; Saarinen, Jani 
Subject: Re: Regression in linux-next

Hi,

On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:

Hello Rafael


Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

Thank you for your response.  Please let us know when a fix is available.

It should be fixed in linux-next from today, by this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b4027ce7714f309e96b804b7fb088a40d708

Thanks!

Thanks a lot for the fix. This seems to have fixed the issue on most of the 
machines, but we are still seeing a similar problem on a few of them.


Thanks for reporting this!



This has a different call stack but seems to be from the same thermal 
subsystem. Full logs in [1]

<4>[4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
<4>[4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
<4>[4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
<4>[4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
<4>[4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
<4>[4.392069] RSP: 0018:c9000156bda8 EFLAGS: 00010246
<4>[4.392073] RAX:  RBX: 888103828ae8 RCX: 0001
<4>[4.392075] RDX: 8000 RSI: 823de5ab RDI: 823fdfba
<4>[4.392078] RBP: 888103a88800 R08: 888103828ae8 R09: 0001
<4>[4.392080] R10: 0001 R11: 88811494d3c0 R12: 888103a88818
<4>[4.392082] R13: 8881108bfa00 R14: 888103794408 R15: 0001
<4>[4.392084] FS:  7f1f0d6d28c0() GS:88822e68() knlGS:
<4>[4.392087] CS:  0010 DS:  ES:  CR0: 80050033
<4>[4.392089] CR2: 55857c50b750 CR3: 000111efa005 CR4: 003706f0
<4>[4.392091] DR0:  DR1:  DR2:
<4>[4.392093] DR3:  DR6: fffe0ff0 DR7: 0400
<4>[4.392095] Call Trace:
<4>[4.392097]  
<4>[4.392100]  ? __warn+0x7f/0x170
<4>[4.392104]  ? thermal_zone_trip_id+0x61/0x70
<4>[4.392109]  ? report_bug+0x1f8/0x200
<4>[4.392116]  ? handle_bug+0x3c/0x70
<4>[4.392119]  ? exc_invalid_op+0x18/0x70
<4>[4.392123]  ? asm_exc_invalid_op+0x1a/0x20
<4>[4.392133]  ? thermal_zone_trip_id+0x61/0x70
<4>[4.392137]  ? thermal_zone_trip_id+0x5d/0x70
<4>[4.392141]  trip_point_show+0x18/0x40
<4>[4.392145]  dev_attr_show+0x15/0x60
<4>[4.392149]  sysfs_kf_seq_show+0xb5/0x100
<4>[4.392154]  seq_read_iter+0x111/0x450
<4>[4.392158]  ? check_object+0x133/0x320
<4>[4.392164]  vfs_read+0x20d/0x300
<4>[4.392175]  ksys_read+0x64/0xe0
<4>[4.392180]  do_syscall_64+0x3c/0x90
<4>[4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[4.392187] RIP: 0033:0x7f1f0e193392

Can you please check what could be the reason for this issue?


Well, one more unuseful lockdep assertion has been added recently to the 
thermal core, sorry about that.


This commit

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5

that will be merged into linux-next tomorrow if all goes well, should 
address this.


Thanks!



[1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt

Regards

Chaitanya







From: Wysocki, Rafael J 
Sent: Saturday, October 7, 2023 2:01 AM
To: Borah, Chaitanya Kumar 
Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
; Saarinen, Jani

Subject: Re: Regression in linux-next

Hi,
On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
Hello Rafael,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.

Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

On next-20231003 [2], we are seeing the following error

``

Re: [Intel-gfx] Regression in linux-next

2023-10-10 Thread Borah, Chaitanya Kumar
Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J 
> Sent: Tuesday, October 10, 2023 12:54 AM
> To: Borah, Chaitanya Kumar 
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> ; Saarinen, Jani 
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael
> >
> >> Thanks for the report, I think that this is a lockdep assertion failing.
> >> If that is correct, it should be straightforward to fix.
> >> I'll take care of this early next week.
> >> Thanks!
> > Thank you for your response.  Please let us know when a fix is available.
> 
> It should be fixed in linux-next from today, by this commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b4027ce7714f309e96b804b7fb088a40d708
> 
> Thanks!

Thanks a lot for the fix. This seems to have fixed the issue on most of the 
machines, but we are still seeing a similar problem on a few of them.

This has a different call stack but seems to be from the same thermal 
subsystem. Full logs in [1]

<4>[4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
<4>[4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
<4>[4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
<4>[4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
<4>[4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
<4>[4.392069] RSP: 0018:c9000156bda8 EFLAGS: 00010246
<4>[4.392073] RAX:  RBX: 888103828ae8 RCX: 0001
<4>[4.392075] RDX: 8000 RSI: 823de5ab RDI: 823fdfba
<4>[4.392078] RBP: 888103a88800 R08: 888103828ae8 R09: 0001
<4>[4.392080] R10: 0001 R11: 88811494d3c0 R12: 888103a88818
<4>[4.392082] R13: 8881108bfa00 R14: 888103794408 R15: 0001
<4>[4.392084] FS:  7f1f0d6d28c0() GS:88822e68() knlGS:
<4>[4.392087] CS:  0010 DS:  ES:  CR0: 80050033
<4>[4.392089] CR2: 55857c50b750 CR3: 000111efa005 CR4: 003706f0
<4>[4.392091] DR0:  DR1:  DR2:
<4>[4.392093] DR3:  DR6: fffe0ff0 DR7: 0400
<4>[4.392095] Call Trace:
<4>[4.392097]  
<4>[4.392100]  ? __warn+0x7f/0x170
<4>[4.392104]  ? thermal_zone_trip_id+0x61/0x70
<4>[4.392109]  ? report_bug+0x1f8/0x200
<4>[4.392116]  ? handle_bug+0x3c/0x70
<4>[4.392119]  ? exc_invalid_op+0x18/0x70
<4>[4.392123]  ? asm_exc_invalid_op+0x1a/0x20
<4>[4.392133]  ? thermal_zone_trip_id+0x61/0x70
<4>[4.392137]  ? thermal_zone_trip_id+0x5d/0x70
<4>[4.392141]  trip_point_show+0x18/0x40
<4>[4.392145]  dev_attr_show+0x15/0x60
<4>[4.392149]  sysfs_kf_seq_show+0xb5/0x100
<4>[4.392154]  seq_read_iter+0x111/0x450
<4>[4.392158]  ? check_object+0x133/0x320
<4>[4.392164]  vfs_read+0x20d/0x300
<4>[4.392175]  ksys_read+0x64/0xe0
<4>[4.392180]  do_syscall_64+0x3c/0x90
<4>[4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[4.392187] RIP: 0033:0x7f1f0e193392

Can you please check what could be the reason for this issue?
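The trace above shows that this second warning is reached from userspace: thermald issues an ordinary read() on a sysfs attribute, and the chain ksys_read -> vfs_read -> seq_read_iter -> sysfs_kf_seq_show -> dev_attr_show -> trip_point_show ends in thermal_zone_trip_id(), where the check fires. A minimal POSIX C sketch of that kind of read follows. The helper name read_attr_int() is hypothetical, and because the log does not say which attribute thermald read, the helper simply takes a path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one decimal integer from a sysfs-style
 * attribute file. Returns 0 on success or a negative errno value.
 */
static int read_attr_int(const char *path, long *out)
{
	char buf[32];
	char *end;
	ssize_t n;
	int fd, saved_errno;
	long val;

	assert(path != NULL && out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* On a real sysfs file, this read() is what reaches the attribute's
	 * show() callback inside the kernel. */
	n = read(fd, buf, sizeof(buf) - 1);
	saved_errno = errno;  /* close() may overwrite errno */
	close(fd);
	if (n < 0)
		return -saved_errno;
	buf[n] = '\0';

	errno = 0;
	val = strtol(buf, &end, 10);
	if (end == buf || errno != 0)
		return -EINVAL;

	*out = val;
	return 0;
}
```

For example, read_attr_int("/sys/class/thermal/thermal_zone0/temp", &v) reads a zone's current temperature on most Linux systems; that attribute is only an illustration and is not necessarily the one involved in this warning.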

[1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt

Regards

Chaitanya




> 
> 
> > From: Wysocki, Rafael J 
> > Sent: Saturday, October 7, 2023 2:01 AM
> > To: Borah, Chaitanya Kumar 
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > ; Saarinen, Jani
> > 
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> > On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael,
> >
> > Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
> > This mail is regarding a regression we are seeing in our CI runs [1] on the
> > linux-next repository.
> >
> > Thanks for the report, I think that this is a lockdep assertion failing.
> > If that is correct, it should be straightforward to fix.
> > I'll take care of this early next week.
> > Thanks!
> >
> > On next-20231003 [2], we are seeing the following error
> >
> > ```
> > <4>[   14.093075] ------------[ cut here ]------------
> > <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> > <4>[   14.106977] Modules linked in:
> > <4>[ 

Re: [Intel-gfx] Regression in linux-next

2023-10-09 Thread Wysocki, Rafael J

Hi,

On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:

Hello Rafael


Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

Thank you for your response.  Please let us know when a fix is available.


It should be fixed in linux-next from today, by this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b4027ce7714f309e96b804b7fb088a40d708

Thanks!



From: Wysocki, Rafael J 
Sent: Saturday, October 7, 2023 2:01 AM
To: Borah, Chaitanya Kumar 
Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar 
; Saarinen, Jani 
Subject: Re: Regression in linux-next

Hi,
On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
Hello Rafael,
  
Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.
  
Thanks for the report, I think that this is a lockdep assertion failing.

If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

On next-20231003 [2], we are seeing the following error
  
```

<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    W      6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
```
  
Detailed logs can be found in [3].
  
After bisecting the tree, the following patch [4] seems to be causing the regression.
  
commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200
  
     ACPI: thermal: Do not use trip indices for cooling device binding
  
     Rearrange the ACPI thermal driver's callback functions used for cooling
     device binding and unbinding, acpi_thermal_bind_cooling_device() and
     acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
     pointers instead of trip indices which is more straightforward and allows
     the driver to become independent of the ordering of trips in the thermal
     zone structure.
  
     The general functionality is not expected to be changed.
  
     Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
     Reviewed-by: Daniel Lezcano <daniel.lezc...@linaro.org>
  
We also verified this by moving the head of the tree back to the previous commit.
  
Could you please check why this patch causes the regression and if we can find a solution for it soon?
  
[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


Re: [Intel-gfx] Regression in linux-next

2023-10-08 Thread Borah, Chaitanya Kumar
Hello Rafael

>Thanks for the report, I think that this is a lockdep assertion failing.
>If that is correct, it should be straightforward to fix.
>I'll take care of this early next week.
>Thanks!

Thank you for your response.  Please let us know when a fix is available.

Regards

Chaitanya

From: Wysocki, Rafael J  
Sent: Saturday, October 7, 2023 2:01 AM
To: Borah, Chaitanya Kumar 
Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar 
; Saarinen, Jani 
Subject: Re: Regression in linux-next

Hi,
On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
Hello Rafael,
 
Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.
 
Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

On next-20231003 [2], we are seeing the following error
 
```
<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    W      6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
```
 
Detailed logs can be found in [3].
 
After bisecting the tree, the following patch [4] seems to be causing the 
regression.
 
commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200
 
    ACPI: thermal: Do not use trip indices for cooling device binding
 
    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.
 
    The general functionality is not expected to be changed.
 
    Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
    Reviewed-by: Daniel Lezcano <daniel.lezc...@linaro.org>
 
We also verified this by moving the head of the tree back to the previous commit.
 
Could you please check why this patch causes the regression and if we can find 
a solution for it soon?
 
[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


Re: [Intel-gfx] Regression in linux-next

2023-10-06 Thread Wysocki, Rafael J

Hi,

On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:


Hello Rafael,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.



Thanks for the report, I think that this is a lockdep assertion failing.

If that is correct, it should be straightforward to fix.

I'll take care of this early next week.

Thanks!
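The diagnosis here is that the WARNING at drivers/thermal/thermal_trip.c:18 comes from a lockdep assertion, i.e. a check (such as lockdep_assert_held()) that the caller already holds the thermal zone lock. Such a check never takes the lock itself; when it fails it prints a WARNING with a backtrace and execution simply continues. The userspace C sketch below models that behaviour with a per-thread list of held locks. All names in it (tracked_lock(), tracked_unlock(), assert_held(), struct zone, for_each_trip()) are hypothetical; it illustrates the idea and is not the kernel's lockdep implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_HELD 8

/* Per-thread record of currently held locks, loosely modelled on the
 * per-task held-lock stack that lockdep keeps. */
static _Thread_local const void *held_locks[MAX_HELD];
static _Thread_local int nr_held;

static int nr_warnings; /* number of failed assertions (demo bookkeeping) */

static void tracked_lock(pthread_mutex_t *m)
{
	pthread_mutex_lock(m);
	assert(nr_held < MAX_HELD); /* this toy tracker has a fixed depth */
	held_locks[nr_held++] = m;
}

static void tracked_unlock(pthread_mutex_t *m)
{
	for (int i = 0; i < nr_held; i++) {
		if (held_locks[i] == m) {
			/* Move the last entry into the freed slot. */
			held_locks[i] = held_locks[--nr_held];
			break;
		}
	}
	pthread_mutex_unlock(m);
}

static bool lock_is_held_by_me(const pthread_mutex_t *m)
{
	for (int i = 0; i < nr_held; i++)
		if (held_locks[i] == m)
			return true;
	return false;
}

/* Analogue of lockdep_assert_held(): complain, but let the caller go on. */
#define assert_held(m)							\
	do {								\
		if (!lock_is_held_by_me(m)) {				\
			nr_warnings++;					\
			fprintf(stderr, "WARNING: %s: lock not held\n",	\
				__func__);				\
		}							\
	} while (0)

/* Hypothetical, heavily simplified thermal zone. */
struct zone {
	pthread_mutex_t lock;
	int num_trips;
	int trip_temp[4]; /* millidegrees Celsius */
};

/* Iterator that, like for_each_thermal_trip(), expects z->lock held. */
static int for_each_trip(struct zone *z, int (*cb)(int temp, void *data),
			 void *data)
{
	assert_held(&z->lock);
	for (int i = 0; i < z->num_trips; i++) {
		int ret = cb(z->trip_temp[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count the trips visited. */
static int count_trip(int temp, void *data)
{
	(void)temp;
	(*(int *)data)++;
	return 0;
}
```

Calling for_each_trip() on an unlocked zone prints one WARNING line and increments nr_warnings, yet still visits every trip; wrapping the call in tracked_lock()/tracked_unlock() keeps it silent.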



On next-20231003 [2], we are seeing the following error

```

<4>[ 14.093075] ------------[ cut here ]------------
<4>[ 14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[ 14.106977] Modules linked in:
<4>[ 14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    W      6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[ 14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[ 14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[ 14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
```


Detailed logs can be found in [3].

After bisecting the tree, the following patch [4] seems to be causing 
the regression.


commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200

    ACPI: thermal: Do not use trip indices for cooling device binding

    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.

    The general functionality is not expected to be changed.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
    Reviewed-by: Daniel Lezcano <daniel.lezc...@linaro.org>

We also verified this by moving the head of the tree back to the previous commit.

Could you please check why this patch causes the regression and if we 
can find a solution for it soon?


[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb



[Intel-gfx] Regression in linux-next

2023-10-05 Thread Borah, Chaitanya Kumar
Hello Rafael,


Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.



On next-20231003 [2], we are seeing the following error


```
<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    W      6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
```



Detailed logs can be found in [3].



After bisecting the tree, the following patch [4] seems to be causing the 
regression.


commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200

    ACPI: thermal: Do not use trip indices for cooling device binding

    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.

    The general functionality is not expected to be changed.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
    Reviewed-by: Daniel Lezcano <daniel.lezc...@linaro.org>



We also verified this by moving the head of the tree back to the previous commit.



Could you please check why this patch causes the regression and if we can find 
a solution for it soon?
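The bisected commit switches the ACPI thermal driver's bind/unbind callbacks from trip indices to trip pointers. The C sketch below illustrates that idea with hypothetical, simplified types (struct trip, struct zone, struct binding) and a hypothetical helper zone_trip_id(): a binding remembers its trip by pointer, and an index is recovered by a linear scan only where one is still needed, returning -ENODATA for a pointer that does not belong to the zone. The -ENODATA convention is an assumption modelled on the kernel's thermal_zone_trip_id(), whose disassembly in the other warning reported in this thread loads -61 (`b8 c3 ff ff ff`); the kernel function also carries the lockdep assertion discussed in this thread, which the sketch leaves out.

```c
#include <assert.h>
#include <errno.h>

#define MAX_TRIPS 8

/* Hypothetical, simplified trip point. */
struct trip {
	int temperature; /* millidegrees Celsius */
	int type;        /* illustrative: 0 = passive, 1 = active, 2 = critical */
};

/* Hypothetical, simplified thermal zone owning a fixed-size trip table. */
struct zone {
	int num_trips;
	struct trip trips[MAX_TRIPS];
};

/* A cooling-device binding that records its trip by pointer, so the
 * binding code never needs to know where in the table the trip sits. */
struct binding {
	const struct trip *trip;
};

/*
 * Recover a trip's index from its pointer when an index is still needed
 * (for example for user-visible output). Returns -ENODATA if the pointer
 * does not belong to this zone.
 */
static int zone_trip_id(const struct zone *z, const struct trip *trip)
{
	assert(z->num_trips >= 0 && z->num_trips <= MAX_TRIPS);

	for (int i = 0; i < z->num_trips; i++)
		if (&z->trips[i] == trip)
			return i;

	return -ENODATA;
}
```

Equality comparison is used deliberately: comparing pointers to unrelated objects with `==` is well defined in C, whereas computing `trip - z->trips` for a foreign pointer would not be.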


[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


Re: [Intel-gfx] Regression in linux-next

2023-07-26 Thread Alistair Popple


Thanks Chaitanya for the detailed report. Dan Carpenter also reported a
Smatch warning for this:

https://lore.kernel.org/linux-mm/38ed0627-1283-4da2-827a-e90484d7bd7d@moroto.mountain/

The below should fix the problem, will respin the series to include the
fix.

---

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 63c8eb740af7..ec3b068cbbe6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -621,9 +621,10 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	 * Subsystems should only register for invalidate_secondary_tlbs() or
 	 * invalidate_range_start()/end() callbacks, not both.
 	 */
-	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
-			 (subscription->ops->invalidate_range_start ||
-			  subscription->ops->invalidate_range_end)))
+	if (WARN_ON_ONCE(subscription &&
+			 (subscription->ops->arch_invalidate_secondary_tlbs &&
+			 (subscription->ops->invalidate_range_start ||
+			  subscription->ops->invalidate_range_end))))
 		return -EINVAL;
 
if (!mm->notifier_subscriptions) {
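
For context on the crash in the quoted report below: mmu_interval_notifier_insert() (in the call trace) can call mmu_notifier_register() with a NULL subscription when it only needs the per-mm notifier state set up, and the sanity check then read subscription->ops through that NULL pointer. In struct mmu_notifier, ops follows a two-pointer struct hlist_node, i.e. it sits 0x10 bytes in on x86-64, which appears consistent with the faulting address reported as CR2: 0010. The userspace C sketch below models the guarded check. The types struct notifier, struct notifier_ops and struct list_node_model and the helpers check_subscription() and demo_callback() are hypothetical stand-ins that keep only the members the check looks at; this is not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subset of an ops table; only the checked callbacks. */
struct notifier_ops {
	void (*invalidate_range_start)(void);
	void (*invalidate_range_end)(void);
	void (*arch_invalidate_secondary_tlbs)(void);
};

/* Two pointers, like struct hlist_node. */
struct list_node_model {
	void *next;
	void **pprev;
};

struct notifier {
	struct list_node_model hlist;
	const struct notifier_ops *ops;
};

/* With this layout, ops is two pointers in: 0x10 bytes on x86-64. */
_Static_assert(offsetof(struct notifier, ops) == 2 * sizeof(void *),
	       "ops follows a two-pointer list node");

/* Placeholder callback used to populate ops tables in examples. */
static void demo_callback(void) { }

/*
 * The check with the NULL guard: a NULL subscription has nothing to
 * validate and is accepted instead of being dereferenced. Registering
 * both kinds of callbacks is rejected with -EINVAL.
 */
static int check_subscription(const struct notifier *subscription)
{
	assert(subscription == NULL || subscription->ops != NULL);

	if (subscription &&
	    subscription->ops->arch_invalidate_secondary_tlbs &&
	    (subscription->ops->invalidate_range_start ||
	     subscription->ops->invalidate_range_end))
		return -EINVAL;
	return 0;
}
```

Without the leading `subscription &&` term, check_subscription(NULL) would dereference a NULL pointer, which is the userspace analogue of the page fault in __mmu_notifier_register() shown below.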


"Borah, Chaitanya Kumar"  writes:

> Hello Alistair,
>
> Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
>
> This mail is regarding a regression we are seeing in our CI runs [1] on the
> linux-next repository.
>  
> On next-20230720 [2], we are seeing the following error
>
> <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client 
> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 
> 07/10/2023
> <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> <4>[ 76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30
> 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb
> 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f
> 0b b8
> <4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
> <4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 
> 0001
> <4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 
> 823ee12d
> <4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 
> 0001
> <4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12: 
> 
> <4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 
> 888102ec9ce0
> <4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() 
> knlGS:
> <4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
> <4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 
> 00f70ee0
> <4>[   76.287643] DR0:  DR1:  DR2: 
> 
> <4>[   76.294711] DR3:  DR6: 07f0 DR7: 
> 0400
> <4>[   76.301775] PKRU: 5554
> <4>[   76.304463] Call Trace:
> <4>[   76.306893]  
> <4>[   76.308983]  ? __die_body+0x1a/0x60
> <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> <4>[   76.366159]  drm_ioctl+0x21d/0x420
> <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> <4>[   76.382766]  do_syscall_64+0x3c/0x90
> <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
>
> Detailed logs can be found in [3].
>
> After bisecting the tree, the following patch seems to be causing the
> regression.
>
> commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> Author: Alistair Popple apop...@nvidia.com
> Date:   Wed Jul 19 22:18:46 2023 +1000
>
> mmu_notifiers: rename invalidate_range notifier
>
> There are two main use cases for mmu notifiers.  One is by KVM which uses
> mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
>
> The other is to manage hardware TLBs which need to use the
> invalidate_range() callback because HW can establish new TLB entries at
> any time.  Hence using start/end() can lead to memory corruption as these
> callbacks happen too soon/late during page unmap.
>
> mmu notifier users should therefore either use the start()/end() callbacks
> or the invalidate_range() callbacks.  To make this usage clearer rename

Re: [Intel-gfx] Regression in linux-next

2023-07-25 Thread Borah, Chaitanya Kumar
Hello Tvrtko,

Your analysis is correct. Alistair has sent a new patch set with a fix.

Thank you.

Regards

Chaitanya

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: Tuesday, July 25, 2023 4:24 PM
> To: Borah, Chaitanya Kumar ;
> apop...@nvidia.com
> Cc: Nikula, Jani ; intel-gfx@lists.freedesktop.org; 
> linux-
> ker...@vger.kernel.org; linux...@kvack.org; Kurmi, Suresh Kumar
> ; Yedireswarapu, SaiX Nandan
> 
> Subject: Re: [Intel-gfx] Regression in linux-next
> 
> 
> On 25/07/2023 07:42, Borah, Chaitanya Kumar wrote:
> > Hello Alistair,
> >
> > Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on
> > the linux-next repository.
> >
> > On next-20230720 [2], we are seeing the following error:
> >
> > <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
> > <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> > <4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
> > <4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
> > <4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 0001
> > <4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 823ee12d
> > <4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 0001
> > <4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12:
> > <4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 888102ec9ce0
> > <4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() knlGS:
> > <4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
> > <4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 00f70ee0
> > <4>[   76.287643] DR0:  DR1:  DR2:
> > <4>[   76.294711] DR3:  DR6: 07f0 DR7: 0400
> > <4>[   76.301775] PKRU: 5554
> > <4>[   76.304463] Call Trace:
> > <4>[   76.306893]  
> > <4>[   76.308983]  ? __die_body+0x1a/0x60
> > <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> > <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> > <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> > <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> > <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> > <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> > <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> > <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> > <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> > <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> > <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> > <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> > <4>[   76.366159]  drm_ioctl+0x21d/0x420
> > <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> > <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> > <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> > <4>[   76.382766]  do_syscall_64+0x3c/0x90
> > <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
> >
> > Detailed logs can be found in [3].
> >
> > After bisecting the tree, the following patch seems to be causing the
> > regression.
> >
> > commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> > Author: Alistair Popple apop...@nvidia.com
> > Date:   Wed Jul 19 22:18:46 2023 +1000
> >
> >  mmu_notifiers: rename invalidate_range notifier
> >
> >  There are two main use cases for mmu notifiers.  One is by KVM which uses
> >  mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
> >
> >  The other is to manage hardware TLBs which need to use the
> >  invalidate_range() callback because HW can establish new TLB entries at
> >  any time.  Hence using start/end() can lead to memory corruption as these
> >  callbacks happen too soon/late during page unmap.
> >

Re: [Intel-gfx] Regression in linux-next

2023-07-25 Thread Borah, Chaitanya Kumar
Hello Alistair,

Thank you for the quick fix.

Regards

Chaitanya

> -Original Message-
> From: Alistair Popple 
> Sent: Tuesday, July 25, 2023 6:45 PM
> To: Borah, Chaitanya Kumar 
> Cc: Yedireswarapu, SaiX Nandan ;
> Saarinen, Jani ; Kurmi, Suresh Kumar
> ; Nikula, Jani ; intel-
> g...@lists.freedesktop.org; linux-ker...@vger.kernel.org; linux-
> m...@kvack.org; dan.carpen...@linaro.org
> Subject: Re: Regression in linux-next
> 
> 
> Thanks Chaitanya for the detailed report. Dan Carpenter also reported a
> Smatch warning for this:
> 
> https://lore.kernel.org/linux-mm/38ed0627-1283-4da2-827a-e90484d7bd7d@moroto.mountain/
> 
> The below should fix the problem, will respin the series to include the fix.
> 
> ---
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 63c8eb740af7..ec3b068cbbe6 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -621,9 +621,10 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
>  	 * Subsystems should only register for invalidate_secondary_tlbs() or
>  	 * invalidate_range_start()/end() callbacks, not both.
>  	 */
> -	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
> -			(subscription->ops->invalidate_range_start ||
> -			subscription->ops->invalidate_range_end)))
> +	if (WARN_ON_ONCE(subscription &&
> +			 (subscription->ops->arch_invalidate_secondary_tlbs &&
> +			 (subscription->ops->invalidate_range_start ||
> +			  subscription->ops->invalidate_range_end))))
>  		return -EINVAL;
>  
>  	if (!mm->notifier_subscriptions) {
> 
> 
> "Borah, Chaitanya Kumar"  writes:
> 
> > Hello Alistair,
> >
> > Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on
> > the linux-next repository.
> >
> > On next-20230720 [2], we are seeing the following error:
> >
> > <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
> > <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> > <4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
> > <4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
> > <4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 0001
> > <4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 823ee12d
> > <4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 0001
> > <4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12:
> > <4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 888102ec9ce0
> > <4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() knlGS:
> > <4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
> > <4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 00f70ee0
> > <4>[   76.287643] DR0:  DR1:  DR2:
> > <4>[   76.294711] DR3:  DR6: 07f0 DR7: 0400
> > <4>[   76.301775] PKRU: 5554
> > <4>[   76.304463] Call Trace:
> > <4>[   76.306893]  
> > <4>[   76.308983]  ? __die_body+0x1a/0x60
> > <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> > <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> > <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> > <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> > <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> > <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> > <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> > <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> > <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> > <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> > <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> > <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> > <4>[   76.366159]  drm_ioctl+0x21d/0x420
> > <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> > <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> > <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> > <4>[   76.382766]  do_syscall_64+0x3c/0x90
> > <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
> >
> > Detailed logs can be found in [3].
> >
> > After bisecting the tree, the following patch seems to be causing the
> > regression.
> >
> > commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> > Author: Alistair Popple apop...@nvidia.com
> > Date:   Wed Jul 19 22:18:46 2023 +1000
> >
> > mmu_notifiers: rename invalidate_range notifier
> >
>

Re: [Intel-gfx] Regression in linux-next

2023-07-25 Thread Tvrtko Ursulin



On 25/07/2023 07:42, Borah, Chaitanya Kumar wrote:

Hello Alistair,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs[1] on the
linux-next repository.

On next-20230720 [2], we are seeing the following error:

<4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
<4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
<4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
<4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
<4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 0001
<4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 823ee12d
<4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 0001
<4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12:
<4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 888102ec9ce0
<4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() knlGS:
<4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
<4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 00f70ee0
<4>[   76.287643] DR0:  DR1:  DR2:
<4>[   76.294711] DR3:  DR6: 07f0 DR7: 0400
<4>[   76.301775] PKRU: 5554
<4>[   76.304463] Call Trace:
<4>[   76.306893]  
<4>[   76.308983]  ? __die_body+0x1a/0x60
<4>[   76.312444]  ? page_fault_oops+0x156/0x450
<4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
<4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
<4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
<4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
<4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
<4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
<4>[   76.342273]  mmu_notifier_register+0x30/0xe0
<4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
<4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
<4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
<4>[   76.366159]  drm_ioctl+0x21d/0x420
<4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.375242]  ? find_held_lock+0x2b/0x80
<4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
<4>[   76.382766]  do_syscall_64+0x3c/0x90
<4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[   76.391317] RIP: 0033:0x7fbcae63f3ab

Detailed logs can be found in [3].

After bisecting the tree, the following patch seems to be causing the
regression.

commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
Author: Alistair Popple apop...@nvidia.com
Date:   Wed Jul 19 22:18:46 2023 +1000

 mmu_notifiers: rename invalidate_range notifier

 There are two main use cases for mmu notifiers.  One is by KVM which uses
 mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

 The other is to manage hardware TLBs which need to use the
 invalidate_range() callback because HW can establish new TLB entries at
 any time.  Hence using start/end() can lead to memory corruption as these
 callbacks happen too soon/late during page unmap.

 mmu notifier users should therefore either use the start()/end() callbacks
 or the invalidate_range() callbacks.  To make this usage clearer rename
 the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
 update documention.

 Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apop...@nvidia.com


We also verified this by reverting the patch in the tree.

Could you please check why this patch causes the regression and if we can find
a solution for it soon?


Without checking out the whole tree, and only looking at this patch in
isolation, it could be that it does not consider that a NULL subscription can
be passed to mmu_notifier_register(), for instance from
mmu_interval_notifier_insert(), which i915 calls. If so, the check the patch
added to __mmu_notifier_register() causes a NULL pointer dereference:

@@ -616,6 +617,15 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	mmap_assert_write_locked(mm);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+	/*
+	 * Subsystems should only register for invalidate_secondary_tlbs() or
+	 * invalidate_range_start()/end() callbacks, not both.
+	 */
+	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&

---> subscription is NULL here <---

+				(subscription->ops->invalidate_range_start ||
+				subscription->ops->invalidate_range_end)))
+

[Intel-gfx] Regression in linux-next

2023-07-24 Thread Borah, Chaitanya Kumar
Hello Alistair,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs[1] on the
linux-next repository.

On next-20230720 [2], we are seeing the following error:

<4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
<4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
<4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
<4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
<4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 0001
<4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 823ee12d
<4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 0001
<4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12:
<4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 888102ec9ce0
<4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() knlGS:
<4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
<4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 00f70ee0
<4>[   76.287643] DR0:  DR1:  DR2:
<4>[   76.294711] DR3:  DR6: 07f0 DR7: 0400
<4>[   76.301775] PKRU: 5554
<4>[   76.304463] Call Trace:
<4>[   76.306893]  
<4>[   76.308983]  ? __die_body+0x1a/0x60
<4>[   76.312444]  ? page_fault_oops+0x156/0x450
<4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
<4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
<4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
<4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
<4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
<4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
<4>[   76.342273]  mmu_notifier_register+0x30/0xe0
<4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
<4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
<4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
<4>[   76.366159]  drm_ioctl+0x21d/0x420
<4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.375242]  ? find_held_lock+0x2b/0x80
<4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
<4>[   76.382766]  do_syscall_64+0x3c/0x90
<4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[   76.391317] RIP: 0033:0x7fbcae63f3ab

Detailed logs can be found in [3].

After bisecting the tree, the following patch seems to be causing the
regression.

commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
Author: Alistair Popple apop...@nvidia.com
Date:   Wed Jul 19 22:18:46 2023 +1000

mmu_notifiers: rename invalidate_range notifier

There are two main use cases for mmu notifiers.  One is by KVM which uses
mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time.  Hence using start/end() can lead to memory corruption as these
callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end() callbacks
or the invalidate_range() callbacks.  To make this usage clearer rename
the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
update documention.

Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apop...@nvidia.com


We also verified this by reverting the patch in the tree.

Could you please check why this patch causes the regression and if we can find
a solution for it soon?

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20230720
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20230720/bat-mtlp-6/dmesg0.txt


Re: [Intel-gfx] [regression in linux-next] i915: broken graphics on laptop

2015-02-04 Thread Chris Wilson
On Wed, Feb 04, 2015 at 09:26:27PM +0300, Andrey Skvortsov wrote:
> On Tue, Feb 03, 2015 at 08:21:52PM +, Chris Wilson wrote:
> > On Tue, Feb 03, 2015 at 10:15:47PM +0300, Andrey Skvortsov wrote:
> > > Hi,
> > > 
> > > I tested next-20150202. The system boots, but the graphics output is
> > > broken (empty black screen). I booted the same kernel five times and
> > > always got the same result. The system works with 3.19-rc7.
> > 
> > Those two warnings are more or less symptoms of the black screen (well,
> > the first is just overzealous). More important would be the drm.debug=6
> > dmesg from boot, along with gdm.log (or equivalent) and Xorg.0.log,
> > as my guess is that X (or the display server) is crashing.
> 
> The requested logs with drm.debug=6 are attached. lightdm was running after
> the WARN_ON, but I couldn't restart it; the command hung.
>
> When I booted next-20150202, the system crashed several times with a lot of
> drm_ calls in the backtrace, but I couldn't capture the kernel logs because
> the laptop has no serial port.
>
> If you need any other information or want me to test patches, I would be
> glad to help.

Right, here it looks like it is freezing in intel_get_load_detect_pipe()
during the initial configuration probe of X. Given the other crashes,
we're back to worrying about memory corruption.

> [   29.292333] [drm:intel_tv_detect] [CONNECTOR:33:SVIDEO-1] force=1
> [   29.292336] [drm:intel_get_load_detect_pipe] [CONNECTOR:33:SVIDEO-1], [ENCODER:34:TV-34]
> [   29.292339] [drm:intel_get_load_detect_pipe] creating tmp fb for load-detection
> [   29.292396] [drm:intel_modeset_affected_pipes] set mode pipe masks: modeset: 1, prepare: 1, disable: 0
> [   29.292408] [drm:connected_sink_compute_bpp] [CONNECTOR:33:SVIDEO-1] checking for sink bpp constrains
> [   29.292413] [drm:intel_tv_compute_config] forcing bpc to 8 for TV
> [   29.292416] [drm:intel_modeset_pipe_config] plane bpp: 24, pipe bpp: 24, dithering: 0
> [   29.292418] [drm:intel_dump_pipe_config] [CRTC:20][modeset] config for pipe A
> [   29.292419] [drm:intel_dump_pipe_config] cpu_transcoder: A
> [   29.292421] [drm:intel_dump_pipe_config] pipe bpp: 24, dithering: 0
> [   29.292423] [drm:intel_dump_pipe_config] fdi/pch: 0, lanes: 0, gmch_m: 0, gmch_n: 0, link_m: 0, link_n: 0, tu: 0
> [   29.292425] [drm:intel_dump_pipe_config] dp: 0, gmch_m: 0, gmch_n: 0, link_m: 0, link_n: 0, tu: 0
> [   29.292428] [drm:intel_dump_pipe_config] dp: 0, gmch_m2: 0, gmch_n2: 0, link_m2: 0, link_n2: 0, tu2: 0
> [   29.292429] [drm:intel_dump_pipe_config] audio: 0, infoframes: 0
> [   29.292431] [drm:intel_dump_pipe_config] requested mode:
> [   29.292433] [drm:drm_mode_debug_printmodeline] Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
> [   29.292435] [drm:intel_dump_pipe_config] adjusted mode:
> [   29.292438] [drm:drm_mode_debug_printmodeline] Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
> [   29.292440] [drm:intel_dump_crtc_timings] crtc timings: 108000 1280 1368 1496 1712 1024 1027 1034 1104, type: 0x40 flags: 0x0
> [   29.292442] [drm:intel_dump_pipe_config] port clock: 108000
> [   29.292444] [drm:intel_dump_pipe_config] pipe src size: 1280x1024
> [   29.292446] [drm:intel_dump_pipe_config] gmch pfit: control: 0x, ratios: 0x, lvds border: 0x
> [   29.292447] [drm:intel_dump_pipe_config] pch pfit: pos: 0x, size: 0x, disabled
> [   29.292449] [drm:intel_dump_pipe_config] ips: 0
> [   29.292451] [drm:intel_dump_pipe_config] double wide: 0
> [   29.292565] [ cut here ]
> [   29.293785] WARNING: CPU: 0 PID: 53 at include/linux/kref.h:47 drm_framebuffer_reference+0x5b/0x64 [drm]()
> [   29.295032] Modules linked in: bnep(E) cfg80211(E) cpufreq_stats(E) 
> cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_conservative(E) nfsd(E) 
> auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) cdc_wdm(E) cdc_acm(E) 
> cdc_ether(E) usbnet(E) joydev(E) coretemp(E) kvm_intel(E) kvm(E) i8k(E) 
> btusb(E) psmouse(E) snd_pcsp(E) i915(E) evdev(E) bluetooth(E) i2c_i801(E) 
> snd_hda_codec_generic(E) lpc_ich(E) mfd_core(E) xhci_pci(E) xhci_hcd(E) 
> serio_raw(E) rfkill(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) i2c_core(E) 
> snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) button(E) 
> snd_hwdep(E) battery(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) video(E) 
> ac(E) acpi_cpufreq(E) processor(E) fuse(E) parport_pc(E) ppdev(E) lp(E) 
> parport(E) autofs4(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) 
> ata_generic(E)
> [   29.295080]  ahci(E) libahci(E) ata_piix(E) libata(E) scsi_mod(E) b44(E) 
> firewire_ohci(E) sdhci_pci(E) sdhci(E) firewire_core(E) crc_itu_t(E) mii(E) 
> ssb(E) mmc_core(E) libphy(E) uhci_hcd(E) ehci_pci(E) ehci_hcd(E) thermal(E) 
> thermal_sys(E) usbcore(E) usb_common(E)
> [   29.296301] CPU: 0 PID: 53 Comm: kworker/0:3 Tainted: GW   E   3.19.0-rc6-next-2015

Re: [Intel-gfx] [regression in linux-next] i915: broken graphics on laptop

2015-02-03 Thread Chris Wilson
On Tue, Feb 03, 2015 at 10:15:47PM +0300, Andrey Skvortsov wrote:
> Hi,
> 
> I tested next-20150202. The system boots, but the graphics output is broken
> (empty black screen). I booted the same kernel five times and always got
> the same result. The system works with 3.19-rc7.

Those two warnings are more or less symptoms of the black screen (well,
the first is just overzealous). More important would be the drm.debug=6
dmesg from boot, along with gdm.log (or equivalent) and Xorg.0.log,
as my guess is that X (or the display server) is crashing.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [regression in linux-next] i915: broken graphics on laptop

2015-02-03 Thread Andrey Skvortsov
Hi,

I tested next-20150202. The system boots, but the graphics output is broken
(empty black screen). I booted the same kernel five times and always got the
same result. The system works with 3.19-rc7.

This is the first warning in the log: 

 WARNING: CPU: 0 PID: 855 at drivers/gpu/drm/i915/intel_uncore.c:169 intel_uncore_forcewake_reset+0x188/0x24d [i915]()
 WARN_ON(dev_priv->uncore.fw_domains == 0)
 Modules linked in: i915(E+) lpc_ich(E) mfd_core(E) snd_hda_controller(E) 
snd_hda_codec(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) drm_kms_helper(E) drm(E) 
battery(E) button(E) video(E) ac(E) snd(E) soundcore(E) i2c_algo_bit(E) 
i2c_core(E) acpi_cpufreq(E) processor(E) fuse(E) parport_pc(E) ppdev(E) lp(E) 
parport(E) autofs4(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) 
ata_generic(E) ahci(E) libahci(E) ata_piix(E) libata(E) scsi_mod(E) 
sdhci_pci(E) firewire_ohci(E) sdhci(E) b44(E) firewire_core(E) crc_itu_t(E) 
mii(E) ssb(E) mmc_core(E) libphy(E) ehci_pci(E) thermal(E) thermal_sys(E) 
uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E)
 CPU: 0 PID: 855 Comm: systemd-udevd Tainted: GE   3.19.0-rc6-next-20150202-150201- #4
 Hardware name: Dell Inc. Vostro 1500 /0NX907, BIOS A06 04/21/2008
   0009 813e790a 8800da083958
  8104178e a0601100 a0589bc4 8800da083988
  8800da6c00c8 8800da6c00c8  0246
 Call Trace:
  [] ? dump_stack+0x4a/0x74
  [] ? warn_slowpath_common+0x9d/0xb5
  [] ? intel_uncore_forcewake_reset+0x188/0x24d [i915]
  [] ? warn_slowpath_fmt+0x4a/0x4f
  [] ? intel_uncore_forcewake_reset+0x188/0x24d [i915]
  [] ? intel_uncore_init+0x1e4/0x4a8 [i915]
  [] ? i915_driver_load+0x58f/0xeda [i915]
  [] ? kobject_uevent_env+0x581/0x5d8
  [] ? kfree+0xa4/0x127
  [] ? kobject_uevent_env+0x581/0x5d8
  [] ? devtmpfs_create_node+0x102/0x117
  [] ? preempt_count_sub+0xab/0xca
  [] ? preempt_count_sub+0xab/0xca
  [] ? drm_dev_register+0x79/0xec [drm]
  [] ? drm_get_pci_dev+0xfc/0x1b7 [drm]
  [] ? pci_device_probe+0x74/0xd1
  [] ? driver_probe_device+0x2ff/0x2ff
  [] ? driver_probe_device+0x11c/0x2ff
  [] ? driver_probe_device+0x2ff/0x2ff
  [] ? __driver_attach+0x58/0x78
  [] ? bus_for_each_dev+0x53/0x84
  [] ? bus_add_driver+0x113/0x1f8
  [] ? driver_register+0x87/0xba
  [] ? 0xa0627000
  [] ? do_one_initcall+0xf7/0x18e
  [] ? kmem_cache_alloc_trace+0xd6/0xe8
  [] ? load_module+0x1c81/0x202e
  [] ? load_module+0x1cf9/0x202e
  [] ? mod_kobject_put+0x48/0x48
  [] ? copy_module_from_fd+0x8c/0xf5
  [] ? SyS_finit_module+0x82/0x9a
  [] ? system_call_fastpath+0x12/0x17


Other warnings are below:

 [   19.253096] WARNING: CPU: 1 PID: 746 at drivers/gpu/drm/i915/i915_gem.c:4525 i915_gem_free_object+0x134/0x272 [i915]()
 [   19.253098] WARN_ON(obj->frontbuffer_bits)
 [   19.253126] Modules linked in: snd_hda_intel(E+) i2c_i801(E+) i915(E+) 
lpc_ich(E) mfd_core(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) 
snd_pcm(E) snd_timer(E) drm_kms_helper(E) drm(E) battery(E) button(E) video(E) 
ac(E) snd(E) soundcore(E) i2c_algo_bit(E) i2c_core(E) acpi_cpufreq(E) 
processor(E) fuse(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) ext4(E) 
crc16(E) jbd2(E) mbcache(E) sd_mod(E) ata_generic(E) ahci(E) libahci(E) 
ata_piix(E) libata(E) scsi_mod(E) sdhci_pci(E) firewire_ohci(E) sdhci(E) b44(E) 
firewire_core(E) crc_itu_t(E) mii(E) ssb(E) mmc_core(E) libphy(E) ehci_pci(E) 
thermal(E) thermal_sys(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E)
 [   19.253129] CPU: 1 PID: 746 Comm: kworker/u4:5 Tainted: GW   E   3.19.0-rc6-next-20150202-150201- #4
 [   19.253130] Hardware name: Dell Inc. Vostro 1500 /0NX907, BIOS A06 04/21/2008
 [   19.253135] Workqueue: events_unbound async_run_entry_fn
 [   19.253137]   0009 813e790a 880037a9b9e8
 [   19.253139]  8104178e 8800da6c a056e22b 880196dda800
 [   19.253140]  880196e54000 8800da6c 880196e54040 880196e54040
 [   19.253141] Call Trace:
 [   19.253144]  [] ? dump_stack+0x4a/0x74
 [   19.253147]  [] ? warn_slowpath_common+0x9d/0xb5
 [   19.253173]  [] ? i915_gem_free_object+0x134/0x272 [i915]
 [   19.253176]  [] ? warn_slowpath_fmt+0x4a/0x4f
 [   19.253202]  [] ? i915_vma_unbind+0x18f/0x1cb [i915]
 [   19.253228]  [] ? i915_gem_free_object+0x134/0x272 [i915]
 [   19.253246]  [] ? drm_gem_object_release+0x3b/0x3b [drm]
 [   19.253277]  [] ? kref_sub.constprop.59+0x2f/0x38 [i915]
 [   19.253308]  [] ? intel_user_framebuffer_destroy+0x62/0x75 [i915]
 [   19.253321]  [] ? drm_framebuffer_unregister_private+0x37/0x37 [drm]
 [   19.25]  [] ? kref_sub.constprop.33+0x2f/0x38 [drm]
 [   19.253346]  [] ? drm_mode_set_config_internal+0xa6/0xd7 [drm]
 [   19.253355]  [] ? restore_fbdev_mode+0xad/0xc8 [drm_kms_helper]
 [   19.253361]  [] ? drm_fb_helper_restore_fbdev_mode_unlocked+0x24/0x5a [drm_kms_helper]
 [   19.253367]  [] ? drm_fb_