Re: linux-5.10.11 build failure

2021-02-01 Thread Chris Clayton
Hi Greg,

On 29/01/2021 15:14, Josh Poimboeuf wrote:
> On Fri, Jan 29, 2021 at 12:09:53PM +0100, Greg Kroah-Hartman wrote:
>> On Fri, Jan 29, 2021 at 11:03:26AM +0000, Chris Clayton wrote:
>>>
>>>
>>> On 29/01/2021 10:11, Greg Kroah-Hartman wrote:
>>>> On Thu, Jan 28, 2021 at 10:00:15AM -0600, Josh Poimboeuf wrote:
...
>>>>
>>>> It is in Linus's tree now :)
>>>>
>>>> Now grabbed.
>>>>
>>>
>>> Are you sure, Greg? I don't see the patch in Linus' tree at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. Nor do
>>> I see it in your stable queue at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/.
>>> For clarity, I've attached the patch which fixes the problem I reported
>>> and is currently sat in
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git. As I
>>> understand it, the patch is scheduled to be included in a pull request to
>>> Linus this weekend in time for -rc6.
>>>
>>> In fact, I did a pull from Linus' tree a few minutes ago and the build 
>>> failed in the way I reported in this thread. I
>>> added the patch and the build now succeeds.
>>
>> Ok, sorry, no, I grabbed 1d489151e9f9 ("objtool: Don't fail on missing
>> symbol table") which is what Josh asked me to take.  I got that confused
>> here.
> 
> I'm probably responsible for that confusion, I got mixed up myself.
> It'll be a good idea to take both anyway.
> 

The patch is now in Linus' tree at 5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae

Thanks.

Chris


Re: linux-5.10.11 build failure

2021-01-28 Thread Chris Clayton



On 28/01/2021 15:52, Josh Poimboeuf wrote:
> On Thu, Jan 28, 2021 at 11:24:47AM +0000, Thomas Backlund wrote:
>> Den 28.1.2021 kl. 12:05, skrev Chris Clayton:
>>>
>>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>>> Hi,
>>>>>
>>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>>
>>>>> ..
>>>>>
>>>>>   AS  arch/x86/entry/thunk_64.o
>>>>>   CC  arch/x86/entry/vsyscall/vsyscall_64.o
>>>>>   AS  arch/x86/realmode/rm/header.o
>>>>>   CC  arch/x86/mm/pat/set_memory.o
>>>>>   CC  arch/x86/events/amd/core.o
>>>>>   CC  arch/x86/kernel/fpu/init.o
>>>>>   CC  arch/x86/entry/vdso/vma.o
>>>>>   CC  kernel/sched/core.o
>>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>>
>>>>>   AS  arch/x86/realmode/rm/trampoline_64.o
>>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>>> make[2]: *** Waiting for unfinished jobs
>>>>>
>>>>> ..
>>>>>
>>>>> Compiler is latest snapshot of gcc-10.
>>>>>
>>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>>
>>>> Can you do 'git bisect' to track down the offending commit?
>>>>
>>>
>>> Sure, but I'll hold that request for a while. I updated to binutils-2.36 on 
>>> Monday and I'm pretty sure that is a feature
>>> of this build fail. I've reverted binutils to 2.35.1, and the build 
>>> succeeds. Updated to 2.36 again and, surprise,
>>> surprise, the kernel build fails again.
>>>
>>> I've had a glance at the binutils ML and there are all sorts of issues 
>>> being reported, but it's beyond my knowledge to
>>> assess if this build error is related to any of them.
>>>
>>> I'll stick with binutils-2.35.1 for the time being.
>>>
>>>> And what exact gcc version are you using?
>>>>
>>>
>>>   It's built from the 10-20210123 snapshot tarball.
>>>
>>> I can report this to the binutils folks, but might it be better if the 
>>> objtool maintainer looks at it first? The
>>> binutils change might just have opened the gate to a bug in objtool.
>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>>
>>>
>>
>>
>> AFAIK you need this in stable trees:
>>
>>  From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
>> From: Josh Poimboeuf 
>> Date: Thu, 14 Jan 2021 16:14:01 -0600
>> Subject: [PATCH] objtool: Don't fail on missing symbol table
> 
> Actually I think you need:
> 
>   5e6dca82bcaa ("x86/entry: Emit a symbol for register restoring thunk")
> 
> I submitted a patch to stable list a few days ago.
> 

Yes, that's what I concluded, Josh. 5.10.11 builds with that patch added, but
it's not in Linus's tree yet so, as I understand it, it is not yet a candidate
for stable.


> (Though it's possible you need both commits, I'm not sure if binutils
>  2.36 has the symbol stripping stuff)
> 


Re: linux-5.10.11 build failure

2021-01-28 Thread Chris Clayton



On 28/01/2021 14:41, Greg Kroah-Hartman wrote:
> On Thu, Jan 28, 2021 at 01:38:25PM +0000, Chris Clayton wrote:
>> Thanks, Thomas.
>>
>> On 28/01/2021 11:24, Thomas Backlund wrote:
>>> Den 28.1.2021 kl. 12:05, skrev Chris Clayton:
>>>>
>>>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>>>
>>>>>> ..
>>>>>>
>>>>>>   AS  arch/x86/entry/thunk_64.o
>>>>>>   CC  arch/x86/entry/vsyscall/vsyscall_64.o
>>>>>>   AS  arch/x86/realmode/rm/header.o
>>>>>>   CC  arch/x86/mm/pat/set_memory.o
>>>>>>   CC  arch/x86/events/amd/core.o
>>>>>>   CC  arch/x86/kernel/fpu/init.o
>>>>>>   CC  arch/x86/entry/vdso/vma.o
>>>>>>   CC  kernel/sched/core.o
>>>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>>>
>>>>>>   AS  arch/x86/realmode/rm/trampoline_64.o
>>>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>>>> make[2]: *** Waiting for unfinished jobs
>>>>>>
>>>>>> ..
>>>>>>
>>>>>> Compiler is latest snapshot of gcc-10.
>>>>>>
>>>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>>>
>>>>> Can you do 'git bisect' to track down the offending commit?
>>>>>
>>>>
>>>> Sure, but I'll hold that request for a while. I updated to binutils-2.36 
>>>> on Monday and I'm pretty sure that is a feature
>>>> of this build fail. I've reverted binutils to 2.35.1, and the build 
>>>> succeeds. Updated to 2.36 again and, surprise,
>>>> surprise, the kernel build fails again.
>>>>
>>>> I've had a glance at the binutils ML and there are all sorts of issues 
>>>> being reported, but it's beyond my knowledge to
>>>> assess if this build error is related to any of them.
>>>>
>>>> I'll stick with binutils-2.35.1 for the time being.
>>>>
>>>>> And what exact gcc version are you using?
>>>>>
>>>>
>>>>   It's built from the 10-20210123 snapshot tarball.
>>>>
>>>> I can report this to the binutils folks, but might it be better if the 
>>>> objtool maintainer looks at it first? The
>>>> binutils change might just have opened the gate to a bug in objtool.
>>>>
>>>>> thanks,
>>>>>
>>>>> greg k-h
>>>>>
>>>>
>>>
>>>
>>> AFAIK you need this in stable trees:
>>>
>>>  From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
>>> From: Josh Poimboeuf 
>>> Date: Thu, 14 Jan 2021 16:14:01 -0600
>>> Subject: [PATCH] objtool: Don't fail on missing symbol table
>>>
>>>
>>
>> That may be the case, but it doesn't fix the build failure I've reported in
>> this thread. However, as suggested by Tor,
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae
>> does fix it.
>>
>> That hasn't made Linus' tree yet and I don't see a pull request, but it is 
>> in linux-next so I guess it could make it in
>> -rc6.
> 
> Ok, thanks, so this is not a new regression for 5.10.y.
> 

That seems to be the case, Greg. Neither 5.10.10 nor 5.10.9 builds.

> greg k-h
> 


Re: linux-5.10.11 build failure

2021-01-28 Thread Chris Clayton
Thanks, Thomas.

On 28/01/2021 11:24, Thomas Backlund wrote:
> Den 28.1.2021 kl. 12:05, skrev Chris Clayton:
>>
>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>> Hi,
>>>>
>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>
>>>> ..
>>>>
>>>>   AS  arch/x86/entry/thunk_64.o
>>>>   CC  arch/x86/entry/vsyscall/vsyscall_64.o
>>>>   AS  arch/x86/realmode/rm/header.o
>>>>   CC  arch/x86/mm/pat/set_memory.o
>>>>   CC  arch/x86/events/amd/core.o
>>>>   CC  arch/x86/kernel/fpu/init.o
>>>>   CC  arch/x86/entry/vdso/vma.o
>>>>   CC  kernel/sched/core.o
>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>
>>>>   AS  arch/x86/realmode/rm/trampoline_64.o
>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>> make[2]: *** Waiting for unfinished jobs
>>>>
>>>> ..
>>>>
>>>> Compiler is latest snapshot of gcc-10.
>>>>
>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>
>>> Can you do 'git bisect' to track down the offending commit?
>>>
>>
>> Sure, but I'll hold that request for a while. I updated to binutils-2.36 on 
>> Monday and I'm pretty sure that is a feature
>> of this build fail. I've reverted binutils to 2.35.1, and the build 
>> succeeds. Updated to 2.36 again and, surprise,
>> surprise, the kernel build fails again.
>>
>> I've had a glance at the binutils ML and there are all sorts of issues being 
>> reported, but it's beyond my knowledge to
>> assess if this build error is related to any of them.
>>
>> I'll stick with binutils-2.35.1 for the time being.
>>
>>> And what exact gcc version are you using?
>>>
>>
>>   It's built from the 10-20210123 snapshot tarball.
>>
>> I can report this to the binutils folks, but might it be better if the 
>> objtool maintainer looks at it first? The
>> binutils change might just have opened the gate to a bug in objtool.
>>
>>> thanks,
>>>
>>> greg k-h
>>>
>>
> 
> 
> AFAIK you need this in stable trees:
> 
>  From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
> From: Josh Poimboeuf 
> Date: Thu, 14 Jan 2021 16:14:01 -0600
> Subject: [PATCH] objtool: Don't fail on missing symbol table
> 
> 

That may be the case, but it doesn't fix the build failure I've reported in
this thread. However, as suggested by Tor,
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae
does fix it.

That hasn't made Linus' tree yet and I don't see a pull request, but it is in 
linux-next so I guess it could make it in
-rc6.

Chris
> --
> Thomas
> 


Re: linux-5.10.11 build failure

2021-01-28 Thread Chris Clayton


On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>> Hi,
>>
>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>
>> ..
>>
>>  AS  arch/x86/entry/thunk_64.o
>>   CC  arch/x86/entry/vsyscall/vsyscall_64.o
>>   AS  arch/x86/realmode/rm/header.o
>>   CC  arch/x86/mm/pat/set_memory.o
>>   CC  arch/x86/events/amd/core.o
>>   CC  arch/x86/kernel/fpu/init.o
>>   CC  arch/x86/entry/vdso/vma.o
>>   CC  kernel/sched/core.o
>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>
>>   AS  arch/x86/realmode/rm/trampoline_64.o
>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>> make[2]: *** Waiting for unfinished jobs
>>
>> ..
>>
>> Compiler is latest snapshot of gcc-10.
>>
>> Happy to test the fix but please cc me as I'm not subscribed
> 
> Can you do 'git bisect' to track down the offending commit?
> 

Sure, but I'll hold that request for a while. I updated to binutils-2.36 on 
Monday and I'm pretty sure that is a feature
of this build fail. I've reverted binutils to 2.35.1, and the build succeeds. 
Updated to 2.36 again and, surprise,
surprise, the kernel build fails again.

I've had a glance at the binutils ML and there are all sorts of issues being 
reported, but it's beyond my knowledge to
assess if this build error is related to any of them.

I'll stick with binutils-2.35.1 for the time being.

> And what exact gcc version are you using?
>

 It's built from the 10-20210123 snapshot tarball.

I can report this to the binutils folks, but might it be better if the objtool 
maintainer looks at it first? The
binutils change might just have opened the gate to a bug in objtool.

> thanks,
> 
> greg k-h
> 

Thanks.

Chris


linux-5.10.11 build failure

2021-01-28 Thread Chris Clayton
Hi,

Building 5.10.11 fails on my (x86-64) laptop thusly:

..

  AS  arch/x86/entry/thunk_64.o
  CC  arch/x86/entry/vsyscall/vsyscall_64.o
  AS  arch/x86/realmode/rm/header.o
  CC  arch/x86/mm/pat/set_memory.o
  CC  arch/x86/events/amd/core.o
  CC  arch/x86/kernel/fpu/init.o
  CC  arch/x86/entry/vdso/vma.o
  CC  kernel/sched/core.o
arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e

  AS  arch/x86/realmode/rm/trampoline_64.o
make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
make[2]: *** Waiting for unfinished jobs

..

Compiler is latest snapshot of gcc-10.

Happy to test the fix but please cc me as I'm not subscribed


Thanks,

Chris
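
A note for readers meeting this warning for the first time. The message says
objtool could not associate the instruction at offset 0x3e with any symbol.
The sketch below is a toy model of that kind of lookup, not objtool's code:
struct toy_symbol, toy_find_symbol_containing() and the offsets in the usage
note are all invented for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified symbol record; NOT objtool's real data structure. */
struct toy_symbol {
	const char *name;
	unsigned long start;	/* offset of the first byte within the section */
	unsigned long size;	/* number of bytes the symbol covers */
};

/*
 * Return the symbol whose [start, start + size) range contains offset,
 * or NULL when no symbol covers that offset.
 */
static const struct toy_symbol *
toy_find_symbol_containing(const struct toy_symbol *syms, size_t nsyms,
			   unsigned long offset)
{
	for (size_t i = 0; i < nsyms; i++) {
		if (offset >= syms[i].start &&
		    offset - syms[i].start < syms[i].size)
			return &syms[i];
	}
	return NULL;
}
```

With a single made-up symbol covering bytes 0x00 to 0x2f, a lookup for 0x10
returns that symbol and a lookup for 0x3e returns NULL, which is the situation
the warning describes. That fits the fix named elsewhere in this thread,
5e6dca82bcaa ("x86/entry: Emit a symbol for register restoring thunk").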


Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()

2020-10-15 Thread Chris Clayton
Hi Greg,

On 18/09/2020 15:35, Chris Clayton wrote:
> Mmm, gmail on android seems to have snuck some html into my reply, so here 
> goes again...
> 
> On 14/09/2020 16:58, Greg KH wrote:
>> On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote:
>>> Hi Greg and Arnd,
>>>
>>> On 24/08/2020 04:00, ricky...@realtek.com wrote:
>>>> From: Ricky Wu 
>>>>
>>>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform
>>>> missing card reader
>>>>
>>>
>>> In his changelog above, Ricky didn't mention that this patch fixes a 
>>> regression that was introduced (in 5.1) by commit
>>> bede03a579b3.
>>>
>>> The patch that I posted to LKML contained the appropriate Fixes, etc tags. 
>>> After discussion, the patch was changed to
>>> remove the code that effectively disables the RTS5229 cardreader on (at 
>>> least some) Intel NUC boxes. I prepared the
>>> patch that Ricky submitted but he didn't include my Signed-off-by or the 
>>> Fixes tag. I think the following needs to be
>>> added to the changelog.
>>>
>>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
>>> Link: https://marc.info/?l=linux-kernel&m=159105912832257
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
>>> Signed-off-by: Chris Clayton
>>>
>>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card
>>> reader disabled on the Intel NUC6CAYH box.
>>>
>>> My main point, however, is that the patch is also needed in the 5.4 
>>> (longterm) and 5.8 (stable) series kernels.
>>
>> It's too late to change the commit log now that it is in my tree, but
>> once it hits Linus's tree for 5.9-rc1, I can backport it to those stable
>> trees if someone reminds me :)
>>

This is the reminder you suggested. The patch is now in Linus's tree and the 
commit id is
551b6729578a8981c46af964c10bf7d5d9ddca83.

Chris
> 
> Thanks, Greg. I'll send the reminder.
> 
> Chris
>> thanks,
>>
>> greg k-h
>>


Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()

2020-09-18 Thread Chris Clayton
Mmm, gmail on android seems to have snuck some html into my reply, so here goes 
again...

On 14/09/2020 16:58, Greg KH wrote:
> On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote:
>> Hi Greg and Arnd,
>>
>> On 24/08/2020 04:00, ricky...@realtek.com wrote:
>>> From: Ricky Wu 
>>>
>>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform
>>> missing card reader
>>>
>>
>> In his changelog above, Ricky didn't mention that this patch fixes a 
>> regression that was introduced (in 5.1) by commit
>> bede03a579b3.
>>
>> The patch that I posted to LKML contained the appropriate Fixes, etc tags. 
>> After discussion, the patch was changed to
>> remove the code that effectively disables the RTS5229 cardreader on (at 
>> least some) Intel NUC boxes. I prepared the
>> patch that Ricky submitted but he didn't include my Signed-off-by or the 
>> Fixes tag. I think the following needs to be
>> added to the changelog.
>>
>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
>> Link: https://marc.info/?l=linux-kernel&m=159105912832257
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
>> Signed-off-by: Chris Clayton
>>
>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card
>> reader disabled on the Intel NUC6CAYH box.
>>
>> My main point, however, is that the patch is also needed in the 5.4 
>> (longterm) and 5.8 (stable) series kernels.
> 
> It's too late to change the commit log now that it is in my tree, but
> once it hits Linus's tree for 5.9-rc1, I can backport it to those stable
> trees if someone reminds me :)
> 

Thanks, Greg. I'll send the reminder.

Chris
> thanks,
> 
> greg k-h
> 


Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()

2020-09-14 Thread Chris Clayton
Thanks Bjorn.

On 13/09/2020 17:49, Bjorn Helgaas wrote:
> On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote:
>> Hi Greg and Arnd,
>>
>> On 24/08/2020 04:00, ricky...@realtek.com wrote:
>>> From: Ricky Wu 
>>>
>>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform
>>> missing card reader
>>
>> In his changelog above, Ricky didn't mention that this patch fixes a
>> regression that was introduced (in 5.1) by commit bede03a579b3.
>>
>> The patch that I posted to LKML contained the appropriate Fixes, etc
>> tags. After discussion, the patch was changed to remove the code
>> that effectively disables the RTS5229 cardreader on (at least some)
>> Intel NUC boxes. I prepared the patch that Ricky submitted but he
>> didn't include my Signed-off-by or the Fixes tag. I think the
>> following needs to be added to the changelog.
>>
>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
>> Link: https://marc.info/?l=linux-kernel&m=159105912832257
> 
> Better lore link:
> 
>   Link: 
> https://lore.kernel.org/r/CACYmiSer8FA+qjh8NHZJ2maxSd-=rwddz2f7_-e4um1nxuz...@mail.gmail.com/
> 
> But I'm not sure the above is the most relevant.  Seems like the one
> below is more to the point since it has the exact patch below and is
> part of a thread developing it:
> 
>   Link: 
> https://lore.kernel.org/r/20da8b4b-8426-9568-c0f1-4d1c2006c...@googlemail.com/
> 

Yes, I meant to change the quote to that thread but ... more haste less speed.

>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
>> Signed-off-by: Chris Clayton 
>>
>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express
>> card reader disabled on the Intel NUC6CAYH box.
>>
>> My main point, however, is that the patch is also needed in the 5.4
>> (longterm) and 5.8 (stable) series kernels.
> 
> This would be accomplished by:
> 
> Cc: sta...@vger.kernel.org# v5.1+
> 

Thanks for the tip.

I'm about to set off on a 4-day journey, so I'll send an updated patch when I
return on Friday.


>>> Signed-off-by: Ricky Wu 
>>> ---
>>>  drivers/misc/cardreader/rtsx_pcr.c | 4 
>>>  1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c 
>>> b/drivers/misc/cardreader/rtsx_pcr.c
>>> index 37ccc67f4914..3a4a7b0cc098 100644
>>> --- a/drivers/misc/cardreader/rtsx_pcr.c
>>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
>>> @@ -1155,10 +1155,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *pcr)
>>>  			rtsx_pci_write_register(pcr, REG_OCPGLITCH,
>>>  				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
>>>  			rtsx_pci_enable_ocp(pcr);
>>> -		} else {
>>> -			/* OC power down */
>>> -			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
>>> -				OC_POWER_DOWN);
>>>  		}
>>>  	}
>>>  }
>>>


Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()

2020-09-13 Thread Chris Clayton
Hi Greg and Arnd,

On 24/08/2020 04:00, ricky...@realtek.com wrote:
> From: Ricky Wu 
> 
> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform
> missing card reader
> 

In his changelog above, Ricky didn't mention that this patch fixes a regression
that was introduced (in 5.1) by commit bede03a579b3.

The patch that I posted to LKML contained the appropriate Fixes, etc tags.
After discussion, the patch was changed to remove the code that effectively
disables the RTS5229 cardreader on (at least some) Intel NUC boxes. I prepared
the patch that Ricky submitted but he didn't include my Signed-off-by or the
Fixes tag. I think the following needs to be added to the changelog.

Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
Link: https://marc.info/?l=linux-kernel&m=159105912832257
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
Signed-off-by: Chris Clayton

bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card reader
disabled on the Intel NUC6CAYH box.

My main point, however, is that the patch is also needed in the 5.4 (longterm)
and 5.8 (stable) series kernels.

Thanks.


> Signed-off-by: Ricky Wu 
> ---
>  drivers/misc/cardreader/rtsx_pcr.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/misc/cardreader/rtsx_pcr.c 
> b/drivers/misc/cardreader/rtsx_pcr.c
> index 37ccc67f4914..3a4a7b0cc098 100644
> --- a/drivers/misc/cardreader/rtsx_pcr.c
> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> @@ -1155,10 +1155,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *pcr)
>  			rtsx_pci_write_register(pcr, REG_OCPGLITCH,
>  				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
>  			rtsx_pci_enable_ocp(pcr);
> -		} else {
> -			/* OC power down */
> -			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
> -				OC_POWER_DOWN);
>  		}
>  	}
>  }
> 
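
To make the effect of the four deleted lines concrete, here is a minimal
sketch of the branch the diff touches. It is a toy model under stated
assumptions: struct toy_pcr and both functions are hypothetical, register
writes are reduced to boolean flags, and only the if/else visible in the diff
context is modelled.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the driver state; the real struct rtsx_pcr
 * is far larger and none of these field names come from the driver.
 */
struct toy_pcr {
	bool ocp_en;		/* stands in for option->ocp_en */
	bool ocp_enabled;	/* set when over-current protection is turned on */
	bool oc_powered_down;	/* set when OC_POWER_DOWN would be written to FPDCTL */
};

/* Before the patch: a reader without ocp_en gets the OC power down write. */
static void toy_init_ocp_before(struct toy_pcr *pcr)
{
	if (pcr->ocp_en)
		pcr->ocp_enabled = true;
	else
		pcr->oc_powered_down = true;	/* the branch the patch deletes */
}

/* After the patch: a reader without ocp_en is left untouched. */
static void toy_init_ocp_after(struct toy_pcr *pcr)
{
	if (pcr->ocp_en)
		pcr->ocp_enabled = true;
}
```

In this model a reader with ocp_en unset, as the thread implies for the
rts5229, ends up powered down before the patch and unchanged after it, while a
reader with ocp_en set behaves the same either way.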


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-22 Thread Chris Clayton



Hi Ricky.

On 05/08/2020 13:48, Chris Clayton wrote:
> Hi Ricky

>>
>> Ah, OK. I'll prepare the patch and send it to you once I've tested it.
>>
> 
> Please see the patch included below. As you suggested, it removes the code 
> that does the OC power down on card readers
> that are not members of your A series. The patch is against a fresh pull of 
> Linus's tree this morning
> (v5.8-2768-g4da9f3302615).
> 
> I've tested the resultant kernel on my Intel NUC6CAYH box (which contains an 
> NUC66AYB board) and the rts5229 works fine.
> I've also tested it on my laptop which also has a card reader supported by 
> the rtsx_pci driver (an RTL8411B) and that
> also works fine.
> 
> As I mentioned yesterday, I think it's a candidate for the 5.4 and 5.7 stable
> series.
> 
> Thanks for your patience!
> 
> Chris
> 
> Signed-off-by: Chris Clayton 
> 
> --- a/drivers/misc/cardreader/rtsx_pcr.c	2020-08-05 07:10:21.752072515 +0100
> +++ b/drivers/misc/cardreader/rtsx_pcr.c	2020-08-05 07:11:05.568074278 +0100
> @@ -1172,10 +1172,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *
>  			rtsx_pci_write_register(pcr, REG_OCPGLITCH,
>  				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
>  			rtsx_pci_enable_ocp(pcr);
> -		} else {
> -			/* OC power down */
> -			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
> -				OC_POWER_DOWN);
>  		}
>  	}
>  }
> 
> 

Is there some problem with my patch? If you are too busy to deal with it,
perhaps I can submit it directly to Greg/Arnd.

Thanks

Chris


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-05 Thread Chris Clayton
Hi Ricky

On 05/08/2020 06:51, Chris Clayton wrote:
> Thanks, Ricky.
> 
> On 05/08/2020 03:35, 吳昊澄 Ricky wrote:
>>
>>
>>> -----Original Message-----
>>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>>> Sent: Tuesday, August 04, 2020 7:52 PM
>>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org
>>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann
>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader 
>>> on
>>> Intel NUC boxes
>>>
>>>
>>>
>>> On 04/08/2020 11:46, 吳昊澄 Ricky wrote:
>>>>> -----Original Message-----
>>>>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>>>>> Sent: Tuesday, August 04, 2020 4:51 PM
>>>>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org
>>>>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann
>>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card 
>>>>> reader on
>>>>> Intel NUC boxes
>>>>>
>>>>>
>>>>>
>>>>> On 04/08/2020 09:08, 吳昊澄 Ricky wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org]
>>>>>>> Sent: Tuesday, August 04, 2020 3:49 PM
>>>>>>> To: 吳昊澄 Ricky
>>>>>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com;
>>>>> Arnd
>>>>>>> Bergmann
>>>>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card 
>>>>>>> reader
>>> on
>>>>>>> Intel NUC boxes
>>>>>>>
>>>>>>> On Tue, Aug 04, 2020 at 02:44:41AM +0000, 吳昊澄 Ricky wrote:
>>>>>>>> Hi Chris,
>>>>>>>>
>>>>>>>> rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, OC_POWER_DOWN);
>>>>>>>> This register operation saves less than 1mA of power, so if we do
>>>>>>>> not care about that 1mA we can have a patch to remove it and make
>>>>>>>> the driver compatible with NUC6.
>>>>>>>> We tested our other card readers with it removed and did not see
>>>>>>>> any side effects.
>>>>>>>>
>>>>>>>> Hi Greg k-h,
>>>>>>>>
>>>>>>>> Do you have any comments?
>>>>>>>
>>>>>>> comments on what?  I don't know what you are responding to here, sorry.
>>>>>>>
>>>>>> Can we have a patch in the kernel for NUC6? It may use more power (1mA),
>>>>>> but it will have more compatibility.
>>>>>>
>>>>>
>>>>> Ricky,
>>>>>
>>>>> I don't understand why you want to completely remove the code that sets
>>>>> up the 1mA power saving. That code was there even before your patch
>>>>> (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I assume it benefits some
>>>>> of the Realtek card readers. Before your patch however,
>>>>> rtsx_pci_init_ocp() was not called for the rts5229 reader, but the patch
>>>>> introduced an unconditional call to that function into
>>>>> rtsx_pci_init_hw(), which is run for the rts5229. That is what now
>>>>> disables the card reader.
>>>>>
>>>>> Now, I don't know whether other cards are affected, although I don't
>>>>> recall seeing any reported as I searched the kernel and ubuntu bugzillas
>>>>> for any analysis of the problem. I know this is not what the patch I sent
>>>>> does, but having thought about it more, seems to me that the simplest fix
>>>>> is to skip the new call to rtsx_pci_init_ocp() if the reader is an
>>>>> rts5229.
>>>>>
>>>>
>>>> Because we are wondering whether our other card readers that do not belong
>>>> to the A series (which my OCP patch covers) might have the same problem on
>>>> the NUC6 platform...
>>>>
>>>
>>> OK. What if we do make the new call but only for the card readers that are 
>>> in the
>>> A series? Are they the ones that have
>>> PID_5n

Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-04 Thread Chris Clayton
Thanks, Ricky.

On 05/08/2020 03:35, 吳昊澄 Ricky wrote:
> 
> 
>> -----Original Message-----
>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>> Sent: Tuesday, August 04, 2020 7:52 PM
>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org
>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann
>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader 
>> on
>> Intel NUC boxes
>>
>>
>>
>> On 04/08/2020 11:46, 吳昊澄 Ricky wrote:
>>>> -----Original Message-----
>>>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>>>> Sent: Tuesday, August 04, 2020 4:51 PM
>>>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org
>>>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann
>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card 
>>>> reader on
>>>> Intel NUC boxes
>>>>
>>>>
>>>>
>>>> On 04/08/2020 09:08, 吳昊澄 Ricky wrote:
>>>>>> -----Original Message-----
>>>>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org]
>>>>>> Sent: Tuesday, August 04, 2020 3:49 PM
>>>>>> To: 吳昊澄 Ricky
>>>>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com;
>>>> Arnd
>>>>>> Bergmann
>>>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card 
>>>>>> reader
>> on
>>>>>> Intel NUC boxes
>>>>>>
>>>>>> On Tue, Aug 04, 2020 at 02:44:41AM +0000, 吳昊澄 Ricky wrote:
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>> rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, OC_POWER_DOWN);
>>>>>>> This register operation saves less than 1mA of power, so if we do
>>>>>>> not care about that 1mA we can have a patch to remove it and make
>>>>>>> the driver compatible with NUC6.
>>>>>>> We tested our other card readers with it removed and did not see
>>>>>>> any side effects.
>>>>>>>
>>>>>>> Hi Greg k-h,
>>>>>>>
>>>>>>> Do you have any comments?
>>>>>>
>>>>>> comments on what?  I don't know what you are responding to here, sorry.
>>>>>>
>>>>> Can we have a patch in the kernel for NUC6? It may use more power (1mA),
>>>>> but it will have more compatibility.
>>>>>
>>>>
>>>> Ricky,
>>>>
>>>> I don't understand why you want to completely remove the code that sets
>>>> up the 1mA power saving. That code was there even before your patch
>>>> (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I assume it benefits some
>>>> of the Realtek card readers. Before your patch however,
>>>> rtsx_pci_init_ocp() was not called for the rts5229 reader, but the patch
>>>> introduced an unconditional call to that function into
>>>> rtsx_pci_init_hw(), which is run for the rts5229. That is what now
>>>> disables the card reader.
>>>>
>>>> Now, I don't know whether other cards are affected, although I don't
>>>> recall seeing any reported as I searched the kernel and ubuntu bugzillas
>>>> for any analysis of the problem. I know this is not what the patch I sent
>>>> does, but having thought about it more, seems to me that the simplest fix
>>>> is to skip the new call to rtsx_pci_init_ocp() if the reader is an
>>>> rts5229.
>>>>
>>>
>>> Because we are wondering whether our other card readers that do not belong
>>> to the A series (which my OCP patch covers) might have the same problem on
>>> the NUC6 platform...
>>>
>>
>> OK. What if we do make the new call but only for the card readers that are
>> in the A series? Are they the ones that have PID_5nnn defines in
>> include/linux/rtsx_pci.h? Or is there another simple way of identifying that
>> a reader is a member of the A series?
>>
>> I'm thinking of something like:
>> static bool rtsx_pci_is_series_A(struct rtsx_pcr *pcr)
>> {
>> 	unsigned short device = pcr->pci->device;
>>
>> 	/* true only for the device IDs treated as A series here */
>> 	return device == PID_524A || device == PID_5249 || device == PID_5250 ||
>> 	       device == PID_525A || device == PID_5260 || device == PID_5261;
>> }
>>
>> then in rtsx_pci_init_hw() change the unconditional call to:
>>
>> 	if (rtsx_pci_is_series_A(pcr))
>> 		rtsx_pci_init_ocp(pcr);
>> Does that seem OK?
>>
> Previously, I wanted to remove
> else {
>   /* OC power down */
>   rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
>   OC_POWER_DOWN);
> }
> because our A-series card readers already set option->ocp_en to 1 in their
> own init_params(), so this is an easy way to fix this problem.
> 

Ah, OK. I'll prepare the patch and send it to you once I've tested it.

Chris
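
The series-A check proposed in the quoted text could also be written as a
table lookup. This is a sketch of the alternative that was discussed, not the
fix that was taken (the thread settled on removing the else branch instead).
The toy_ names are hypothetical, and each PID_ value is assumed to equal the
device ID its name suggests; include/linux/rtsx_pci.h is the authority.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, chosen to match the names; verify against rtsx_pci.h. */
#define PID_5249 0x5249
#define PID_524A 0x524A
#define PID_5250 0x5250
#define PID_525A 0x525A
#define PID_5260 0x5260
#define PID_5261 0x5261

/* Hypothetical table of the device IDs named in the proposal. */
static const unsigned short toy_series_a_ids[] = {
	PID_524A, PID_5249, PID_5250, PID_525A, PID_5260, PID_5261,
};

/* Return true when device appears in the table above. */
static bool toy_is_series_a(unsigned short device)
{
	size_t n = sizeof(toy_series_a_ids) / sizeof(toy_series_a_ids[0]);

	for (size_t i = 0; i < n; i++) {
		if (toy_series_a_ids[i] == device)
			return true;
	}
	return false;
}
```

A table keeps the list in one place and avoids the duplicated comparison that
is easy to introduce in a long chain of || tests. Under these assumed values,
an ID of 0x5229 is not in the table, so a reader with that ID would skip the
OCP initialisation.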


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-04 Thread Chris Clayton



On 04/08/2020 11:46, 吳昊澄 Ricky wrote:
>> -----Original Message-----
>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>> Sent: Tuesday, August 04, 2020 4:51 PM
>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org
>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann
>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader 
>> on
>> Intel NUC boxes
>>
>>
>>
>> On 04/08/2020 09:08, 吳昊澄 Ricky wrote:
>>>> -----Original Message-----
>>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org]
>>>> Sent: Tuesday, August 04, 2020 3:49 PM
>>>> To: 吳昊澄 Ricky
>>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com;
>> Arnd
>>>> Bergmann
>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card 
>>>> reader on
>>>> Intel NUC boxes
>>>>
>>>> On Tue, Aug 04, 2020 at 02:44:41AM +0000, 吳昊澄 Ricky wrote:
>>>>> Hi Chris,
>>>>>
>>>>> rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, OC_POWER_DOWN);
>>>>> This register write saves less than 1 mA of power, so if we do not care
>>>>> about that 1 mA we can have a patch that removes it, to be compatible
>>>>> with the NUC6.
>>>>> We tested our other card readers with it removed and did not see any
>>>>> side effects.
>>>>>
>>>>> Hi Greg k-h,
>>>>>
>>>>> Do you have any comments?
>>>>
>>>> comments on what?  I don't know what you are responding to here, sorry.
>>>>
>>> Can we have a patch in the kernel for the NUC6? It may use more power (1 mA), but it
>>> will be more compatible.
>>>
>>
>> Ricky,
>>
>> I don't understand why you want to completely remove the code that sets up 
>> the
>> 1mA power saving. That code was there
>> even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I
>> assume it benefits some of the Realtek card
>> readers. Before your patch however, rtsx_pci_init_ocp() was not called for 
>> the
>> rts5229 reader, but the patch introduced
>> an unconditional call to that function into rtsx_pci_init_hw(), which is run 
>> for the
>> rts5229. That is what now disables
>> the card reader.
>>
>> Now, I don't know whether other cards are affected, although I don't recall
>> seeing any reported as I searched the kernel
>> and ubuntu bugzillas for any analysis of the problem. I know this is not 
>> what the
>> patch I sent does, but having thought
>> about it more, seems to me that the simplest fix is to skip the new call to
>> rtsx_pci_init_ocp() if the reader is an rts5229.
>>
> 
> Because we are wondering whether our other card readers that do not belong to 
> the A series (which is what my OCP patch covers) might have the same problem 
> on the NUC6 platform...
>  

OK. What if we do make the new call but only for the card readers that are in 
the A series? Are they the ones that have
PID_5nnn defines in include/linux/rtsx_pci.h? Or is there another simple way of 
identifying that a reader is a member of
the A series?

I'm thinking of something like:

static bool rtsx_pci_is_series_A(struct rtsx_pcr *pcr)
{
	unsigned short device = pcr->pci->device;

	return device == PID_524A || device == PID_5249 ||
	       device == PID_5250 || device == PID_525A ||
	       device == PID_5260 || device == PID_5261;
}

then in rtsx_pci_init_hw() change the unconditional call to:

	if (rtsx_pci_is_series_A(pcr))
		rtsx_pci_init_ocp(pcr);

Does that seem OK?

>> If you agree, I can prepare a patch and send it to you. Whatever the 
>> solution is, it
>> will also be needed in the stable
>> kernels later than 5.0.
>>
> 
> OK, I agree with you. For now we can patch only the rts5229 first, so that 
> the reader works for NUC6 users.
> 
> Thank you 
> 
>> Chris
>>>> greg k-h
>>>>
>>>> --Please consider the environment before printing this e-mail.


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-04 Thread Chris Clayton



On 04/08/2020 09:08, 吳昊澄 Ricky wrote:
>> -Original Message-
>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org]
>> Sent: Tuesday, August 04, 2020 3:49 PM
>> To: 吳昊澄 Ricky
>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd
>> Bergmann
>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader 
>> on
>> Intel NUC boxes
>>
>> On Tue, Aug 04, 2020 at 02:44:41AM +0000, 吳昊澄 Ricky wrote:
>>> Hi Chris,
>>>
>>> rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, OC_POWER_DOWN);
>>> This register write saves less than 1 mA of power, so if we do not care
>>> about that 1 mA we can have a patch that removes it, to be compatible
>>> with the NUC6.
>>> We tested our other card readers with it removed and did not see any 
>>> side effects.
>>>
>>> Hi Greg k-h,
>>>
>>> Do you have any comments?
>>
>> comments on what?  I don't know what you are responding to here, sorry.
>>
> Can we have a patch in the kernel for the NUC6? It may use more power (1 mA), 
> but it will be more compatible.
> 

Ricky,

I don't understand why you want to completely remove the code that sets up the 
1mA power saving. That code was there
even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I assume 
it benefits some of the Realtek card
readers. Before your patch however, rtsx_pci_init_ocp() was not called for the 
rts5229 reader, but the patch introduced
an unconditional call to that function into rtsx_pci_init_hw(), which is run 
for the rts5229. That is what now disables
the card reader.

Now, I don't know whether other cards are affected, although I don't recall 
seeing any reported as I searched the kernel
and ubuntu bugzillas for any analysis of the problem. I know this is not what 
the patch I sent does, but having thought
about it more, seems to me that the simplest fix is to skip the new call to 
rtsx_pci_init_ocp() if the reader is an rts5229.

If you agree, I can prepare a patch and send it to you. Whatever the solution 
is, it will also be needed in the stable
kernels later than 5.0.

Chris
>> greg k-h


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-02 Thread Chris Clayton
Hi, Ricky

On 03/08/2020 04:01, 吳昊澄 Ricky wrote:
> Hi Chris,
> 
> We don’t think this is our bug...
> This register(FPDCTL) write to OC_POWER_DOWN is for our power saving feature, 
> not to disable the reader
> In your case, we cannot reproduce this on our side that we mention before, we 
> don’t have the platform(Intel NUC Tall Arches Canyon NUC6CAYH Celeron J345) 
> to see this issue
> But we think this issue maybe only on this platform, our RTS5229 works well 
> on the new kernel all platform that we have  
> 
> Ricky

Perhaps I should have used the word regression rather than bug. I didn't buy 
the machine until earlier this year, but
other people who have reported this problem have indicated that until 
bede03a579b3 was applied (during the 5.1 merge
window), the driver supported the card reader on the Intel NUC boxes. I 
know that is true because I built and
tested a 5.0 kernel and the card reader worked fine. I've also built and tested 
a 5.1-rc1 kernel and the card reader no
longer works. Whether by design or by accident, the card reader worked and now 
it doesn't. That's a regression in my book!

Since you signed off the patch that caused the regression, I believe it is your 
bug.

Thanks.

Chris
> 
>> -Original Message-
>> From: Chris Clayton [mailto:chris2...@googlemail.com]
>> Sent: Monday, August 03, 2020 3:59 AM
>> To: LKML; 吳昊澄 Ricky; gre...@linuxfoundation.org; rdun...@infradead.org;
>> philqua...@gmail.com; Arnd Bergmann
>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader 
>> on
>> Intel NUC boxes
>>
>> Sorry, I should have said that the patch is against 5.7.12. It applies to 
>> upstream,
>> but with offsets.
>>
>> On 02/08/2020 20:48, Chris Clayton wrote:
>>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card
>>> reader disabled on my Intel NUC6CAYH box.
>>>
>>> The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to 
>>> rtsx_pci_init_ocp() was added to rtsx_pci_init_hw().
>>> At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, 
>>> so in rtsx_pci_init_ocp() the cardreader
>>> gets disabled.
>>>
>>> I've avoided this by making execution of the code that results in the reader
>>> being disabled conditional on the device
>>> not being an RTS5229. Of course, other rtsxxx card readers may also be
>>> disabled by this bug. I don't have the
>>> knowledge to address that, so I'll leave that to the driver maintainers.
>>>
>>> The patch to avoid the bug is attached.
>>>
>>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a
>>> rts5260")
>>> Link: https://marc.info/?l=linux-kernel&m=159105912832257
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
>>> Signed-off-by: Chris Clayton 
>>>
>>> Chris
>>>


Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-02 Thread Chris Clayton
Sorry, I should have said that the patch is against 5.7.12. It applies to 
upstream, but with offsets.

On 02/08/2020 20:48, Chris Clayton wrote:
> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card 
> reader disabled on my Intel NUC6CAYH box.
> 
> The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to 
> rtsx_pci_init_ocp() was added to rtsx_pci_init_hw().
> At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, so 
> in rtsx_pci_init_ocp() the cardreader
> gets disabled.
> 
> I've avoided this by making execution of the code that results in the reader 
> being disabled conditional on the device
> not being an RTS5229. Of course, other rtsxxx card readers may also be 
> disabled by this bug. I don't have the
> knowledge to address that, so I'll leave that to the driver maintainers.
> 
> The patch to avoid the bug is attached.
> 
> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a 
> rts5260")
> Link: https://marc.info/?l=linux-kernel&m=159105912832257
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
> Signed-off-by: Chris Clayton 
> 
> Chris
> 


PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes

2020-08-02 Thread Chris Clayton
bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card reader 
disabled on my Intel NUC6CAYH box.

The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to rtsx_pci_init_ocp() 
was added to rtsx_pci_init_hw().
At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, so 
in rtsx_pci_init_ocp() the cardreader
gets disabled.
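
To illustrate the path taken, here is a simplified, self-contained model of the
branch structure as I read it. It is a sketch only: the model_* names and the
struct fields are hypothetical stand-ins, not the driver's real definitions, and
the flags merely record which branch ran instead of touching any hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's structures, for illustration only. */
struct model_pcr;

struct model_ops {
	void (*init_ocp)(struct model_pcr *pcr);	/* NULL for the rts5229 */
};

struct model_pcr {
	uint16_t pid;			/* PCI device ID, e.g. 0x5229 */
	const struct model_ops *ops;
	bool ocp_en;			/* models pcr->option.ocp_en */
	bool ocp_enabled;		/* set where rtsx_pci_enable_ocp() would run */
	bool oc_powered_down;		/* set where the FPDCTL write would happen */
};

/* Branch structure of rtsx_pci_init_ocp() as described above. */
static void model_init_ocp(struct model_pcr *pcr)
{
	if (pcr->ops->init_ocp)
		pcr->ops->init_ocp(pcr);
	else if (pcr->ocp_en)
		pcr->ocp_enabled = true;
	else
		pcr->oc_powered_down = true;	/* OC power down */
}

/* Same flow, but skipping the power-down when the device is an rts5229. */
static void model_init_ocp_skip_5229(struct model_pcr *pcr)
{
	if (pcr->ops->init_ocp)
		pcr->ops->init_ocp(pcr);
	else if (pcr->ocp_en)
		pcr->ocp_enabled = true;
	else if (pcr->pid != 0x5229)
		pcr->oc_powered_down = true;
}
```

With init_ocp NULL and ocp_en 0, as on the rts5229, model_init_ocp() ends in the
power-down branch, while model_init_ocp_skip_5229() leaves the reader alone.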

I've avoided this by making execution of the code that results in the reader 
being disabled conditional on the device
not being an RTS5229. Of course, other rtsxxx card readers may also be disabled 
by this bug. I don't have the
knowledge to address that, so I'll leave that to the driver maintainers.

The patch to avoid the bug is attached.

Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a 
rts5260")
Link: https://marc.info/?l=linux-kernel&m=159105912832257
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
Signed-off-by: Chris Clayton 

Chris
--- linux-5.7.12/drivers/misc/cardreader/rtsx_pcr.c.orig	2020-08-02 13:36:50.216947944 +0100
+++ linux-5.7.12/drivers/misc/cardreader/rtsx_pcr.c	2020-08-02 18:37:30.456610731 +0100
@@ -1200,9 +1200,13 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *
 				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
 			rtsx_pci_enable_ocp(pcr);
 		} else {
-			/* OC power down */
-			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
-				OC_POWER_DOWN);
+			/* On (some?) Intel NUC platforms, this disables
+			 * the rts5229 cardreader, so don't do it
+			 */
+			if(!CHK_PCI_PID(pcr, 0x5229))
+				/* OC power down */
+				rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
+					OC_POWER_DOWN);
 		}
 	}
 }


Re: Linux 5.3.6

2019-10-13 Thread Chris Clayton



On 12/10/2019 21:55, Gabriel C wrote:
> On Sat, 12 Oct 2019 at 21:16, Chris Clayton wrote:
>>
>>
>>> I'm announcing the release of the 5.3.6 kernel.
>>
>>
>> 5.3.6 build fails here with:
>>
>> arch/x86/entry/vdso/vdso64.so.dbg: undefined symbols found
>>   CC  arch/x86/kernel/cpu/mce/threshold.o
>> make[3]: *** [arch/x86/entry/vdso/Makefile:59: 
>> arch/x86/entry/vdso/vdso64.so.dbg] Error 1
>> make[3]: *** Deleting file 'arch/x86/entry/vdso/vdso64.so.dbg'
>> make[2]: *** [scripts/Makefile.build:497: arch/x86/entry/vdso] Error 2
>> make[1]: *** [scripts/Makefile.build:497: arch/x86/entry] Error 2
>> make[1]: *** Waiting for unfinished jobs
>>
> 
> What is your default linker ?
> 
> Also does make LD=ld.bfd fixes that for you ?
> 

Thanks, Gabriel. The default linker is gold, but your suggestion above fixed 
the build. I think I'll set the default to
ld.bfd.

> See https://bugzilla.kernel.org/show_bug.cgi?id=204951
> 
> BR,
> 
> Gabriel C.
> 


Re: Linux 5.3.6

2019-10-12 Thread Chris Clayton


> I'm announcing the release of the 5.3.6 kernel.


5.3.6 build fails here with:

arch/x86/entry/vdso/vdso64.so.dbg: undefined symbols found
  CC  arch/x86/kernel/cpu/mce/threshold.o
make[3]: *** [arch/x86/entry/vdso/Makefile:59: 
arch/x86/entry/vdso/vdso64.so.dbg] Error 1
make[3]: *** Deleting file 'arch/x86/entry/vdso/vdso64.so.dbg'
make[2]: *** [scripts/Makefile.build:497: arch/x86/entry/vdso] Error 2
make[1]: *** [scripts/Makefile.build:497: arch/x86/entry] Error 2
make[1]: *** Waiting for unfinished jobs....

Chris Clayton


Re: [PATCH] timekeeping/vsyscall: Prevent math overflow in BOOTTIME update

2019-08-22 Thread Chris Clayton
Thanks Thomas.

On 22/08/2019 12:00, Thomas Gleixner wrote:
> The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the
> nanoseconds based boot time offset left by the clocksource shift. That
> overflows once the boot time offset becomes large enough. As a consequence
> CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
> misbehave.
> 
> Fix it by storing a timespec64 representation of the offset when boot time
> is adjusted and add that to the MONOTONIC base time value in the vdso data
> page. Using the timespec64 representation avoids a 64bit division in the
> update code.
> 

I've tested resume from both suspend and hibernate and this patch fixes the 
problem I reported.
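
For anyone following along, the overflow described in the changelog can be
sketched in plain C. This is only an illustration under assumptions: the
function and struct names are made up, and the shift of 24 used in the note
after the code is an arbitrary example value, not necessarily what any
particular clocksource uses.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Before the fix: the whole nanosecond offset is shifted left. */
static uint64_t shifted_offset_old(uint64_t offs_boot_ns, unsigned int shift)
{
	return offs_boot_ns << shift;	/* bits above bit 63 are silently lost */
}

/* Hypothetical holder for a seconds part and a shifted sub-second part. */
struct split_offset {
	uint64_t sec;
	uint64_t shifted_nsec;
};

/* After the fix: seconds are kept apart, only the remainder is shifted. */
static struct split_offset shifted_offset_new(uint64_t offs_boot_ns,
					      unsigned int shift)
{
	struct split_offset s;

	s.sec = offs_boot_ns / NSEC_PER_SEC;
	/* The remainder is below 2^30, so it has ample headroom for the shift. */
	s.shifted_nsec = (offs_boot_ns % NSEC_PER_SEC) << shift;
	return s;
}
```

With a shift of 24, an offset of 2^40 ns (roughly 18 minutes) wraps to 0 in the
old form, while the split form keeps 1099 s and shifts only the 511627776 ns
remainder, which still fits easily in 64 bits.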

Tested-by: Chris Clayton 

> Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() 
> implementation")
> Reported-by: Chris Clayton 
> Signed-off-by: Thomas Gleixner 
> ---
>  include/linux/timekeeper_internal.h |5 +
>  kernel/time/timekeeping.c   |5 +
>  kernel/time/vsyscall.c  |   22 +-
>  3 files changed, 23 insertions(+), 9 deletions(-)
> 
> --- a/include/linux/timekeeper_internal.h
> +++ b/include/linux/timekeeper_internal.h
> @@ -57,6 +57,7 @@ struct tk_read_base {
>   * @cs_was_changed_seq:  The sequence number of clocksource change events
>   * @next_leap_ktime: CLOCK_MONOTONIC time value of a pending leap-second
>   * @raw_sec: CLOCK_MONOTONIC_RAW  time in seconds
> + * @monotonic_to_boot:   CLOCK_MONOTONIC to CLOCK_BOOTTIME offset
>   * @cycle_interval:  Number of clock cycles in one NTP interval
>   * @xtime_interval:  Number of clock shifted nano seconds in one NTP
>   *   interval.
> @@ -84,6 +85,9 @@ struct tk_read_base {
>   *
>   * wall_to_monotonic is no longer the boot time, getboottime must be
>   * used instead.
> + *
> + * @monotonic_to_boottime is a timespec64 representation of @offs_boot to
> + * accelerate the VDSO update for CLOCK_BOOTTIME.
>   */
>  struct timekeeper {
>   struct tk_read_base tkr_mono;
> @@ -99,6 +103,7 @@ struct timekeeper {
>   u8  cs_was_changed_seq;
>   ktime_t next_leap_ktime;
>   u64 raw_sec;
> + struct timespec64   monotonic_to_boot;
>  
>   /* The following members are for timekeeping internal use */
>   u64 cycle_interval;
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -146,6 +146,11 @@ static void tk_set_wall_to_mono(struct t
>  static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
>  {
>   tk->offs_boot = ktime_add(tk->offs_boot, delta);
> + /*
> +  * Timespec representation for VDSO update to avoid 64bit division
> +  * on every update.
> +  */
> + tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
>  }
>  
>  /*
> --- a/kernel/time/vsyscall.c
> +++ b/kernel/time/vsyscall.c
> @@ -17,7 +17,7 @@ static inline void update_vdso_data(stru
>   struct timekeeper *tk)
>  {
>   struct vdso_timestamp *vdso_ts;
> - u64 nsec;
> + u64 nsec, sec;
>  
>   vdata[CS_HRES_COARSE].cycle_last= tk->tkr_mono.cycle_last;
>   vdata[CS_HRES_COARSE].mask  = tk->tkr_mono.mask;
> @@ -45,23 +45,27 @@ static inline void update_vdso_data(stru
>   }
>   vdso_ts->nsec   = nsec;
>  
> - /* CLOCK_MONOTONIC_RAW */
> - vdso_ts = &vdata[CS_RAW].basetime[CLOCK_MONOTONIC_RAW];
> - vdso_ts->sec= tk->raw_sec;
> - vdso_ts->nsec   = tk->tkr_raw.xtime_nsec;
> + /* Copy MONOTONIC time for BOOTTIME */
> + sec = vdso_ts->sec;
> + /* Add the boot offset */
> + sec += tk->monotonic_to_boot.tv_sec;
> + nsec+= (u64)tk->monotonic_to_boot.tv_nsec << tk->tkr_mono.shift;
>  
>   /* CLOCK_BOOTTIME */
>   vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_BOOTTIME];
> - vdso_ts->sec= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
> - nsec = tk->tkr_mono.xtime_nsec;
> - nsec += ((u64)(tk->wall_to_monotonic.tv_nsec +
> -ktime_to_ns(tk->offs_boot)) << tk->tkr_mono.shift);
> + vdso_ts->sec= sec;
> +
>   while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
>   nsec -= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift);
>   vdso_ts->sec++;
>   }
>   vdso_ts->nsec   = nsec;
>  
> + /* CLOCK_MONOTONIC_RAW */
> + vdso_ts = &vdata[CS_RAW].basetime[CLOCK_MONOTONIC_RAW];
> + vdso_ts->sec= tk->raw_sec;
> + vdso_ts->nsec   = tk->tkr_raw.xtime_nsec;
> +
>   /* CLOCK_TAI */
>   vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_TAI];
>   vdso_ts->sec= tk->xtime_sec + (s64)tk->tai_offset;
> 


Re: PROBLEM: 5.3.0-rc* causes iwlwifi failure

2019-08-22 Thread Chris Clayton
Thanks, Stuart.

On 18/08/2019 11:55, Stuart Little wrote:
> On Sun, Aug 18, 2019 at 09:17:59AM +0100, Chris Clayton wrote:
>>
>>
>> On 17/08/2019 22:44, Stuart Little wrote:
>>> After some private coaching from Serge Belyshev on git-revert I can confirm 
>>> that reverting that commit atop the current tree resolves the issue (the 
>>> wifi card scans for and finds networks just fine, no dmesg errors reported, 
>>> etc.).
>>>
>>
>> I've reported the "Microcode SW error detected" issue too, but, wrongly, 
>> only to LKML. I'll point that thread to this
>> one. I've also been experiencing my network stopping working after suspend 
>> resume, but haven't got round to reporting
>> that yet.
>>
>> What was the git magic that you acquired to revert the patch, please?
>>

By following the advice below, I reverted 
4fd445a2c855bbcab81fbe06d110e78dbd974a5b and using the resultant kernel I
haven't seen the "Microcode SW error detected" again. I am, however, still 
experiencing the problem of my network not
working after resume from suspend. I've reported it to LKML, so it can be 
followed there should anyone need/want to.

> 
> $ git revert 
> 
> This will fail as noted, but will place in a revert mode where you can fix 
> the errors.
> 
> $ git status
> 
> will show (it did in my case, for the latest Linux tree at the time I did 
> this) a modified file
> 
> drivers/net/wireless/intel/iwlwifi/mvm/fw.c
> 
> to be committed without issue and a conflicted file
> 
> drivers/net/wireless/intel/iwlwifi/mvm/nvm.c
> 
> whose conflicts you have to first resolve.
> 
> I then opened that conflicted file in a text editor and simply removed 
> everything between the lines
> 
> <<<<<<< HEAD
> 
> and 
> 
> >>>>>>> parent of 4fd445a2c855... iwlwifi: mvm: Add log information about SAR status
> 
> (inclusive). This resolved the conflict, whereupon
> 
> $ git revert --continue
> 
> and
> 
> $ git commit -a
> 
> will finish the reversion. 
> 
>>> On Sat, Aug 17, 2019 at 11:59:59AM +0300, Serge Belyshev wrote:
>>>>
>>>>> I am on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Linux
>>>>> x86_64 (Slackware), with a custom-compiled 5.3.0-rc4 (.config
>>>>> attached).
>>>>>
>>>>> I am using the Intel wifi adapter on this machine:
>>>>>
>>>>> 02:00.0 Network controller: Intel Corporation Device 24fb (rev 10)
>>>>>
>>>>> with the iwlwifi driver. I am attaching the output to 'lspci -vv -s
>>>>> 02:00.0' as the file device-info.
>>>>>
>>>>> All 5.3.0-rc* versions I have tried (including rc4) cause multiple
>>>>> dmesg iwlwifi-related errors (dmesg attached). Examples:
>>>>>
>>>>> iwlwifi 0000:02:00.0: Failed to get geographic profile info -5
>>>>> iwlwifi 0000:02:00.0: Microcode SW error detected.  Restarting 0x8200
>>>>> iwlwifi 0000:02:00.0: 0x0038 | BAD_COMMAND
>>>>>
>>>>
>>>> I have my logs filled with similar garbage throughout 5.3-rc*. Also
>>>> since 5.3-rcsomething not only it WARNS in dmesg about firmware failure,
>>>> but completely stops working after suspend/resume cycle.
>>>>
>>>> It looks like that:
>>>>
>>>> commit 4fd445a2c855bbcab81fbe06d110e78dbd974a5b
>>>> Author: Haim Dreyfuss 
>>>> Date:   Thu May 2 11:45:02 2019 +0300
>>>>
>>>> iwlwifi: mvm: Add log information about SAR status
>>>> 
>>>> Inform users when SAR status is changing.
>>>> 
>>>> Signed-off-by: Haim Dreyfuss 
>>>> Signed-off-by: Luca Coelho 
>>>>
>>>>
>>>> is the culprit. (manually) reverting it on top of 5.3-rc4 makes
>>>> everything work again.
>>>


Regression in 5.3-rc1 and later

2019-08-22 Thread Chris Clayton
Hi everyone,

Firstly, apologies to anyone on the long cc list that turns out not to be 
particularly interested in the following, but
you were all marked as cc'd in the commit message below.

I've found a problem that isn't present in 5.2 series or 4.19 series kernels, 
and seems to have arrived in 5.3-rc1. The
problem is that if I suspend (to ram) my laptop, on resume 14 minutes or more 
after suspending, I have no networking
functionality. If I resume the laptop after 13 minutes or less, networking 
works fine. I haven't tried to get finer
grained timings between 13 and 14 minutes, but can do if it would help.

ifconfig shows that wlan0 is still up and still has its assigned ip address 
but, for instance, a ping of any other
device on my network, fails as does pinging, say, kernel.org. I've tried 
"downing" the network with (/sbin/ifdown) and
unloading the iwlmvm module and then reloading the module and "upping" 
(/sbin/ifup) the network, but my network is still
unusable. I should add that the problem also manifests if I hibernate the 
laptop, although my testing of this has been
minimal. I can do more if required.

As I say, the problem first appears in 5.3-rc1, so I've bisected between 5.2.0 
and 5.3-rc1 and that concluded with:

[chris:~/kernel/linux]$ git bisect good
7ac8707479886c75f353bfb6a8273f423cfccb23 is the first bad commit
commit 7ac8707479886c75f353bfb6a8273f423cfccb23
Author: Vincenzo Frascino 
Date:   Fri Jun 21 10:52:49 2019 +0100

x86/vdso: Switch to generic vDSO implementation

The x86 vDSO library requires some adaptations to take advantage of the
newly introduced generic vDSO library.

Introduce the following changes:
 - Modification of vdso.c to be compliant with the common vdso datapage
 - Use of lib/vdso for gettimeofday

[ tglx: Massaged changelog and cleaned up the function signature formatting 
]

Signed-off-by: Vincenzo Frascino 
Signed-off-by: Thomas Gleixner 
Cc: linux-a...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-kselft...@vger.kernel.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Ralf Baechle 
Cc: Paul Burton 
Cc: Daniel Lezcano 
Cc: Mark Salyzyn 
Cc: Peter Collingbourne 
Cc: Shuah Khan 
Cc: Dmitry Safonov <0x7f454...@gmail.com>
Cc: Rasmus Villemoes 
Cc: Huw Davies 
Cc: Shijith Thotton 
Cc: Andre Przywara 
Link: 
https://lkml.kernel.org/r/20190621095252.32307-23-vincenzo.frasc...@arm.com

 arch/x86/Kconfig |   3 +
 arch/x86/entry/vdso/Makefile |   9 ++
 arch/x86/entry/vdso/vclock_gettime.c | 245 ---
 arch/x86/entry/vdso/vdsox32.lds.S|   1 +
 arch/x86/entry/vsyscall/Makefile |   2 -
 arch/x86/entry/vsyscall/vsyscall_gtod.c  |  83 ---
 arch/x86/include/asm/pvclock.h   |   2 +-
 arch/x86/include/asm/vdso/gettimeofday.h | 191 
 arch/x86/include/asm/vdso/vsyscall.h |  44 ++
 arch/x86/include/asm/vgtod.h |  75 +-
 arch/x86/include/asm/vvar.h  |   7 +-
 arch/x86/kernel/pvclock.c|   1 +
 12 files changed, 284 insertions(+), 379 deletions(-)
 delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c
 create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/x86/include/asm/vdso/vsyscall.h

To confirm my bisection was correct, I did a git checkout of 
7ac8707479886c75f353bfb6a8273f423cfccb23. As expected, the
kernel exhibited the problem I've described. However, a kernel built at the 
immediately preceding (parent?) commit
(bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a 
suspend/resume cycle of 14 minutes or more.

As the module name implies, I'm using wireless networking. The hardware is 
detected as "Intel(R) Wireless-AC 9260
160MHz, REV=0x324" by iwlwifi.

I'm more than happy to provide additional diagnostics (but may need a little 
hand-holding) and to apply diagnostic or
fix patches, but please cc me on any reply as I'm not subscribed to any of the 
kernel-related mailing lists.

Chris


Re: iwlwifi: microcode SW error detected

2019-08-20 Thread Chris Clayton



On 18/08/2019 09:21, Chris Clayton wrote:
> 
> 
> On 17/08/2019 08:19, Chris Clayton wrote:
>> Hi.
>>
>> I just found the following error in the output from dmesg.
>>
>> [ 4023.460058] iwlwifi 0000:02:00.0: Microcode SW error detected. Restarting 0x0.
> 
> Since reporting, I've found that this problem is being explored in the thread 
> that starts at
> https://marc.info/?l=linux-kernel&m=15660151913.

Mmm, that's a dead link. Don't know what happened there but the real link is
https://marc.info/?l=linux-kernel&m=156265244614126

> 
> Chris
> 


linux-5.3.0-rc5: new build warning

2019-08-18 Thread Chris Clayton
Hi,

I've just built 5.3.0-rc5 and a warning that I do not recall having seen before 
was emitted:

...
  HOSTCC  scripts/extract-cert
  HOSTCC   /mnt/kernel/linux/tools/objtool/fixdep.o
  HOSTLD  arch/x86/tools/relocs
  HOSTLD   /mnt/kernel/linux/tools/objtool/fixdep-in.o
  LINK /mnt/kernel/linux/tools/objtool/fixdep
  CC   /mnt/kernel/linux/tools/objtool/builtin-check.o
  CC   /mnt/kernel/linux/tools/objtool/builtin-orc.o
  GEN  /mnt/kernel/linux/tools/objtool/arch/x86/lib/inat-tables.c
awk: arch/x86/tools/gen-insn-attr-x86.awk:260: warning: regexp escape sequence 
`\:' is not a known regexp operator
awk: arch/x86/tools/gen-insn-attr-x86.awk:350: 
(FILENAME=arch/x86/lib/x86-opcode-map.txt FNR=41) warning: regexp escape
sequence `\&' is not a known regexp operator
  CC   /mnt/kernel/linux/tools/objtool/exec-cmd.o
  CC   /mnt/kernel/linux/tools/objtool/check.o
  CC   /mnt/kernel/linux/tools/objtool/arch/x86/decode.o
  CC   /mnt/kernel/linux/tools/objtool/orc_gen.o
  CC   /mnt/kernel/linux/tools/objtool/help.o
  CC   /mnt/kernel/linux/tools/objtool/orc_dump.o
  CC   /mnt/kernel/linux/tools/objtool/pager.o
 ...


Happy to test the fix, but please cc me as I'm not subscribed

Chris


Re: iwlwifi: microcode SW error detected

2019-08-18 Thread Chris Clayton



On 17/08/2019 08:19, Chris Clayton wrote:
> Hi.
> 
> I just found the following error in the output from dmesg.
> 
> [ 4023.460058] iwlwifi 0000:02:00.0: Microcode SW error detected. Restarting 0x0.

Since reporting, I've found that this problem is being explored in the thread 
that starts at
https://marc.info/?l=linux-kernel&m=15660151913.

Chris


Re: PROBLEM: 5.3.0-rc* causes iwlwifi failure

2019-08-18 Thread Chris Clayton



On 17/08/2019 22:44, Stuart Little wrote:
> After some private coaching from Serge Belyshev on git-revert I can confirm 
> that reverting that commit atop the current tree resolves the issue (the wifi 
> card scans for and finds networks just fine, no dmesg errors reported, etc.).
> 

I've reported the "Microcode SW error detected" issue too, but, wrongly, only 
to LKML. I'll point that thread to this
one. I've also been experiencing my network stopping working after suspend 
resume, but haven't got round to reporting
that yet.

What was the git magic that you acquired to revert the patch, please?

> On Sat, Aug 17, 2019 at 11:59:59AM +0300, Serge Belyshev wrote:
>>
>>> I am on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Linux
>>> x86_64 (Slackware), with a custom-compiled 5.3.0-rc4 (.config
>>> attached).
>>>
>>> I am using the Intel wifi adapter on this machine:
>>>
>>> 02:00.0 Network controller: Intel Corporation Device 24fb (rev 10)
>>>
>>> with the iwlwifi driver. I am attaching the output to 'lspci -vv -s
>>> 02:00.0' as the file device-info.
>>>
>>> All 5.3.0-rc* versions I have tried (including rc4) cause multiple
>>> dmesg iwlwifi-related errors (dmesg attached). Examples:
>>>
>>> iwlwifi 0000:02:00.0: Failed to get geographic profile info -5
>>> iwlwifi 0000:02:00.0: Microcode SW error detected.  Restarting 0x8200
>>> iwlwifi 0000:02:00.0: 0x0038 | BAD_COMMAND
>>>
>>
>> I have my logs filled with similar garbage throughout 5.3-rc*. Also
>> since 5.3-rcsomething not only it WARNS in dmesg about firmware failure,
>> but completely stops working after suspend/resume cycle.
>>
>> It looks like that:
>>
>> commit 4fd445a2c855bbcab81fbe06d110e78dbd974a5b
>> Author: Haim Dreyfuss 
>> Date:   Thu May 2 11:45:02 2019 +0300
>>
>> iwlwifi: mvm: Add log information about SAR status
>> 
>> Inform users when SAR status is changing.
>> 
>> Signed-off-by: Haim Dreyfuss 
>> Signed-off-by: Luca Coelho 
>>
>>
>> is the culprit. (manually) reverting it on top of 5.3-rc4 makes
>> everything work again.
> 


iwlwifi: microcode SW error detected

2019-08-17 Thread Chris Clayton
Hi.

I just found the following error in the output from dmesg.

[ 4023.460058] iwlwifi 0000:02:00.0: Microcode SW error detected. Restarting 0x0.
[ 4023.460178] iwlwifi 0000:02:00.0: Start IWL Error Log Dump:
[ 4023.460179] iwlwifi 0000:02:00.0: Status: 0x0080, count: 6
[ 4023.460180] iwlwifi 0000:02:00.0: Loaded firmware version: 46.93e59cf4.0
[ 4023.460181] iwlwifi 0000:02:00.0: 0x22CE | ADVANCED_SYSASSERT
[ 4023.460182] iwlwifi 0000:02:00.0: 0x0590A2F0 | trm_hw_status0
[ 4023.460182] iwlwifi 0000:02:00.0: 0x00000000 | trm_hw_status1
[ 4023.460183] iwlwifi 0000:02:00.0: 0x00488472 | branchlink2
[ 4023.460183] iwlwifi 0000:02:00.0: 0x00479392 | interruptlink1
[ 4023.460184] iwlwifi 0000:02:00.0: 0x00000000 | interruptlink2
[ 4023.460184] iwlwifi 0000:02:00.0: 0x012C | data1
[ 4023.460185] iwlwifi 0000:02:00.0: 0x00000000 | data2
[ 4023.460186] iwlwifi 0000:02:00.0: 0x0400 | data3
[ 4023.460186] iwlwifi 0000:02:00.0: 0x42001A44 | beacon time
[ 4023.460187] iwlwifi 0000:02:00.0: 0x4E9F05CD | tsf low
[ 4023.460187] iwlwifi 0000:02:00.0: 0x00D8 | tsf hi
[ 4023.460188] iwlwifi 0000:02:00.0: 0x00000000 | time gp1
[ 4023.460188] iwlwifi 0000:02:00.0: 0xEF55F6D0 | time gp2
[ 4023.460189] iwlwifi 0000:02:00.0: 0x0001 | uCode revision type
[ 4023.460190] iwlwifi 0000:02:00.0: 0x002E | uCode version major
[ 4023.460190] iwlwifi 0000:02:00.0: 0x93E59CF4 | uCode version minor
[ 4023.460191] iwlwifi 0000:02:00.0: 0x0321 | hw version
[ 4023.460191] iwlwifi 0000:02:00.0: 0x00C89004 | board version
[ 4023.460192] iwlwifi 0000:02:00.0: 0x0A05001C | hcmd
[ 4023.460192] iwlwifi 0000:02:00.0: 0xA2F93802 | isr0
[ 4023.460193] iwlwifi 0000:02:00.0: 0x0004 | isr1
[ 4023.460193] iwlwifi 0000:02:00.0: 0x1802 | isr2
[ 4023.460194] iwlwifi 0000:02:00.0: 0x40417DCD | isr3
[ 4023.460195] iwlwifi 0000:02:00.0: 0x00000000 | isr4
[ 4023.460195] iwlwifi 0000:02:00.0: 0x0A04001C | last cmd Id
[ 4023.460196] iwlwifi 0000:02:00.0: 0x00018802 | wait_event
[ 4023.460196] iwlwifi 0000:02:00.0: 0x4A88 | l2p_control
[ 4023.460197] iwlwifi 0000:02:00.0: 0x0020 | l2p_duration
[ 4023.460197] iwlwifi 0000:02:00.0: 0x03BF | l2p_mhvalid
[ 4023.460198] iwlwifi 0000:02:00.0: 0x00EF | l2p_addr_match
[ 4023.460198] iwlwifi 0000:02:00.0: 0x000D | lmpm_pmg_sel
[ 4023.460199] iwlwifi 0000:02:00.0: 0x19071250 | timestamp
[ 4023.460199] iwlwifi 0000:02:00.0: 0x14C0E8E8 | flow_handler
[ 4023.460257] iwlwifi 0000:02:00.0: 0x00000000 | ADVANCED_SYSASSERT
[ 4023.460257] iwlwifi 0000:02:00.0: 0x00000000 | umac branchlink1
[ 4023.460258] iwlwifi 0000:02:00.0: 0x00000000 | umac branchlink2
[ 4023.460258] iwlwifi 0000:02:00.0: 0x00000000 | umac interruptlink1
[ 4023.460259] iwlwifi 0000:02:00.0: 0x00000000 | umac interruptlink2
[ 4023.460260] iwlwifi 0000:02:00.0: 0x00000000 | umac data1
[ 4023.460260] iwlwifi 0000:02:00.0: 0x00000000 | umac data2
[ 4023.460261] iwlwifi 0000:02:00.0: 0x00000000 | umac data3
[ 4023.460261] iwlwifi :02:00.0: 0x | umac major
[ 4023.460262] iwlwifi :02:00.0: 0x | umac minor
[ 4023.460262] iwlwifi :02:00.0: 0x | frame pointer
[ 4023.460263] iwlwifi :02:00.0: 0x | stack pointer
[ 4023.460263] iwlwifi :02:00.0: 0x | last host cmd
[ 4023.460264] iwlwifi :02:00.0: 0x | isr status reg
[ 4023.460278] iwlwifi :02:00.0: Fseq Registers:
[ 4023.460282] iwlwifi :02:00.0: 0x0568FC22 | FSEQ_ERROR_CODE
[ 4023.460289] iwlwifi :02:00.0: 0x | FSEQ_TOP_INIT_VERSION
[ 4023.460297] iwlwifi :02:00.0: 0xDFFC324F | FSEQ_CNVIO_INIT_VERSION
[ 4023.460304] iwlwifi :02:00.0: 0xA371 | FSEQ_OTP_VERSION
[ 4023.460312] iwlwifi :02:00.0: 0xC338B29A | FSEQ_TOP_CONTENT_VERSION
[ 4023.460319] iwlwifi :02:00.0: 0xD9E91E16 | FSEQ_ALIVE_TOKEN
[ 4023.460327] iwlwifi :02:00.0: 0xAC99E6BF | FSEQ_CNVI_ID
[ 4023.460334] iwlwifi :02:00.0: 0x07665623 | FSEQ_CNVR_ID
[ 4023.460342] iwlwifi :02:00.0: 0x01000200 | CNVI_AUX_MISC_CHIP
[ 4023.460349] iwlwifi :02:00.0: 0x01300202 | CNVR_AUX_MISC_CHIP
[ 4023.460357] iwlwifi :02:00.0: 0x485B | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
[ 4023.460413] iwlwifi :02:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
[ 4023.460421] iwlwifi :02:00.0: Collecting data: trigger 2 fired.
[ 4023.460424] ieee80211 phy0: Hardware restart was requested
[ 4024.639366] iwlwifi :02:00.0: Applying debug destination EXTERNAL_DRAM
[ 4024.753171] iwlwifi :02:00.0: Applying debug destination EXTERNAL_DRAM
[ 4024.817999] iwlwifi :02:00.0: FW already configured (0) - re-configuring
[ 4024.829374] iwlwifi :02:00.0: BIOS contains WGDS but no WRDS

The output messages from the driver when the system starts are:

[3.667365] iwlwifi :02:00.0: enabling device ( -> 0002)
[3.670357] iwlwifi :02:00.0: Found debug destination: EXTERNAL_DRAM
[3.670360] iwlwifi :02:00.0: Found debug configuration: 0
[3.670525] iwlwifi 

Re: [PATCH v2] x86/boot: save fields explicitly, zero out everything else

2019-08-10 Thread Chris Clayton



On 31/07/2019 06:46, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> Recent gcc compilers (gcc 9.1) generate warnings about an
> out of bounds memset, if you try to memset across several fields
> of a struct. This generated a couple of warnings on x86_64 builds.
> 
> Fix this by explicitly saving the fields in struct boot_params
> that are intended to be preserved, and zeroing all the rest.
> 

I applied John's patch below to v5.3-rc3-285-gecb095bff5d4 and have been
running the resultant kernel for two days now, including 7 or 8 cold starts
and reboots. The warnings that were produced by gcc9 are no longer emitted
and, other than a pre-existing problem (no network after resume from suspend
or hibernate, which I will investigate and, if necessary, report later
today), the kernel has supported my typical day to day activities (building
software, email, browsing, listening to music, watching video) without
problem.

Tested-by: Chris Clayton 

> Suggested-by: Thomas Gleixner 
> Suggested-by: H. Peter Anvin 
> Signed-off-by: John Hubbard 
> ---
>  arch/x86/include/asm/bootparam_utils.h | 62 +++---
>  1 file changed, 47 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/bootparam_utils.h b/arch/x86/include/asm/bootparam_utils.h
> index 101eb944f13c..514aee24b8de 100644
> --- a/arch/x86/include/asm/bootparam_utils.h
> +++ b/arch/x86/include/asm/bootparam_utils.h
> @@ -18,6 +18,20 @@
>   * Note: efi_info is commonly left uninitialized, but that field has a
>   * private magic, so it is better to leave it unchanged.
>   */
> +
> +#define sizeof_mbr(type, member) ({ sizeof(((type *)0)->member); })
> +
> +#define BOOT_PARAM_PRESERVE(struct_member)	\
> +	{	\
> +		.start = offsetof(struct boot_params, struct_member),	\
> +		.len   = sizeof_mbr(struct boot_params, struct_member),	\
> +	}
> +
> +struct boot_params_to_save {
> +	unsigned int start;
> +	unsigned int len;
> +};
> +
>  static void sanitize_boot_params(struct boot_params *boot_params)
>  {
>  	/* 
> @@ -35,21 +49,39 @@ static void sanitize_boot_params(struct boot_params *boot_params)
>  	 * problems again.
>  	 */
>  	if (boot_params->sentinel) {
> -		/* fields in boot_params are left uninitialized, clear them */
> -		boot_params->acpi_rsdp_addr = 0;
> -		memset(&boot_params->ext_ramdisk_image, 0,
> -		       (char *)&boot_params->efi_info -
> -			(char *)&boot_params->ext_ramdisk_image);
> -		memset(&boot_params->kbd_status, 0,
> -		       (char *)&boot_params->hdr -
> -		       (char *)&boot_params->kbd_status);
> -		memset(&boot_params->_pad7[0], 0,
> -		       (char *)&boot_params->edd_mbr_sig_buffer[0] -
> -			(char *)&boot_params->_pad7[0]);
> -		memset(&boot_params->_pad8[0], 0,
> -		       (char *)&boot_params->eddbuf[0] -
> -			(char *)&boot_params->_pad8[0]);
> -		memset(&boot_params->_pad9[0], 0, sizeof(boot_params->_pad9));
> +		static struct boot_params scratch;
> +		char *bp_base = (char *)boot_params;
> +		char *save_base = (char *)&scratch;
> +		int i;
> +
> +		const struct boot_params_to_save to_save[] = {
> +			BOOT_PARAM_PRESERVE(screen_info),
> +			BOOT_PARAM_PRESERVE(apm_bios_info),
> +			BOOT_PARAM_PRESERVE(tboot_addr),
> +			BOOT_PARAM_PRESERVE(ist_info),
> +			BOOT_PARAM_PRESERVE(acpi_rsdp_addr),
> +			BOOT_PARAM_PRESERVE(hd0_info),
> +			BOOT_PARAM_PRESERVE(hd1_info),
> +			BOOT_PARAM_PRESERVE(sys_desc_table),
> +			BOOT_PARAM_PRESERVE(olpc_ofw_header),
> +			BOOT_PARAM_PRESERVE(efi_info),
> +			BOOT_PARAM_PRESERVE(alt_mem_k),
> +			BOOT_PARAM_PRESERVE(scratch),
> +			BOOT_PARAM_PRESERVE(e820_entries),
> +			BOOT_PARAM_PRESERVE(eddbuf_entries),
> +			BOOT_PARAM_PRESERVE(edd_mbr_sig_buf_entries),
> +			BOOT_PARAM_PRESERVE(edd_mbr_sig_buffer),
> +			BOOT_PARAM_PRESERVE(e820_table),
> +			BOOT_PARAM_PRESERVE(eddbuf),
> +		};
> +
> +		memset(&scratch, 0, sizeof(scratch));
> +
> +		for (i = 0; i < ARRAY_SIZE(to_save); i++)
> +			memcpy(save_base + to_save[i].start,
> +			       bp_base + to_save[i].start, to_save[i].len);
> +
> +		memcpy(boot_params, save_base, sizeof(*boot_params));
>  	}
>  }
>  
> 
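
The save-then-zero idea in the quoted patch can be tried outside the kernel.
What follows is only a minimal user-space sketch, assuming a made-up
struct toy_params in place of struct boot_params. The names toy_params,
range_to_save, TOY_PRESERVE and sanitize_toy are hypothetical and do not come
from the kernel, and the sketch uses a local scratch copy where the patch uses
a static one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct boot_params. */
struct toy_params {
	unsigned int keep_a;		/* must survive sanitizing */
	unsigned int junk_b;		/* may hold garbage */
	unsigned char keep_c[4];	/* must survive sanitizing */
	unsigned char junk_d[8];	/* may hold garbage */
	unsigned char sentinel;		/* non-zero means "struct is dirty" */
};

/* One byte range (offset and length) that has to be preserved. */
struct range_to_save {
	size_t start;
	size_t len;
};

#define TOY_PRESERVE(member)					\
	{							\
		.start = offsetof(struct toy_params, member),	\
		.len = sizeof(((struct toy_params *)0)->member),	\
	}

/* Copy the listed fields into a zeroed scratch struct, then copy it back. */
static void sanitize_toy(struct toy_params *p)
{
	static const struct range_to_save to_save[] = {
		TOY_PRESERVE(keep_a),
		TOY_PRESERVE(keep_c),
	};
	struct toy_params scratch;
	size_t i;

	assert(p != NULL);
	if (!p->sentinel)
		return;

	memset(&scratch, 0, sizeof(scratch));
	for (i = 0; i < sizeof(to_save) / sizeof(to_save[0]); i++)
		memcpy((char *)&scratch + to_save[i].start,
		       (char *)p + to_save[i].start, to_save[i].len);
	memcpy(p, &scratch, sizeof(*p));
}
```

Every memset and memcpy here works on a pointer to the whole struct plus an
offset, never on the address of a single member, which is the property the
patch relies on. Fields not listed in to_save, including the sentinel itself,
come back as zero.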


Re: Warnings whilst building 5.2.0+

2019-08-07 Thread Chris Clayton



On 09/07/2019 12:39, Chris Clayton wrote:
> 
> 
> On 09/07/2019 11:37, Enrico Weigelt, metux IT consult wrote:
>> On 09.07.19 08:06, Chris Clayton wrote:
>>
>> Hi,
>>
>>> I've pulled Linus' tree this morning and, after running 'make oldconfig', 
>>> tried a build. During that build I got the
>>> following warnings, which look to me like they should be fixed. 'git 
>>> describe' shows v5.2-915-g5ad18b2e60b7 and my
>>> compiler is the 20190706 snapshot of gcc 9.
>>
>> Thanks for the report. I'm rebuilding right now anyway, so I'll look
>> out for it.
> 
> Thanks for the reply.
> 
>>> In file included from arch/x86/kernel/head64.c:35:
>>> In function 'sanitize_boot_params',
>>>     inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
>>> ./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset [197, 448] from the object at 'boot_params' is out of the bounds of referenced subobject 'ext_ramdisk_image' with type 'unsigned int' at offset 192 [-Warray-bounds]
>>>    40 |   memset(&boot_params->ext_ramdisk_image, 0,
>>>       |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>    41 |          (char *)&boot_params->efi_info -
>>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>    42 |    (char *)&boot_params->ext_ramdisk_image);
>>>       |    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> ./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset [493, 497] from the object at 'boot_params' is out of the bounds of referenced subobject 'kbd_status' with type 'unsigned char' at offset 491 [-Warray-bounds]
>>>    43 |   memset(&boot_params->kbd_status, 0,
>>>       |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>    44 |          (char *)&boot_params->hdr -
>>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>    45 |          (char *)&boot_params->kbd_status);
>>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Can you check older versions, too? Maybe also try an older gcc?
>>
> 
> I see the same warnings building linux-5.2.0 with gcc9. However, I don't see
> the warnings building linux-5.2.0 with the 20190705 snapshot of gcc8. So the
> warnings could result from an improvement (i.e. the problem was in the kernel,
> but undiscovered by gcc8) or from a regression in gcc9.
> 

From the discussion starting at
https://marc.info/?l=linux-kernel&m=156401014023908, it would appear that the
problem is in the kernel and simply went undiscovered by gcc8. Building a
fresh pull of Linus' tree this morning (v5.3-rc3-282-g33920f1ec5bf), I see
that the warnings are still being emitted. Adding the participants in the
other discussion to this one.

>>
>> --mtx
>>


Re: Warnings whilst building 5.2.0+

2019-07-09 Thread Chris Clayton



On 09/07/2019 11:37, Enrico Weigelt, metux IT consult wrote:
> On 09.07.19 08:06, Chris Clayton wrote:
> 
> Hi,
> 
>> I've pulled Linus' tree this morning and, after running 'make oldconfig', 
>> tried a build. During that build I got the
>> following warnings, which look to me like they should be fixed. 'git 
>> describe' shows v5.2-915-g5ad18b2e60b7 and my
>> compiler is the 20190706 snapshot of gcc 9.
> 
> Thanks for the report. I'm rebuilding right now anyway, so I'll look
> out for it.

Thanks for the reply.

>> In file included from arch/x86/kernel/head64.c:35:
>> In function 'sanitize_boot_params',
>>     inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
>> ./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset [197, 448] from the object at 'boot_params' is out of the bounds of referenced subobject 'ext_ramdisk_image' with type 'unsigned int' at offset 192 [-Warray-bounds]
>>    40 |   memset(&boot_params->ext_ramdisk_image, 0,
>>       |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    41 |          (char *)&boot_params->efi_info -
>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    42 |    (char *)&boot_params->ext_ramdisk_image);
>>       |    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset [493, 497] from the object at 'boot_params' is out of the bounds of referenced subobject 'kbd_status' with type 'unsigned char' at offset 491 [-Warray-bounds]
>>    43 |   memset(&boot_params->kbd_status, 0,
>>       |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    44 |          (char *)&boot_params->hdr -
>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    45 |          (char *)&boot_params->kbd_status);
>>       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Can you check older versions, too? Maybe also try an older gcc?
> 

I see the same warnings building linux-5.2.0 with gcc9. However, I don't see
the warnings building linux-5.2.0 with the 20190705 snapshot of gcc8. So the
warnings could result from an improvement (i.e. the problem was in the kernel,
but undiscovered by gcc8) or from a regression in gcc9.

> 
> --mtx
> 


Warnings whilst building 5.2.0+

2019-07-09 Thread Chris Clayton
Hi,

I've pulled Linus' tree this morning and, after running 'make oldconfig',
tried a build. During that build I got the following warnings, which look to
me like they should be fixed. 'git describe' shows v5.2-915-g5ad18b2e60b7 and
my compiler is the 20190706 snapshot of gcc 9.

In file included from arch/x86/kernel/head64.c:35:
In function 'sanitize_boot_params',
    inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset [197, 448] from the object at 'boot_params' is out of the bounds of referenced subobject 'ext_ramdisk_image' with type 'unsigned int' at offset 192 [-Warray-bounds]
   40 |   memset(&boot_params->ext_ramdisk_image, 0,
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   41 |          (char *)&boot_params->efi_info -
      |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   42 |    (char *)&boot_params->ext_ramdisk_image);
      |    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset [493, 497] from the object at 'boot_params' is out of the bounds of referenced subobject 'kbd_status' with type 'unsigned char' at offset 491 [-Warray-bounds]
   43 |   memset(&boot_params->kbd_status, 0,
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   44 |          (char *)&boot_params->hdr -
      |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~
   45 |          (char *)&boot_params->kbd_status);
      |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Happy to test any patches, but please cc me as I'm not subscribed to LKML.

Chris
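
The flagged calls take the address of one member (for example
ext_ramdisk_image) and then clear well past the end of that member, which
appears to be what gcc 9 is objecting to. As a rough illustration only, and
assuming a made-up struct toy rather than struct boot_params, the same
"clear from field X up to field Y" operation can be written with offsets from
the start of the struct. The names toy and clear_a_to_end_marker are
hypothetical, and whether a given gcc version stays quiet on this form is
something to check rather than take on trust.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct; only the layout idea matters. */
struct toy {
	unsigned int first;		/* must not be touched */
	unsigned int a;			/* start of the range to clear */
	unsigned char b[8];		/* also inside the range */
	unsigned int end_marker;	/* first field that must survive */
};

/* Clear the bytes from 'a' up to, but not including, 'end_marker'. */
static void clear_a_to_end_marker(struct toy *t)
{
	size_t start = offsetof(struct toy, a);
	size_t stop = offsetof(struct toy, end_marker);

	assert(t != NULL);
	/* The pointer is derived from the whole object, not from &t->a. */
	memset((char *)t + start, 0, stop - start);
}
```

The length is the distance between two offsets inside the same object, so the
cleared region is the same one a member-address subtraction would give, but
the destination pointer refers to the enclosing struct.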


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-11 Thread Chris Clayton



On 11/10/2018 13:23, Maciej S. Szmigiero wrote:
> On 11.10.2018 10:24, Chris Clayton wrote:
>> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>>> Now, knowing the "right" value you can experiment with what 
>>>>> rtl_init_rxcfg()
>>>>> writes (under the "default:" label for your NIC model).
>>>>>
>>>>
>>>> This might be more interesting. Through a combination of viewing the
>>>> output from pr_notice() and the output from "ethtool -d", I can see
>>>> RxConfig with the following values:
>>>>
>>>>     During boot:    0x00028700
>>>>     Before suspend: 0x0002870e
>>>>     During resume:  0x00024000
>>>>     Post resume:    0x0002870e
>>>>
>>>> As I did with 4.18.10 early on in the process, I removed the call to
>>>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and
>>>> rebooted. Now I see the following values:
>>>>
>>>>     During boot:    0x00028700
>>>>     Before suspend: 0x0002870e
>>>>     During resume:  0x00024000
>>>>     Post resume:    0x0002400e
>>>>
>>>
>>> Now we can finally see some difference...
>>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>>> is kind of expected - one can see that the working configuration
>>> post-resume has bit 14 (or 0x4000) set, too.
>>>
>>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>>
>>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>>> change:
>>> --- r8169.c
>>> +++ r8169.c
>>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>>  	case RTL_GIGA_MAC_VER_34:
>>>  	case RTL_GIGA_MAC_VER_35:
>>> +	case RTL_GIGA_MAC_VER_38:
>>>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>>  		break;
>>>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>>
>>> This will add RX_MULTI_EN also for your chip model (you need to add back
>>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>>
>>
>> That's done the trick. With the above change applied, my network runs
>> fine after a suspend/resume cycle and the ping times are back in the
>> 14-15ms range.
> 
> Nice!
> 
> I will submit a patch, it would be great if you could test it and then
> add a "Tested-by:" tag.
>  

Will do, Maciej.

Thanks for solving this.
>> Chris
> 
> Maciej
> 
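
The two post-resume RxConfig values reported in this thread (0x0002870e with
the stock driver, 0x0002400e with the rtl_init_rxcfg() call removed) can be
compared bit by bit. Below is a minimal user-space sketch that uses only the
bit positions quoted above: bit 15 for RX128_INT_EN, bit 14 for RX_MULTI_EN
and bits 8-10 for RX_DMA_BURST. The helpers rxcfg_diff and rxcfg_describe are
made-up names for illustration and are not part of r8169.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values as quoted in the thread. */
#define RX128_INT_EN	(1u << 15)	/* 0x8000 */
#define RX_MULTI_EN	(1u << 14)	/* 0x4000 */
#define RX_DMA_BURST	(7u << 8)	/* 0x0700, bits 8-10 */

/* Hypothetical helper: bits that differ between two register snapshots. */
static uint32_t rxcfg_diff(uint32_t a, uint32_t b)
{
	return a ^ b;
}

/* Hypothetical helper: print the three fields discussed in the thread. */
static void rxcfg_describe(const char *label, uint32_t v)
{
	assert(label != NULL);
	printf("%s: 0x%08x RX128_INT_EN=%d RX_MULTI_EN=%d RX_DMA_BURST=%u\n",
	       label, (unsigned int)v,
	       (v & RX128_INT_EN) != 0,
	       (v & RX_MULTI_EN) != 0,
	       (unsigned int)((v & RX_DMA_BURST) >> 8));
}
```

rxcfg_diff(0x0002870e, 0x0002400e) evaluates to 0x0000c700, which is exactly
RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST. The failing value has RX128_INT_EN
and RX_DMA_BURST set but not RX_MULTI_EN, and the working value is the other
way round, which matches the reasoning behind the proposed one-line change.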


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-11 Thread Chris Clayton



On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
> On 11.10.2018 00:49, Chris Clayton wrote:
>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>> writes (under the "default:" label for your NIC model).
>>>
>>
>> This might be more interesting. Through a combination of viewing the
>> output from pr_notice() and the output from "ethtool -d", I can see
>> RxConfig with the following values:
>>
>>     During boot:    0x00028700
>>     Before suspend: 0x0002870e
>>     During resume:  0x00024000
>>     Post resume:    0x0002870e
>>
>> As I did with 4.18.10 early on in the process, I removed the call to
>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and
>> rebooted. Now I see the following values:
>>
>>     During boot:    0x00028700
>>     Before suspend: 0x0002870e
>>     During resume:  0x00024000
>>     Post resume:    0x0002400e
>>
> 
> Now we can finally see some difference...
> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
> is kind of expected - one can see that the working configuration
> post-resume has bit 14 (or 0x4000) set, too.
> 
> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
> 
> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
> change:
> --- r8169.c
> +++ r8169.c
> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>  	case RTL_GIGA_MAC_VER_34:
>  	case RTL_GIGA_MAC_VER_35:
> +	case RTL_GIGA_MAC_VER_38:
>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>  		break;
>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
> 
> This will add RX_MULTI_EN also for your chip model (you need to add back
> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>

That's done the trick. With the above change applied, my network runs fine
after a suspend/resume cycle and the ping times are back in the 14-15ms
range.

Chris

> If this does not help then I would try other values in the above write:
> 1) RTL_W32(tp, RxConfig, 0x00024000);
> 2) RTL_W32(tp, RxConfig, 0x4000);
> 3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
> 4) RTL_W32(tp, RxConfig, RX128_INT_EN);
> 
>> Chris
> 
> Maciej
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton
OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>>>> but tried it anyway. I can confirm that the regression is still present
>>>>> and my network still fails when, after a resume from suspend (to ram or
>>>>> disk), I open my browser or my mail client. In both those cases the
>>>>> failure is almost immediate - e.g. my home page doesn't get displayed in
>>>>> the browser. Pinging one of my ISP's name servers doesn't fail quite so
>>>>> quickly but the reported time increases from 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>>> pre and post suspend. Again, they are identical. Heiner specifically
>>> suggested looking at the RxConfig. The value of that is 0x0002870e both
>>> pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from 
>> rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
> 

This change made no difference. Networking still dies if I open a browser or 
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		  (unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
> 

This might be more interesting. Through a combination of viewing the output
from pr_notice() and the output from "ethtool -d", I can see RxConfig with
the following values:

    During boot:    0x00028700
    Before suspend: 0x0002870e
    During resume:  0x00024000
    Post resume:    0x0002870e

As I did with 4.18.10 early on in the process, I removed the call to
rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
Now I see the following values:

    During boot:    0x00028700
    Before suspend: 0x0002870e
    During resume:  0x00024000
    Post resume:    0x0002400e

As with 4.18.10, networking now appears to be stable after the resume.
Starting a browser results in my homepage being displayed and I've spent a
few minutes surfing with no interruptions. Similarly, ping runs without
stopping. I simply don't know enough to know what might now be enabled or
disabled by this change in value, but hopefully it will provide a clue to
someone as to what is going on.

Chris

> Hope this helps,
> Maciej
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton
OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but 
>>>>> tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a 
>>>>> resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>> immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so 
>>>>> quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre 
>>> and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 
>>> 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from 
>> rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seems to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
> 

This change made no difference. Networking still dies if I open a browser or 
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>   /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>   RTL_R8(tp, IntrMask);
>   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> + pr_notice("RxConfig before init was %.8x\n",
> + (unsigned int)RTL_R32(tp, RxConfig));
>   rtl_init_rxcfg(tp);
>   rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
> 
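
For anyone repeating this test, a minimal sketch of pulling the logged
values back out of the kernel log. It assumes only the message text from the
pr_notice() format string in the patch above; the function name and the
sample timestamps are made up for illustration, and any prefix the kernel
puts in front of the message is ignored.

```python
import re

# Matches the message emitted by the pr_notice() in the diagnostic patch:
#   "RxConfig before init was %.8x"
RXCONFIG_RE = re.compile(r"RxConfig before init was ([0-9a-fA-F]{8})")


def rxconfig_values(log_text):
    """Return every logged RxConfig value, in log order, as integers."""
    return [int(m.group(1), 16) for m in RXCONFIG_RE.finditer(log_text)]


# Hypothetical log excerpt: the timestamps are invented, the two values are
# the "during boot" and "during resume" readings reported in this thread.
sample = (
    "[   12.345678] RxConfig before init was 00028700\n"
    "[  345.678901] RxConfig before init was 00024000\n"
)

for value in rxconfig_values(sample):
    print(f"0x{value:08x}")
```

Feeding it the saved output of dmesg gives one value per pass through
rtl_hw_start(), i.e. one at boot and one per resume.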

This might be more interesting. Through a combination of viewing the output 
from pr_notice() and the output from
"ethtool -d", I can see RxConfig with the following values

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

As I did with 4.18.10 early on in the process, I removed the call to 
rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
installed and rebooted. Now I see the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002400e

As with 4.18.10, networking now appears to be stable after the resume. Starting 
a browser results in my homepage being
displayed and I've spent a few minutes surfing with no interruptions. 
Similarly, ping runs without stopping. I simply
don't know enough to know what might now be enabled or disabled by this change 
in value, but hopefully it will provide a
clue to someone as to what is going on.
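
To narrow down what changed, the two post-resume values can be compared bit
by bit. This is only a sketch of the arithmetic: it reports bit positions,
and mapping those positions to RxConfig field names would still have to be
done against the register definitions in r8169.c, which are not assumed
here. The helper name is made up for illustration.

```python
def differing_bits(a, b, width=32):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(width) if diff & (1 << bit)]


with_init = 0x0002870E     # post resume, rtl_init_rxcfg() still called
without_init = 0x0002400E  # post resume, call removed (network stable)

print(f"xor = 0x{with_init ^ without_init:08x}")
for bit in differing_bits(with_init, without_init):
    if with_init & (1 << bit):
        print(f"bit {bit:2d}: set with the call, clear without it")
    else:
        print(f"bit {bit:2d}: clear with the call, set without it")
```

The low byte (0x0e) is the same in both values, so the difference is
confined to bits 8-10, 14 and 15.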

Chris

> Hope this helps,
> Maciej
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton
Too late at night to be doing this stuff. Clicked send instead of saving a 
draft. Sorry, please ignore.

On 10/10/2018 23:30, Chris Clayton wrote:
> OK, right kernel/module used this time. Please see findings below.
> 
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, 
>>>>>> but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a 
>>>>>> resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>>> immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite 
>>>>>> so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be something obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>>> device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" 
>>>> pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that 
>>>> is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your 
>>> original
>>> report you state that removing the call to rtl_init_rxcfg() from 
>>> rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seems to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
> 
> This change made no difference. Networking still dies if I open a browser or 
> leave ping running long enough.
> 
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>>  /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>>  RTL_R8(tp, IntrMask);
>>  RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> +pr_notice("RxConfig before init was %.8x\n",
>> +(unsigned int)RTL_R32(tp, RxConfig));
>>  rtl_init_rxcfg(tp);
>>  rtl_set_tx_config_registers(tp);
>>  
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
> 
> This might be more interesting. Through a combination of viewing the output 
> from pr_notice() and the output from "ethtool -d", I can see RxConfig with
> the following values:
> 
>   During boot:    0x00028700
>   Before suspend: 0x0002870e
>   During resume:  0x00024000
>   Post resume:    0x0002870e
> 
> I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, 
> installed and rebooted. Now I see the
> following values:
> 
>   During boot:    0x00028700
>   Before suspend: 0x0002870e
>   During resume:  0x00024000
>   Post resume:    0x0002870e
> 
>>
>> Hope this helps,
>> Maciej
>>


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton
OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but 
>>>>> tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a 
>>>>> resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>> immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so 
>>>>> quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre 
>>> and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 
>>> 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from 
>> rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seems to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
> 

This change made no difference. Networking still dies if I open a browser or 
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>   /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>   RTL_R8(tp, IntrMask);
>   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> + pr_notice("RxConfig before init was %.8x\n",
> + (unsigned int)RTL_R32(tp, RxConfig));
>   rtl_init_rxcfg(tp);
>   rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).

This might be more interesting. Through a combination of viewing the output from 
pr_notice() and the output from "ethtool -d", I can see RxConfig with the
following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, 
installed and rebooted. Now I see the
following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

> 
> Hope this helps,
> Maciej
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton
Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I 
tested the wrong kernel/module to get the
results I provided below. That, however, may make the results more interesting 
because they happened with a virgin rc7
kernel/module.

I'll test your proposals properly later.

Chris

On 10/10/2018 09:09, Chris Clayton wrote:
> 
> 
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, 
>>>>>> but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a 
>>>>>> resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>>> immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite 
>>>>>> so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be something obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>>> device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" 
>>>> pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that 
>>>> is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your 
>>> original
>>> report you state that removing the call to rtl_init_rxcfg() from 
>>> rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seems to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
> After testing your first proposal, which made no difference, I found the 
> following in the output from dmesg:
> 
> [  761.999468] ------------[ cut here ]------------
> [  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> [  761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 
> dev_watchdog+0x1e9/0x1f0
> [  761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep 
> iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
> nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc 
> videobuf2_memops snd_hda_codec_via
> videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common 
> usbhid realtek coretemp snd_hda_intel hwmon
> snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last 
> unloaded: btintel]
> [  761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
> [  761.999504] Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
> [  761.999508] Workqu

Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-10 Thread Chris Clayton



On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but 
>>>>> tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a 
>>>>> resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>> immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so 
>>>>> quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre 
>>> and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 
>>> 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from 
>> rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seems to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
> 
After testing your first proposal, which made no difference, I found the 
following in the output from dmesg:

[  761.999468] ------------[ cut here ]------------
[  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 
dev_watchdog+0x1e9/0x1f0
[  761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep 
iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc 
videobuf2_memops snd_hda_codec_via
videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid 
realtek coretemp snd_hda_intel hwmon
snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last 
unloaded: btintel]
[  761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
[  761.999504] Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
[  761.999508] Workqueue: events rtl_task [r8169]
[  761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
[  761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b 
c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 
07 00 00 00 00
[  761.999513] RSP: 0018:88040f803e98 EFLAGS: 00010282
[  761.999514] RAX:  RBX:  RCX: 0006
[  761.999516] RDX: 0007 RSI: 0096 RDI: 88040f8153d0
[  761.999517] RBP: 88040ca9a3b8 R08: 813565f0 R09: 034e
[  761.999517] R10: 0007 R11:  R12: 88040ca9a39c
[  761.999518] R13: 880
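
When checking whether a test kernel still hits the problem, it can be easier
to scan saved dmesg output for the watchdog line than to read the whole
trace. A minimal sketch, assuming only the line shape seen in the trace
above; the function name and the dictionary layout are made up for
illustration.

```python
import re

# Shape of the watchdog line as it appears in the trace above:
#   [  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def watchdog_events(log_text):
    """Return one dict per transmit-queue timeout found in the log text."""
    return [
        {
            "time": float(m.group("time")),
            "iface": m.group("iface"),
            "driver": m.group("driver"),
            "queue": int(m.group("queue")),
        }
        for m in WATCHDOG_RE.finditer(log_text)
    ]


# The one watchdog line quoted in this message.
sample = (
    "[  761.999471] NETDEV WATCHDOG: eth0 (r8169): "
    "transmit queue 0 timed out\n"
)

for event in watchdog_events(sample):
    print(event)
```

An empty result after a suspend/resume cycle plus some traffic would suggest
the timeout did not recur, though it says nothing about ping latency.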

Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-09 Thread Chris Clayton



On 09/10/2018 22:39, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but 
>>>> tried it anyway. I can confirm that the
>>>> regression is still present and my network still fails when, after a 
>>>> resume from suspend (to ram or disk), I open my
>>>> browser or my mail client. In both those cases the failure is almost 
>>>> immediate - e.g. my home page doesn't get displayed
>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so 
>>>> quickly but the reported time increases from
>>>> 14-15ms to more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be some obvious difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>> device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre 
>> and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of that is 
>> 0x0002870e both pre and post suspend.
>>
>> I've attached files I redirected the outputs to.
>>
>> Please don't hesitate to ask for any other information needed to solve this 
>> problem. In the meantime, I've now got
>> scripts that stop the network during suspend and restart it during resume. 
>> (Those scripts were removed whilst I gathered
>> the diagnostics shown in the attachments.)
>>
> I'd like to check whether it may be a timing issue. The following
> experimental patch adds a PCI commit after writing register ChipCmd.
> Could you please check whether it changes anything?
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 7d3f671e1..f3c359492 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
>   /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>   RTL_R8(tp, IntrMask);
>   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> + RTL_R8(tp, ChipCmd);
>   rtl_init_rxcfg(tp);
>   rtl_set_tx_config_registers(tp);
>  
> 

Sorry, this patch doesn't make any difference - my network still fails. After a
suspend/resume my browsers (chromium and firefox) both fail to open my home page
(https://www.google.co.uk). The ping time for one of my ISP's name servers
increases from 14-15ms to more than 1000ms, although after a few pings it does
reduce. As the screen grab below shows, the network does eventually fail.

$ ping NS1
PING ns1 (90.207.238.97): 56 data bytes
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
64 bytes from 90.207.238.97: icmp_seq=8 ttl=251 time=1015.612 ms
64 bytes from 90.207.238.97: icmp_seq=10 ttl=251 time=1015.551 ms
64 bytes from 90.207.238.97: icmp_seq=12 ttl=251 time=1015.446 ms
64 bytes from 90.207.238.97: icmp_seq=13 ttl=251 time=1015.657 ms
64 bytes from 90.207.238.97: icmp_seq=14 ttl=251 time=1015.614 ms
64 bytes from 90.207.238.97: icmp_seq=15 ttl=251 time=1015.651 ms
64 bytes from 90.207.238.97: icmp_seq=17 ttl=251 time=1015.459 ms
64 bytes from 90.207.238.97: icmp_seq=18 ttl=251 time=1015.443 ms
64 bytes from 90.207.238.97: icmp_seq=19 ttl=251 time=1015.936 ms
64 bytes from 90.207.238.97: icmp_seq=20 ttl=251 time=1015.681 ms
64 bytes from 90.207.238.97: icmp_seq=22 ttl=251 time=1015.410 ms
64 bytes from 90.207.238.97: icmp_seq=23 ttl=251 time=1015.487 ms
64 bytes from 90.207.238.97: icmp_seq=24 ttl=251 time=1016.169 ms
64 bytes from 90.207.238.97: icmp_seq=25 ttl=251 time=1015.659 ms
64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
64 bytes from 90.207.238.97: icmp_seq=30 ttl=251 time=32.765 ms
64 bytes from 90.207.238.97: icmp_seq=31 ttl=251 time=115.052 ms
64 bytes from 90.207.238.97: icmp_seq=33 ttl=251 time=757.115 ms
64 bytes from 90.207.238.97
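
A capture like the one above can be summarised mechanically. The following is only a rough sketch, not part of the original report: it assumes the reply lines keep the "icmp_seq=N ttl=N time=X ms" layout shown here, and the function name and the 1000 ms threshold are chosen purely for illustration. It lists the sequence numbers that never got a reply and counts the slow replies.

```python
import re

# Matches the reply lines printed by ping in the capture above.
PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def summarize_ping(output, slow_ms=1000.0):
    """Return (missing sequence numbers, number of replies slower than slow_ms)."""
    replies = {}
    for line in output.splitlines():
        match = PING_LINE.search(line)
        if match:  # truncated or unrelated lines are simply skipped
            replies[int(match.group(1))] = float(match.group(2))
    if not replies:
        return [], 0
    first, last = min(replies), max(replies)
    missing = [seq for seq in range(first, last + 1) if seq not in replies]
    slow = sum(1 for rtt in replies.values() if rtt > slow_ms)
    return missing, slow


# First six replies from the capture above.
sample = """\
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
"""

missing, slow = summarize_ping(sample)
print("missing:", missing, "slow replies:", slow)
```

Gaps in icmp_seq are lost packets, so running this over the pre-suspend and post-resume captures gives a quick side-by-side of loss and latency.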

Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-09 Thread Chris Clayton
Thanks to Maciej and Heiner for their replies.

On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
> On 07.10.2018 21:36, Chris Clayton wrote:
>> Hi again,
>>
>> I didn't think there was anything in 4.19-rc7 to fix this regression, but 
>> tried it anyway. I can confirm that the
>> regression is still present and my network still fails when, after a resume 
>> from suspend (to ram or disk), I open my
>> browser or my mail client. In both those cases the failure is almost 
>> immediate - e.g. my home page doesn't get displayed
>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so 
>> quickly but the reported time increases from
>> 14-15ms to more than 1000ms.
> 
> You can try comparing chip registers (ethtool -d eth0) in the working
> state (before a suspend) and in the broken state (after a resume).
> Maybe there will be some obvious difference.
> 
> The same goes for the PCI configuration (lspci -d :8168 -vv).
> 
Maciej suggested comparing the output from lspci -vv for the ethernet device. 
They are identical.

Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and 
post suspend. Again, they are identical.
Heiner specifically suggested looking at the RxConfig. The value of that is 
0x0002870e both pre and post suspend.
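
For readers who want to see what the low bits of that RxConfig value select, here is a small sketch. The bit names are an assumption taken from the rx_mode_bits constants as I recall them from r8169.c, so they should be checked against the driver source before relying on them. Only the six packet-accept bits are decoded; the upper bits (FIFO and DMA settings) are deliberately left alone.

```python
# Assumed accept-mode bits (low six bits of RxConfig); verify against r8169.c.
ACCEPT_BITS = {
    0x01: "AcceptAllPhys",
    0x02: "AcceptMyPhys",
    0x04: "AcceptMulticast",
    0x08: "AcceptBroadcast",
    0x10: "AcceptRunt",
    0x20: "AcceptErr",
}


def decode_accept_bits(rx_config):
    """Return the names of the accept bits set in an RxConfig value."""
    return [name for bit, name in sorted(ACCEPT_BITS.items()) if rx_config & bit]


# RxConfig value reported both before suspend and after resume.
flags = decode_accept_bits(0x0002870E)
print(flags)
```

Under those assumptions the reported value accepts unicast frames for the card's own address, multicast and broadcast, which is an ordinary non-promiscuous setting.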

I've attached files I redirected the outputs to.

Please don't hesitate to ask for any other information needed to solve this 
problem. In the meantime, I've now got
scripts that stop the network during suspend and restart it during resume. 
(Those scripts were removed whilst I gathered
the diagnostics shown in the attachments.)

Chris

>> Chris
> 
> Maciej
> 
ethtool -d eth0
===
RealTek RTL8411 registers:

0x00: MAC Address                              80:fa:5b:08:d0:3d
0x08: Multicast Address Filter                 0x 0x0080
0x10: Dump Tally Counter Command               0x0c2ec000 0x0004
0x20: Tx Normal Priority Ring Addr             0x07a0a000 0x0004
0x28: Tx High Priority Ring Addr               0x 0x
0x30: Flash memory read/write                  0x
0x34: Early Rx Byte Count                      0
0x36: Early Rx Status                          0x00
0x37: Command                                  0x0c
      Rx on, Tx on
0x3C: Interrupt Mask                           0x803f
      SERR LinkChg RxNoBuf TxErr TxOK RxErr RxOK
0x3E: Interrupt Status                         0x

0x40: Tx Configuration                         0x4b800f80
0x44: Rx Configuration                         0x0002870e
0x48: Timer count                              0x
0x4C: Missed packet counter                    0x00
0x50: EEPROM Command                           0x10
0x51: Config 0                                 0x00
0x52: Config 1                                 0xcf
0x53: Config 2                                 0x3c
0x54: Config 3                                 0x60
0x55: Config 4                                 0x10
0x56: Config 5                                 0x02
0x58: Timer interrupt                          0x
0x5C: Multiple Interrupt Select                0x
0x60: PHY access                               0x80040de1
0x64: TBI control and status                   0x2701
0x68: TBI Autonegotiation advertisement (ANAR) 0xf70c
0x6A: TBI Link partner ability (LPAR)          0x0002
0x6C: PHY status                               0xeb
0x84: PM wakeup frame 0                        0x 0x
0x8C: PM wakeup frame 1                        0x 0x
0x94: PM wakeup frame 2 (low)                  0x 0x
0x9C: PM wakeup frame 2 (high)                 0x 0x
0xA4: PM wakeup frame 3 (low)                  0x 0x
0xAC: PM wakeup frame 3 (high)                 0x 0x
0xB4: PM wakeup frame 4 (low)                  0x 0x
0xBC: PM wakeup frame 4 (high)                 0x 0x
0xC4: Wakeup frame 0 CRC                       0x
0xC6: Wakeup frame 1 CRC                       0x
0xC8: Wakeup frame 2 CRC                       0x
0xCA: Wakeup frame 3 CRC                       0x
0xCC: Wakeup frame 4 CRC                       0x
0xDA: RX packet maximum size                   0x4000
0xE0: C+ Command                               0x20e1
      VLAN de-tagging
      RX checksumming
0xE2: Interrupt Mitigation                     0x5151
      TxTimer:   5
      TxPackets: 1
      RxTimer:   5
      RxPackets: 1
0xE4: Rx Ring Addr                             0x07935000 0x0004
0xEC: Early Tx threshold                       0x27
0xF0: Func Event                               0x0040003f
0xF4: Func Event Mask                          0x
0xF8: Func Preset State                        0x00031eff
0xFC: Func Force Event
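
The before/after comparison of two such dumps can be scripted rather than done by eye. This is a minimal sketch, assuming the "ethtool -d eth0" output was saved as text before the suspend and again after the resume. The function name is hypothetical, and the changed value 0x0002870f is made up purely to show what a difference would look like; the dumps in this report were identical.

```python
import difflib


def diff_dumps(before, after):
    """Return unified-diff lines between two register dump texts."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before-suspend",
            tofile="after-resume",
            lineterm="",
        )
    )


before = "0x40: Tx Configuration 0x4b800f80\n0x44: Rx Configuration 0x0002870e\n"
after_same = before
# Hypothetical changed value, for illustration only.
after_changed = before.replace("0x0002870e", "0x0002870f")

print(diff_dumps(before, after_same))  # identical dumps give an empty list
for line in diff_dumps(before, after_changed):
    print(line)
```

An empty result matches what was reported here: no register visible through ethtool differs across the suspend/resume cycle.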

Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-07 Thread Chris Clayton
Hi again,

I didn't think there was anything in 4.19-rc7 to fix this regression, but tried 
it anyway. I can confirm that the
regression is still present and my network still fails when, after a resume 
from suspend (to ram or disk), I open my
browser or my mail client. In both those cases the failure is almost immediate 
- e.g. my home page doesn't get displayed
in the browser. Pinging one of my ISP's name servers doesn't fail quite so 
quickly but the reported time increases from
14-15ms to more than 1000ms.

Chris

On 04/10/2018 09:41, Chris Clayton wrote:
> Hi Heiner,
> 
> Here's the reply to your questions. Sorry for the delay.
> 
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>>>>> network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is 
>>>>> fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 
>>>>> 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page 
>>>>> (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>>>>> can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 
>>>>> module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the 
>>> problem is not solved by it. Similarly, I applied
>>> Heiner's patch to 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has 
>> no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact 
>> chip version.
>> Can you provide the dmesg part with the XID?
> 
> $ dmesg | grep r8169
> [    5.274938] libphy: r8169: probed
> [    5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [    5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [    9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
> [    9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> 
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>>
> 
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% 
> sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the 
> recommendation in the kconfig help. Help on MSI
> has a very clear "say Y". I've re-enabled it now.
> 
> Chris
> 
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
>>


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-04 Thread Chris Clayton
Hi Heiner,

Here's the reply to your questions. Sorry for the delay.

On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>>>> network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. 
>>>> But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 
>>>> 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page 
>>>> (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>>>> can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 
>>>> module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem 
>> is not solved by it. Similarly, I applied
>> Heiner's patch to 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has 
> no link to suspend/resume.
> 
> Chris, the lspci output doesn't provide enough detail to determine the exact 
> chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep r8169
[    5.274938] libphy: r8169: probed
[    5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[    5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[    9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
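
When collecting the chip details for several machines, the probe line can be picked apart with a small script. This is an illustrative sketch only: the regular expression is modelled on the single probe line in this report and assumes the log entry sits on one line, so other kernel versions may format it differently. The function and field names are hypothetical.

```python
import re

# Modelled on the r8169 probe line quoted in this thread.
PROBE_RE = re.compile(
    r"r8169 \S+ (?P<ifname>\w+): (?P<chip>RTL\w+), "
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"XID (?P<xid>[0-9a-f]+), IRQ (?P<irq>\d+)"
)


def parse_probe_line(line):
    """Return chip, MAC, XID and IRQ from an r8169 probe line, or None."""
    match = PROBE_RE.search(line)
    return match.groupdict() if match else None


line = (
    "[    5.276563] r8169 :05:00.2 eth0: RTL8411, "
    "80:fa:5b:08:d0:3d, XID 48800800, IRQ 29"
)
info = parse_probe_line(line)
print(info)
```

The XID is the field the maintainers asked for, since lspci alone does not identify the exact chip version.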

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
> 

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% 
sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the 
recommendation in the kconfig help. Help on MSI
has a very clear "say Y". I've re-enabled it now.

Chris

> Heiner
> 
>>> Maciej
>>>
>> Chris
>>
> 
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-10-04 Thread Chris Clayton
Hi Heiner,

Here's the reply to your questions. Sorry for the delay.

On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>>>> network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. 
>>>> But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 
>>>> 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page 
>>>> (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>>>> can revive the network by stopping it with
>>>> /sbin/if(down,up} but sometimes it is necessary to also remove the r8169 
>>>> module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem 
>> is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has 
> no link to suspend/resume.
> 
> Chris, the lspci output doesn't provide enough detail to determine the exact 
> chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep r8169
[5.274938] libphy: r8169: probed
[5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
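
In case it is useful to anyone following along, the XID can be pulled out of
that output directly. This is only a minimal sketch: extract_xid is a
hypothetical helper name, and it assumes the log lines are unwrapped (one
message per line) and that the value after "XID " is plain hex, as it is above.

```shell
# extract_xid (hypothetical helper): read driver log lines on stdin and
# print the hex value that follows "XID " on any line that has one.
# Lines without an XID field produce no output.
extract_xid() {
    sed -n 's/.*XID \([0-9a-fA-F][0-9a-fA-F]*\).*/\1/p'
}
```

Typical use would be something like "dmesg | grep r8169 | extract_xid".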

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
> 

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% 
sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the 
recommendation in the kconfig help. Help on MSI
has a very clear "say Y". I've re-enabled it now.

Chris

> Heiner
> 
>>> Maciej
>>>
>> Chris
>>
> 
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-09-29 Thread Chris Clayton
Sorry, sent by accident. Note to self - don't attempt email until after second 
cup of coffee.

On 29/09/2018 08:25, Chris Clayton wrote:
> 
> 
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>>>>> network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is 
>>>>> fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 
>>>>> 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page 
>>>>> (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>>>>> can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 
>>>>> module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the 
>>> problem is not solved by it. Similarly, I applied
>>> Heiner's patch to the 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has 
>> no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact 
>> chip version.
>> Can you provide the dmesg part with the XID?

I meant to say that I have now re-enabled MSI in 4.18.7 - the latest stable 
series kernel in which eth0 continues to
function reliably after a suspend/resume cycle. The second dmesg output below 
is taken from that kernel. The first one
was from an up-to-date 4.19 kernel.
> 
> $ dmesg | grep -i r8169
> [5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
> [5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [   10.232077] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   10.235218] r8169 :05:00.2 eth0: link down
> [   11.717460] r8169 :05:00.2 eth0: link up
> 
> $ dmesg | grep -i r8169
> [5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [   10.456081] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   10.459217] r8169 :05:00.2 eth0: link down
> [   10.459880] r8169 :05:00.2 eth0: link down
> [   12.015158] r8169 :05:00.2 eth0: link up
> 
> 
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
> 
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% 
> sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the 
> recommendation in the kconfig help. Help on MSI
> has a very clear "say Y".

As I said above I have re-enabled MSI.
> 
>>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-09-29 Thread Chris Clayton



On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>>>> network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. 
>>>> But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 
>>>> 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page 
>>>> (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>>>> can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 
>>>> module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem 
>> is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has 
> no link to suspend/resume.
> 
> Chris, the lspci output doesn't provide enough detail to determine the exact 
> chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep -i r8169
[5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
[5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[   10.232077] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   10.235218] r8169 :05:00.2 eth0: link down
[   11.717460] r8169 :05:00.2 eth0: link up

$ dmesg | grep -i r8169
[5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[   10.456081] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   10.459217] r8169 :05:00.2 eth0: link down
[   10.459880] r8169 :05:00.2 eth0: link down
[   12.015158] r8169 :05:00.2 eth0: link up


> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% 
sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the 
recommendation in the kconfig help. Help on MSI
has a very clear "say Y".
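
For anyone wanting to check the same thing on their own build, here is a
rough sketch. It assumes the kernel configuration is available as a
plain-text .config file and that the symbol in question is CONFIG_PCI_MSI;
the function name msi_config_state and the example path below are purely
illustrative.

```shell
# msi_config_state FILE (hypothetical helper)
# Prints "enabled" if FILE contains the exact line CONFIG_PCI_MSI=y,
# otherwise prints "disabled" (this includes the "# CONFIG_PCI_MSI is not set"
# form that kconfig writes for options that are switched off).
msi_config_state() {
    if grep -q '^CONFIG_PCI_MSI=y$' "$1"; then
        echo enabled
    else
        echo disabled
    fi
}
```

It could be run as, for example, "msi_config_state /usr/src/linux/.config",
with the path adjusted to wherever the kernel tree lives.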

> 
> Heiner
> 
>>> Maciej
>>>
>> Chris
>>
> 


Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

2018-09-28 Thread Chris Clayton
Thanks Maciej.

On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
> Hi,
> 
>> Hi,
>>
>> I upgraded my kernel to 4.18.10 recently and have since been experiencing 
>> network problems after resuming from a
>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>
>> The pattern of the problem is that when I first boot, the network is fine. 
>> But, after resume from suspend I find that
>> the time taken for a ping of one of my ISP's nameservers increases from 
>> 14-15ms to more than 1000ms. Moreover, when I
>> open a browser (chromium or firefox), it fails to retrieve my home page 
>> (https://www.google.co.uk) and pings of the
>> nameserver fail with the message "Destination Host Unreachable". Often, I 
>> can revive the network by stopping it with
>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 
>> module and load it again.
> 
> Please have a look at the following thread:
> https://lkml.org/lkml/2018/9/25/1118
> 

I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is 
not solved by it. Similarly, I applied
Heiner's patch to the 4.19, but again the problem is not solved.

> Maciej
> 
Chris


Re: [PATCH V2] mfd: rtsx: release IRQ during shutdown

2018-01-03 Thread Chris Clayton


On 03/01/18 12:32, Sinan Kaya wrote:
> 'Commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during
> shutdown")' revealed a resource leak in rtsx_pci driver during shutdown.
> 
> Issue shows up as a warning during shutdown as follows:
> 
> remove_proc_entry: removing non-empty directory 'irq/17', leaking at least
> 'rtsx_pci'
> WARNING: CPU: 0 PID: 1578 at fs/proc/generic.c:572
> remove_proc_entry+0x11d/0x130
> Modules linked in 
> ...
> Call Trace:
> unregister_irq_proc
> free_desc
> irq_free_descs
> mp_unmap_irq
> acpi_unregister_gsi_apic
> acpi_pci_irq_disable
> do_pci_disable_device
> pci_disable_device
> device_shutdown
> kernel_restart
> Sys_reboot
> 
> Even though rtsx_pci driver implements a shutdown callback, it is not
> releasing the interrupt that it registered during probe. This is causing
> the ACPI layer to complain that the shared IRQ is in use while freeing
> IRQ.
> 
> This code releases the IRQ to prevent resource leak and eliminate the
> warning.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=198141
> Reported-by: Chris Clayton <chris2...@googlemail.com>
> Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
> Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
> ---
>  drivers/mfd/rtsx_pcr.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/mfd/rtsx_pcr.c b/drivers/mfd/rtsx_pcr.c
> index 590fb9a..c3ed885 100644
> --- a/drivers/mfd/rtsx_pcr.c
> +++ b/drivers/mfd/rtsx_pcr.c
> @@ -1543,6 +1543,9 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>   rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>  
>   pci_disable_device(pcidev);
> + free_irq(pcr->irq, (void *)pcr);
> + if (pcr->msi_en)
> + pci_disable_msi(pcr->pci);
>  }
>  
>  #else /* CONFIG_PM */

I've applied v2 of the patch and built and installed the kernel (-rc6). All I 
can say is that my system still closes
down without the warning and call trace that the unpatched kernel produces. 
It's the best I can do by way of a test
because I have no idea what the code added in v2 is supposed to achieve and, 
because my system shuts down (or reboots)
moments later, there is no opportunity to check. If that constitutes a valid 
test:

Tested-by: Chris Clayton <chris2...@googlemail.com>

> 


Re: Oops on 4.15-rc[123] on shutdown/reboot

2017-12-11 Thread Chris Clayton
On 11/12/17 17:17, Bjorn Helgaas wrote:
> [+cc linux-pci]
> 
> On Mon, Dec 11, 2017 at 11:29:50AM -0500, Sinan Kaya wrote:
>> Hi Chris,
>>
>>>
>>> I'm more than happy to provide additional diagnostics and test proposed 
>>> fixes. As a starter for ten, I've attached the
>>> output from 'lspci -v'. If, however, you need to see the backtrace, I'll 
>>> need some advice on how to capture that.
>>>
>>
>> Can you open a bugzilla and also share the boot log?
>>
>> There must be something unique about your system.
> 
> Can you attach "lspci -vv" output (as root) to the bugzilla, too?
> 

I've opened the bugzilla report (Bug 198141) and attached the dmesg and lspci 
-vv outputs to it.




Re: Oops on 4.15-rc[123] on shutdown/reboot

2017-12-11 Thread Chris Clayton


On 11/12/17 17:24, Sinan Kaya wrote:
> On 12/11/2017 12:06 PM, Chris Clayton wrote:
>> Here's the output of dmesg for 4.15.0-rc3. I'll open a bugzilla later and 
>> add this and the lspci output that I sent with
>> my original report.
> 
> This was helpful. I don't see any AER/DPC in your log. It looks like the only 
> PCIe
> portdrv service you have is PME.
> 
> Can we do a quick hack and return immediately from 
> 
> static int pcie_pme_probe(struct pcie_device *srv)
> 
> by putting return 0; at the top. 
> 
> Same thing in 
> 
> static void pcie_pme_remove(struct pcie_device *srv)
> 
> just place a return at the top.
> 

I made those changes (to drivers/pci/pcie/pme.c) and built and installed the 
kernel.  Sorry, but I still get the oops
when I reboot.

> I'm hoping your problem will go away after this. Then, we can start peeling 
> the onion.
> 


Re: Oops on 4.15-rc[123] on shutdown/reboot

2017-12-11 Thread Chris Clayton


On 11/12/17 16:29, Sinan Kaya wrote:
> Hi Chris,
> 
>>
>> I'm more than happy to provide additional diagnostics and test proposed 
>> fixes. As a starter for ten, I've attached the
>> output from 'lspci -v'. If, however, you need to see the backtrace, I'll 
>> need some advice on how to capture that.
>>
> 
> Can you open a bugzilla and also share the boot log?
> 

Here's the output of dmesg for 4.15.0-rc3. I'll open a bugzilla later and add 
this and the lspci output that I sent with
my original report.

> There must be something unique about your system.
>  
> Sinan
> 
[0.00] Linux version 4.15.0-rc3 (chris@laptop) (gcc version 7.2.1 20171207 (GCC)) #398 SMP PREEMPT Mon Dec 11 07:46:20 GMT 2017
[0.00] Command line: ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x0100-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable
[0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable
[0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved
[0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable
[0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved
[0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable
[0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved
[0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable
[0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable
[0.00] NX (Execute Disable) protection: active
[0.00] random: fast init done
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C write-protect
[0.00]   D-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 7C write-back
[0.00]   1 base 04 mask 7FE000 write-back
[0.00]   2 base 00E000 mask 7FE000 uncachable
[0.00]   3 base 00DE00 mask 7FFE00 uncachable
[0.00]   4 base 00DD00 mask 7FFF00 uncachable
[0.00]   5 base 041FE0 mask 7FFFE0 uncachable
[0.00]   6 disabled
[0.00]   7 disabled
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.00] e820: update [mem 0xdd00-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [(ptrval)]
[0.00] Base memory trampoline at [(ptrval)] 97000 size 24576
[0.00] Using GB pages for direct mapping
[0.00] BRK [0x02098000, 0x02098fff] PGTABLE
[0.00] BRK [0x02099000, 0x02099fff] PGTABLE
[0.00] BRK [0x0209a000, 0x0209afff] PGTABLE
[0.00] BRK [0x0209b000, 0x0209bfff] PGTABLE
[0.00] BRK [0x0209c000, 0x0209cfff] PGTABLE
[0.00] BRK [0x0209d000, 0x0209dfff] PGTABLE
[0.00] ACPI: Early 

Oops on 4.15-rc[123] on shutdown/reboot

2017-12-11 Thread Chris Clayton
I've been getting an oops when shutting down my laptop (with /sbin/halt) or
rebooting it (/sbin/reboot or /usr/sbin/kexec). Unfortunately, I can't provide
the backtrace because it is on the screen for only a moment before the system
shuts down/reboots.

I have, however, bisected it and the outcome is:

cc27b735ad3a75574a6ab1a66ed6b09385e77e5e is the first bad commit
commit cc27b735ad3a75574a6ab1a66ed6b09385e77e5e
Author: Sinan Kaya
Date:   Wed Oct 25 15:01:02 2017 -0400

    PCI/portdrv: Turn off PCIe services during shutdown

    Some of the PCIe services such as AER are being left enabled during
    shutdown. This might cause spurious AER errors while SOC is being powered
    down.

    Clean up the PCIe services gracefully during shutdown to clear these false
    positives.

    Signed-off-by: Sinan Kaya
    Signed-off-by: Bjorn Helgaas

:040000 040000 5a827d6956c581344a0bf392e30155c337673c1d 76c6a39b53604a0a0a370383c3503f80aa7cbc1e M	drivers

I'm confident that this is the correct outcome because a kernel built with the
preceding commit (6018182d3158505f11103adaee8ffb53424df986) does not oops. Nor
does -rc3 with the patch reversed.

I'm more than happy to provide additional diagnostics and test proposed fixes.
As a starter for ten, I've attached the output from 'lspci -v'. If, however,
you need to see the backtrace, I'll need some advice on how to capture that.

Chris

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06)
	Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller
	Flags: bus master, fast devsel, latency 0
	Capabilities: [e0] Vendor Specific Information: Len=0c

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: None
	Prefetchable memory behind bridge: None
	Capabilities: [88] Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [a0] Express Root Port (Slot+), MSI 00
	Kernel driver in use: pcieport

00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06) (prog-if 00 [VGA controller])
	Subsystem: CLEVO/KAPOK Computer 4th Gen Core Processor Integrated Graphics Controller
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f780 (64-bit, non-prefetchable) [size=4M]
	Memory at e000 (64-bit, prefetchable) [size=256M]
	I/O ports at f000 [size=64]
	[virtual] Expansion ROM at 000c [disabled] [size=128K]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915

00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
	Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f7f14000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 2
	Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05) (prog-if 30 [XHCI])
	Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series Chipset Family USB xHCI
	Flags: bus master, medium devsel, latency 0, IRQ 16
	Memory at f7f0 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [70] Power Management version 2
	Capabilities: [80] MSI: Enable- Count=1/8 Maskable- 64bit+
	Kernel driver in use: xhci_hcd

00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
	Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series Chipset Family MEI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 11
	Memory at f7f1e000 (64-bit, non-prefetchable) [size=16]
	Capabilities: [50] Power Management version 3
	Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+

00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05) (prog-if 20 [EHCI])
	Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series

"PM / QoS: Fix device resume latency PM QoS" breaks sound

2017-10-28 Thread Chris Clayton
Hi,

I pulled the latest changes from Linus' tree this evening and have found that,
with the new kernel, sound is not working on my laptop. More precisely, the
built-in speakers don't produce any sound. Sound does work when I plug
earphones into the headphone socket. It also works via a bluetooth speaker.

I've bisected the problem and ended up at:

0cc2b4e5a020fc7f4d1795741c116c983e9467d7 is the first bad commit
commit 0cc2b4e5a020fc7f4d1795741c116c983e9467d7
Author: Rafael J. Wysocki
Date:   Tue Oct 24 15:20:45 2017 +0200

    PM / QoS: Fix device resume latency PM QoS

    The special value of 0 for device resume latency PM QoS means
    "no restriction", but there are two problems with that.

    First, device resume latency PM QoS requests with 0 as the
    value are always put in front of requests with positive
    values in the priority lists used internally by the PM QoS
    framework, causing 0 to be chosen as an effective constraint
    value.  However, that 0 is then interpreted as "no restriction"
    effectively overriding the other requests with specific
    restrictions which is incorrect.

    Second, the users of device resume latency PM QoS have no
    way to specify that *any* resume latency at all should be
    avoided, which is an artificial limitation in general.

    To address these issues, modify device resume latency PM QoS to
    use S32_MAX as the "no constraint" value and 0 as the "no
    latency at all" one and rework its users (the cpuidle menu
    governor, the genpd QoS governor and the runtime PM framework)
    to follow these changes.

    Also add a special "n/a" value to the corresponding user space I/F
    to allow user space to indicate that it cannot accept any resume
    latencies at all for the given device.

    Fixes: 85dc0b8a4019 (PM / QoS: Make it possible to expose PM QoS latency constraints)
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323
    Reported-by: Reinette Chatre
    Tested-by: Reinette Chatre
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Alex Shi
    Cc: All applicable

:040000 040000 f0c128ec799bb9894cfc5c341f88ad7bdfb15bac 9a2e8171ca47f864bd534cd9c160cce58449a889 M	Documentation
:040000 040000 0028ffec81675e686bdd621c0445d3e814d7980c 29db53c6356a6fed9c8bdbc2d6bc7bd56a96e529 M	drivers
:040000 040000 2e66b79bd2ffb4fcb00f04a69a0afe5c80d1d3f3 dd6d8e90b59389cd2bd8a0c92716d79d2eeb8268 M	include

With that change reverted, the speakers emit sound again.
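
The first problem described in the commit message can be sketched in a few
lines of C. This is only an illustration under stated assumptions: the names
effective_constraint, OLD_NO_RESTRICTION and NEW_NO_CONSTRAINT are invented
here and are not the kernel's PM QoS API, INT_MAX stands in for S32_MAX, and
the effective value is modelled simply as the minimum over all requests.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not the kernel's PM QoS API. */
#define OLD_NO_RESTRICTION 0        /* old encoding: 0 means "no restriction" */
#define NEW_NO_CONSTRAINT  INT_MAX  /* new encoding, standing in for S32_MAX */

/* Model the effective constraint as the smallest requested value. */
static int effective_constraint(const int *requests, size_t n)
{
    int min = INT_MAX;

    for (size_t i = 0; i < n; i++)
        if (requests[i] < min)
            min = requests[i];
    return min;
}

/* Old reading: an effective value of 0 means nothing is restricted. */
static bool old_reads_as_unrestricted(int effective)
{
    return effective == OLD_NO_RESTRICTION;
}

/* New reading: only the maximum value means nothing is restricted. */
static bool new_reads_as_unrestricted(int effective)
{
    return effective == NEW_NO_CONSTRAINT;
}
```

With the old encoding, a 0 request alongside a request for 100 gives an
effective value of 0, which reads as "no restriction" and so hides the 100.
With the new encoding the same pair gives 100, and a genuine 0 request now
reads as "no latency at all".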

The audio devices identified by "lspci -vv" are as follows:

00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
	Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- 

Re: [PATCH 4.12 004/106] scsi: sg: fix SG_DXFER_FROM_DEV transfers

2017-08-10 Thread Chris Clayton


On 09/08/17 17:51, Greg Kroah-Hartman wrote:
> 4.12-stable review patch.  If anyone has any objections, please let me know.
> 
> ---
I repeat the comments I made when the patch was queued for stable:

1. Johannes' commit message says that the transfer must have a length bigger
than 0, so the code should return false if the length is less than or equal to
0, but the test is for less than 0.

2. But in any case, there's another patch that removes all this
sg_is_valid_dxfer() jiggery-pokery and replaces it with a simpler test. It
hasn't reached Linus' tree yet but is, I believe, cc'd to stable.


As Johannes said in response to the second of my comments, the patch that
replaces sg_is_valid_dxfer() with a simpler test is now in Linus' tree - commit
f930c7043663188429cd9b254e9d761edfc101ce. Without that change, I think there is
still some breakage in sg.
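
The mismatch in the first comment can be shown with a small stand-alone sketch.
It is not the kernel code: struct dxfer_hdr is a simplified stand-in for
sg_io_hdr_t, the two function names are invented for this illustration, and the
length is declared as a signed int purely so that the "< 0" test can be
exercised (if the real field is unsigned, that test could never fire).

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the fields the check looks at; NOT sg_io_hdr_t. */
struct dxfer_hdr {
    void *dxferp;   /* user buffer; may be NULL for the old read/write interface */
    int dxfer_len;  /* byte count; signed here only for the sake of the sketch */
};

/* The SG_DXFER_FROM_DEV test as written in the patch: reject only length < 0. */
static bool from_dev_valid_as_patched(const struct dxfer_hdr *hp)
{
    if (hp->dxfer_len < 0)
        return false;
    return true;
}

/* The test the commit message describes: the length must be bigger than 0. */
static bool from_dev_valid_as_described(const struct dxfer_hdr *hp)
{
    return hp->dxfer_len > 0;
}
```

The two versions differ only for a zero-length transfer, which the patched test
accepts and the described test rejects.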

Chris
---
> 
> From: Johannes Thumshirn <jthumsh...@suse.de>
> 
> commit 68c59fcea1f2c6a54c62aa896cc623c1b5bc9b47 upstream.
> 
> SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set
> it to NULL for the old sg_io read/write interface, but must have a
> length bigger than 0. This fixes a regression introduced by commit
> 28676d869bbb ("scsi: sg: check for valid direction before starting the
> request")
> 
> Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de>
> Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the 
> request")
> Reported-by: Chris Clayton <chris2...@googlemail.com>
> Tested-by: Chris Clayton <chris2...@googlemail.com>
> Cc: Douglas Gilbert <dgilb...@interlog.com>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Tested-by: Chris Clayton <chris2...@googlemail.com>
> Acked-by: Douglas Gilbert <dgilb...@interlog.com>
> Signed-off-by: Martin K. Petersen <martin.peter...@oracle.com>
> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> 
> ---
>  drivers/scsi/sg.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_
>  		if (hp->dxferp || hp->dxfer_len > 0)
>  			return false;
>  		return true;
> -	case SG_DXFER_TO_DEV:
>  	case SG_DXFER_FROM_DEV:
> +		if (hp->dxfer_len < 0)
> +			return false;
> +		return true;
> +	case SG_DXFER_TO_DEV:
>  	case SG_DXFER_TO_FROM_DEV:
>  		if (!hp->dxferp || hp->dxfer_len == 0)
>  			return false;
> 
> 


Re: [PATCH v2] scsi: sg: fix SG_DXFER_FROM_DEV transfers

2017-07-07 Thread Chris Clayton


On 07/07/17 09:56, Johannes Thumshirn wrote:
> SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set
> it to NULL for the old sg_io read/write interface, but must have a length
> bigger than 0. This fixes a regression introduced by commit 28676d869bbb
> ("scsi: sg: check for valid direction before starting the request")
> 

I've tested this new patch and the Nero applications can still find the optical 
drives on my laptop.

Tested-by: Chris Clayton <chris2...@googlemail.com>

> Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de>
> Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the 
> request")
> Reported-by: Chris Clayton <chris2...@googlemail.com>
> Tested-by: Chris Clayton <chris2...@googlemail.com>
> Cc: Douglas Gilbert <dgilb...@interlog.com>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> ---
> Changes to v1:
> * Fix breakage of the sg_io v3 interface, verified using sg_inq
> 
>  drivers/scsi/sg.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 21225d62b0c1..1e82d4128a84 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_t *hp)
>  		if (hp->dxferp || hp->dxfer_len > 0)
>  			return false;
>  		return true;
> -	case SG_DXFER_TO_DEV:
>  	case SG_DXFER_FROM_DEV:
> +		if (hp->dxfer_len < 0)
> +			return false;
> +		return true;
> +	case SG_DXFER_TO_DEV:
>  	case SG_DXFER_TO_FROM_DEV:
>  		if (!hp->dxferp || hp->dxfer_len == 0)
>  			return false;
> 


4.7.0-rc7+: Oops during boot with USB pen drive inserted

2016-07-21 Thread Chris Clayton
Hi,

With Linus' latest and greatest, I get an oops when I boot my laptop with a pen
drive inserted in any USB port. The oops message is:

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)

The oops seems to be 100% repeatable. If a USB pen drive is not inserted, the 
laptop boots successfully.

I've taken a photograph of the oops and you can view it at
http://s714.photobucket.com/user/chris2553/media/IMG_20160722_053841.jpg.html.

At the top of the picture, I notice that the partitions on my actual boot disk 
are being reported as being on /dev/sdb,
so it seems likely that, at this point, the pen drive is being seen as 
/dev/sda, although that has scrolled off the
screen. I don't boot via a ramdisk - my kernel has ext4 built in.  The grub2 
entry is:

menuentry "Krisux, Linux 4.7.0-rc7+" {
	insmod ext2
	set root=(hd0,2)

	linux   /boot/vmlinuz-4.7.0-rc7+ ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0

}

(BTW, Krisux is not a real distro - it's just the name I have given my Linux
From Scratch system.)
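
The (8,2) in the panic message is a major/minor device number. As a rough
sketch, assuming the conventional numbering for SCSI disks (major 8, 16 minors
per disk) and a helper name, sd_name, invented for this illustration, it
decodes as follows:

```c
#include <stdio.h>

/*
 * Hypothetical helper: map a device number under SCSI disk major 8 to its
 * conventional name. Returns NULL for anything outside that major.
 * The result lives in a static buffer, so it is overwritten by the next call.
 */
static const char *sd_name(unsigned int major, unsigned int minor)
{
    static char name[8];
    unsigned int disk, part;

    if (major != 8 || minor > 255)
        return NULL;

    disk = minor / 16;  /* 16 minors per disk: 0 = sda, 1 = sdb, ... */
    part = minor % 16;  /* 0 = whole disk, 1..15 = partitions */

    if (part == 0)
        snprintf(name, sizeof(name), "sd%c", (char)('a' + disk));
    else
        snprintf(name, sizeof(name), "sd%c%u", (char)('a' + disk), part);
    return name;
}
```

sd_name(8, 2) gives sda2, so the kernel was looking for the second partition of
whichever disk was registered first. That fits the observation above that the
real boot disk had become /dev/sdb; the same partition on the second disk would
be (8,18).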

The stack that is dumped is:

dump_stack
panic
printk
mount_block_root
prepare_namespace
kernel_init_freeable
kernel_init
ret_from_fork
rest_init

I realise I could work around this by specifying the boot partition by, say, 
its UUID, but I thought you would want me
to report this anyway.

Of course, I'm happy to provide any other information required and to test any 
fix, but I will be out and about for the
next 14-16 hours, so it will be later tonight or maybe even tomorrow before I 
can respond.

Chris



Re: 4.5 Regression - mouse not working after resume from suspend

2016-02-18 Thread Chris Clayton
I can only assume that although a non-working bluetooth mouse is a symptom of 
this regression, the silence of the
bluetooth folks is because the fault does not lie in the BT subsystem. 
Consequently, I'm transferring the problem back
to LKML in the hope that someone else can solve the problem.

On 15/02/16 23:40, Chris Clayton wrote:
> Hi,
> 
> Is there anything else I can do to help diagnose this regression.
> 
> To summarise my BT mouse does not work after resuming from suspend to disk or 
> ram. IT works perfectly in earlier 4.4,
> 4.3 and 4.2 kernels. I bisected and found the first bad commit is:
> 
> 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
> commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
> Author: Johan Hedberg 
> Date:   Wed Nov 25 16:15:44 2015 +0200
> 
> Bluetooth: Perform HCI update for power on synchronously
> 
> Johan requested additional information, which I provided. Checking the 
> archive at marc.info, it seems the mail didn't
> make it to the mailing list. Maybe it exceeded a size limit, I don't know. 
> Anyway I copied the mail to Johan and Marcel.
> 
> A bit more experimentation revealed that I can reactivate the mouse if I 
> restart the bluetooth daemon after the machine
> resumes.
> 
> Please let me know if I can provide anything else.
> 
> Thanks
> 
> Chris
> 
> On 06/02/16 15:23, Chris Clayton wrote:
>> Hi Johan,
>>
>> The information you requested has been captured from v4.5-rc2-340-g5af9c2e 
>> and is included below.
>>
>> On 06/02/16 14:33, Johan Hedberg wrote:
>>> Hi Chris,
>>>
>>> On Sat, Feb 06, 2016, Chris Clayton wrote:
>>>> On 06/02/16 11:38, Chris Clayton wrote:
>>>>> On 06/02/16 08:37, Chris Clayton wrote:
>>>>>> There seems to be a regression in resuming my laptop from a suspend
>>>>>> to RAM or disk. The symptom is that my bluetooth
>>>>>> mouse doesn't work after the resume. The kernel is built after a
>>>>>> pull of Linus' tree this morning (v4.5-rc2-340-g5af9c2e).
>>>>>>
>>>>>> Attached is the output from dmesg showing the boot, suspend (to
>>>>>> RAM) and resume. You'll see that during the resume,
>>>>>> error -517 is being reported for some devices. Suspend/resume has worked
>>>>>> perfectly with 4.[234].x kernels.
>>>>>>
>>>>>> I'll start a bisection, but thought I'd give a heads up in case
>>>>>> someone can see the problem before I get done with the
>>>>>> bisect.
>>>>>
>>>>> The bisection ended up at:
>>>>>
>>>>> 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
>>>>> commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
>>>>> Author: Johan Hedberg 
>>>>> Date:   Wed Nov 25 16:15:44 2015 +0200
>>>>>
>>>>> Bluetooth: Perform HCI update for power on synchronously
>>>>>
>>>>> The request to update HCI during power on is always coming either from
>>>>> hdev->req_workqueue or through an ioctl, so it's safe to use
>>>>> hci_req_sync for it. This way we also eliminate potential races with
>>>>> incoming mgmt commands or other actions while powering on.
>>>>>
>>>>> Part of this refactoring is the splitting of mgmt_powered() into
>>>>> mgmt_power_on() and __mgmt_power_off() functions. The main reason is
>>>>> the different requirements as far as hdev locking is concerned, as
>>>>> highlighted with the __ prefix of the power off API.
>>>>>
>>>>> Since the power on in the case of clearing the AUTO_OFF flag cannot be
>>>>> done synchronously in the set_powered mgmt handler, the hci_power_on
>>>>> work callback is extended to cover this (which also simplifies the
>>>>> set_powered helper a lot).
>>>>>
>>>>> Signed-off-by: Johan Hedberg 
>>>>> Signed-off-by: Marcel Holtmann 
>>>>>
>>>>> :040000 040000 a093d0be66f39f99c33a6a4725b2330ca9b41d03 a1eff79cec3ee7208e5aa200ab5069726bbeea8e M  include
>>>>> :040000 040000 d2d122193b33d45fcb9c2bc69f2024487a7528a0 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M  net
>>>>
>>>> I've just built a kernel at bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
>>>> (the commit prior to the one the bisect landed 

Re: 4.5 Regression - mouse not working after resume from suspend

2016-02-06 Thread Chris Clayton


On 06/02/16 11:38, Chris Clayton wrote:
> 
> 
> On 06/02/16 08:37, Chris Clayton wrote:
>> There seems to be a regression in resuming my laptop from a suspend to RAM 
>> or disk. The symptom is that my bluetooth
>> mouse doesn't work after the resume. The kernel is built after a pull of 
>> Linus' tree this morning (v4.5-rc2-340-g5af9c2e).
>>
>> Attached is the output from dmesg showing the boot, suspend (to RAM) and 
>> resume. You'll see that during the resume,
>> error -517 is being reported for some devices. Suspend/resume has worked
>> perfectly with 4.[234].x kernels.
>>
>> I'll start a bisection, but thought I'd give a heads up in case someone can 
>> see the problem before I get done with the
>> bisect.
>>
> 
> The bisection ended up at:
> 
> 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
> commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
> Author: Johan Hedberg 
> Date:   Wed Nov 25 16:15:44 2015 +0200
> 
> Bluetooth: Perform HCI update for power on synchronously
> 
> The request to update HCI during power on is always coming either from
> hdev->req_workqueue or through an ioctl, so it's safe to use
> hci_req_sync for it. This way we also eliminate potential races with
> incoming mgmt commands or other actions while powering on.
> 
> Part of this refactoring is the splitting of mgmt_powered() into
> mgmt_power_on() and __mgmt_power_off() functions. The main reason is
> the different requirements as far as hdev locking is concerned, as
> highlighted with the __ prefix of the power off API.
> 
> Since the power on in the case of clearing the AUTO_OFF flag cannot be
> done synchronously in the set_powered mgmt handler, the hci_power_on
> work callback is extended to cover this (which also simplifies the
> set_powered helper a lot).
> 
> Signed-off-by: Johan Hedberg 
> Signed-off-by: Marcel Holtmann 
> 
> :040000 040000 a093d0be66f39f99c33a6a4725b2330ca9b41d03 a1eff79cec3ee7208e5aa200ab5069726bbeea8e M  include
> :040000 040000 d2d122193b33d45fcb9c2bc69f2024487a7528a0 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M  net
> 
> 

I've just built a kernel at bf943cbf76ecd3b9838a80d5e08777b0f4ccc665 (the 
commit prior to the one the bisect landed on)
and my BT mouse works fine after a suspend/resume. With a kernel built at 
2ff13894cfb877cb3d02d96a8402202f0a6f3efd, the
mouse does not work after resume.

> The bisect log is:
> 
> git bisect start
> # bad: [5af9c2e19da6514a1a50b07d97d93b74a7711873] Merge branch 'akpm' 
> (patches from Andrew)
> git bisect bad 5af9c2e19da6514a1a50b07d97d93b74a7711873
> # good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
> git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc
> # bad: [63b6da39bb38e8f1a1ef3180d32a39d6baf9da84] perf: Fix 
> perf_event_exit_task() race
> git bisect bad 63b6da39bb38e8f1a1ef3180d32a39d6baf9da84
> # bad: [aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7] Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> git bisect bad aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7
> # good: [60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af] Merge tag 
> 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs
> git bisect good 60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af
> # bad: [a188222b6ed29404ac2d4232d35d1fe0e77af370] net: Rename 
> NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
> git bisect bad a188222b6ed29404ac2d4232d35d1fe0e77af370
> # good: [1343c65f70ee1b1f968a08b30e1836a4e37116cd] fm10k: always check 
> init_hw for errors
> git bisect good 1343c65f70ee1b1f968a08b30e1836a4e37116cd
> # good: [bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9] Merge branch 
> 'for-4.5-ancestor-test' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
> git bisect good bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9
> # good: [a4fcad656e1100bdda9b0b752b93a1a276810469] fm10k: whitespace cleanups
> git bisect good a4fcad656e1100bdda9b0b752b93a1a276810469
> # bad: [7302b9d90117496049dd4bfa28755f7c2ed55b27] ieee802154/adf7242: Driver 
> for ADF7242 MAC IEEE802154
> git bisect bad 7302b9d90117496049dd4bfa28755f7c2ed55b27
> # bad: [a0c38245153abe1fd844af9b166d1a5d5dafe7b1] Bluetooth: hci_intel: Use 
> shorter timeout for HCI commands
> git bisect bad a0c38245153abe1fd844af9b166d1a5d5dafe7b1
> # good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast 
> connectable code to hci_request.c
> git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
> # bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting 
> Configuration Field
> git bisect bad 742c59516822f4a4bc23b0961d88c569a7f1bf71
> # bad: [02c04afea93fbba7925984df455bc63e7d92da97] Bluetooth: Si

Re: 4.5 Regression - mouse not working after resume from suspend

2016-02-06 Thread Chris Clayton


On 06/02/16 08:37, Chris Clayton wrote:
> There seems to be a regression in resuming my laptop from a suspend to RAM or 
> disk. The symptom is that my bluetooth
> mouse doesn't work after the resume. The kernel is built after a pull of 
> Linus' tree this morning (v4.5-rc2-340-g5af9c2e).
> 
> Attached is the output from dmesg showing the boot, suspend (to RAM) and 
> resume. You'll see that during the resume,
> error -517 is being reported for some devices. Suspend/resume has worked
> perfectly with 4.[234].x kernels.
> 
> I'll start a bisection, but thought I'd give a heads up in case someone can 
> see the problem before I get done with the
> bisect.
> 

The bisection ended up at:

2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
Author: Johan Hedberg 
Date:   Wed Nov 25 16:15:44 2015 +0200

Bluetooth: Perform HCI update for power on synchronously

The request to update HCI during power on is always coming either from
hdev->req_workqueue or through an ioctl, so it's safe to use
hci_req_sync for it. This way we also eliminate potential races with
incoming mgmt commands or other actions while powering on.

Part of this refactoring is the splitting of mgmt_powered() into
mgmt_power_on() and __mgmt_power_off() functions. The main reason is
the different requirements as far as hdev locking is concerned, as
highlighted with the __ prefix of the power off API.

Since the power on in the case of clearing the AUTO_OFF flag cannot be
done synchronously in the set_powered mgmt handler, the hci_power_on
work callback is extended to cover this (which also simplifies the
set_powered helper a lot).

Signed-off-by: Johan Hedberg 
Signed-off-by: Marcel Holtmann 

:040000 040000 a093d0be66f39f99c33a6a4725b2330ca9b41d03 a1eff79cec3ee7208e5aa200ab5069726bbeea8e M  include
:040000 040000 d2d122193b33d45fcb9c2bc69f2024487a7528a0 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M  net


The bisect log is:

git bisect start
# bad: [5af9c2e19da6514a1a50b07d97d93b74a7711873] Merge branch 'akpm' (patches from Andrew)
git bisect bad 5af9c2e19da6514a1a50b07d97d93b74a7711873
# good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc
# bad: [63b6da39bb38e8f1a1ef3180d32a39d6baf9da84] perf: Fix perf_event_exit_task() race
git bisect bad 63b6da39bb38e8f1a1ef3180d32a39d6baf9da84
# bad: [aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7
# good: [60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af] Merge tag 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs
git bisect good 60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af
# bad: [a188222b6ed29404ac2d4232d35d1fe0e77af370] net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
git bisect bad a188222b6ed29404ac2d4232d35d1fe0e77af370
# good: [1343c65f70ee1b1f968a08b30e1836a4e37116cd] fm10k: always check init_hw for errors
git bisect good 1343c65f70ee1b1f968a08b30e1836a4e37116cd
# good: [bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9] Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect good bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9
# good: [a4fcad656e1100bdda9b0b752b93a1a276810469] fm10k: whitespace cleanups
git bisect good a4fcad656e1100bdda9b0b752b93a1a276810469
# bad: [7302b9d90117496049dd4bfa28755f7c2ed55b27] ieee802154/adf7242: Driver for ADF7242 MAC IEEE802154
git bisect bad 7302b9d90117496049dd4bfa28755f7c2ed55b27
# bad: [a0c38245153abe1fd844af9b166d1a5d5dafe7b1] Bluetooth: hci_intel: Use shorter timeout for HCI commands
git bisect bad a0c38245153abe1fd844af9b166d1a5d5dafe7b1
# good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast connectable code to hci_request.c
git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
# bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting Configuration Field
git bisect bad 742c59516822f4a4bc23b0961d88c569a7f1bf71
# bad: [02c04afea93fbba7925984df455bc63e7d92da97] Bluetooth: Simplify read_adv_features code
git bisect bad 02c04afea93fbba7925984df455bc63e7d92da97
# bad: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously
git bisect bad 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
# first bad commit: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously
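
For anyone wanting to repeat the bisection from a copy of this mail, the log first has to be reduced to its command lines, since mail clients wrap the long comment lines and replies add quote markers. A minimal sketch, assuming the mail text is fed in on stdin; the function name is made up for illustration and is not part of git:

```shell
#!/bin/sh
# Read a pasted bisect log on stdin, strip any leading mail quote markers
# ("> ", ">> ", ...) and print only the "git bisect ..." command lines.
bisect_commands() {
    sed -E 's/^(> ?)+//' | grep -E '^git bisect (start|good|bad|skip)( |$)'
}
```

The result could then be saved, for example with `bisect_commands < mail.txt > bisect.log`, and passed to `git bisect replay bisect.log` inside a kernel tree.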


Just shout if you need any additional diagnostics.


> Chris
> 


4.5 Regression - mouse not working after resume from suspend

2016-02-06 Thread Chris Clayton
There seems to be a regression in resuming my laptop from a suspend to RAM or 
disk. The symptom is that my bluetooth
mouse doesn't work after the resume. The kernel is built after a pull of Linus' 
tree this morning (v4.5-rc2-340-g5af9c2e).

Attached is the output from dmesg showing the boot, suspend (to RAM) and 
resume. You'll see that during the resume,
error -517 is being reported for some devices. Suspend/resume has worked 
perfectly with 4.[234].x kernels.
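
For context, 517 is not a userspace errno value. In the kernel sources (include/linux/errno.h) it is EPROBE_DEFER, which a driver returns to ask for its probe to be retried later. A rough sketch for pulling the affected lines out of a saved log, assuming the log is plain text on stdin; the function name is hypothetical:

```shell
#!/bin/sh
# Print every log line on stdin that mentions error -517 (EPROBE_DEFER).
# The trailing group keeps longer numbers such as -5170 from matching.
probe_defer_lines() {
    grep -E -e '-517([^0-9]|$)'
}
```

Typical use would be `dmesg | probe_defer_lines`, or the same with a saved dmesg file redirected to stdin.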

I'll start a bisection, but thought I'd give a heads up in case someone can see 
the problem before I get done with the
bisect.

Chris
[0.00] Linux version 4.5.0-rc2+ (chris@laptop) (gcc version 5.3.1 20160202 (GCC) ) #318 SMP PREEMPT Sat Feb 6 06:38:55 GMT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.5.0-rc2+ ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable
[0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable
[0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved
[0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable
[0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved
[0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable
[0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved
[0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable
[0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Notebook W65_67SZ /W65_67SZ, BIOS 1.03.05 02/26/2014
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C write-protect
[0.00]   D-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 7C write-back
[0.00]   1 base 04 mask 7FE000 write-back
[0.00]   2 base 00E000 mask 7FE000 uncachable
[0.00]   3 base 00DE00 mask 7FFE00 uncachable
[0.00]   4 base 00DD00 mask 7FFF00 uncachable
[0.00]   5 base 041FE0 mask 7FFFE0 uncachable
[0.00]   6 disabled
[0.00]   7 disabled
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- WT  
[0.00] e820: update [mem 0xdd00-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [880fd820]
[0.00] Base memory trampoline at [88097000] 97000 size 24576
[0.00] Using GB pages for direct mapping
[0.00] BRK [0x01a05000, 0x01a05fff] PGTABLE
[0.00] BRK [0x01a06000, 0x01a06fff] PGTABLE
[0.00] BRK [0x01a07000, 0x01a07fff] PGTABLE
[0.00] BRK [0x01a08000, 0x01a08fff] PGTABLE
[0.00] BRK [0x01a09000, 0x01a09fff] 

Re: EXT4: new warnings from 4.3.0-rc2

2015-09-21 Thread Chris Clayton


On 09/21/15 15:55, Chris Clayton wrote:
> Thanks Ortwin.
> 
> On 09/21/15 14:27, Ortwin Glück wrote:
>>> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature 
>>> incompatibilities
>>> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature 
>>> incompatibilities
>>
>> As the kernel doesn't know which FS your root is, it tries the whole list of 
>> filesystems (init/do_mounts.c
>> mount_block_root()). Since the removal of ext3, the ext4 code is now 
>> responsible for mounting ext3. Since your FS is
>> ext4 and not ext3, the probe for ext3 fails. That's what the message tells 
>> you. You get these even in previous kernels
>> if you say N to ext3 during config.
>>
> No, I do not get the messages from 4.2.0 even though it is configured the 
> same as 4.3.0-rc3 as far as EXT{2,3,4} is
> concerned:
> 
> # CONFIG_EXT2_FS is not set
> # CONFIG_EXT3_FS is not set
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_USE_FOR_EXT2=y
> # CONFIG_EXT4_FS_POSIX_ACL is not set
> # CONFIG_EXT4_FS_SECURITY is not set
> # CONFIG_EXT4_ENCRYPTION is not set
> # CONFIG_EXT4_DEBUG is not set
> [chris:~/kernel/linux]$ cd ../linux-4.2.0/
> [chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config
> # CONFIG_EXT2_FS is not set
> # CONFIG_EXT3_FS is not set
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_USE_FOR_EXT23=y
> # CONFIG_EXT4_FS_POSIX_ACL is not set
> # CONFIG_EXT4_FS_SECURITY is not set
> # CONFIG_EXT4_ENCRYPTION is not set
> # CONFIG_EXT4_DEBUG is not set
> 
> That's why I said they are new messages.
> 
> I've just booted 4.1.7 and I get the messages from that kernel too. I wonder 
> if there's a recent fix that has made it
> into 4.1.7, but not into 4.2.0. I'll apply Greg's 4.2.1-rc1 patch and see 
> what I get with that.
> 

Applying the 4.2.1-rc1 patch results in a kernel that emits the messages, so I 
guess my fix-not-yet-in-4.2 theory is right.

I'll just ignore the messages. Sorry for the noise.
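
Ortwin's hint, quoted below, is to pass rootfstype=ext4 so that the kernel does not have to probe for the filesystem type. A small sketch for checking whether a given command line already carries that parameter; the function name is hypothetical and the command line is passed in as a single string:

```shell
#!/bin/sh
# Print the value of rootfstype= from a kernel command line string.
# Exits non-zero and prints nothing if the parameter is absent.
# The function body runs in a subshell so that "set -f" stays local.
rootfstype_of() (
    set -f   # no filename expansion while splitting the string into words
    for arg in $1; do
        case "$arg" in
            rootfstype=*) printf '%s\n' "${arg#rootfstype=}"; exit 0 ;;
        esac
    done
    exit 1
)
```

On a running system it could be called as `rootfstype_of "$(cat /proc/cmdline)"`.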

> Chris
> 
> 
>> If it bugs you, you can add a hint to your kernel command line: 
>> rootfstype=ext4
>>
>> Ortwin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
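
The two .config listings quoted in this message can be compared mechanically. The sketch below is only an illustration: the helper names `ext_options` and `config_diff` are made up here, and the first listing is assumed to come from the current (4.3.0-rc) tree because it precedes the `cd ../linux-4.2.0/` command. It reports the one EXT symbol whose name differs between the listings and says nothing about whether that difference explains the messages.

```python
import re

# First listing in the message (assumed to be the current 4.3.0-rc tree).
CONFIG_CURRENT = """\
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
"""

# Second listing, taken after `cd ../linux-4.2.0/`.
CONFIG_4_2_0 = """\
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
"""


def ext_options(config_text):
    """Map each CONFIG_EXT2/3/4 symbol to its value ('n' if 'is not set')."""
    opts = {}
    for line in config_text.splitlines():
        line = line.strip()
        set_match = re.match(r"^(CONFIG_EXT[234]\w*)=(.*)$", line)
        if set_match:
            opts[set_match.group(1)] = set_match.group(2)
            continue
        unset_match = re.match(r"^# (CONFIG_EXT[234]\w*) is not set$", line)
        if unset_match:
            opts[unset_match.group(1)] = "n"
    return opts


def config_diff(old_text, new_text):
    """Return {symbol: (old_value, new_value)} for symbols that differ."""
    old, new = ext_options(old_text), ext_options(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


if __name__ == "__main__":
    for symbol, (old, new) in config_diff(CONFIG_4_2_0, CONFIG_CURRENT).items():
        print(f"{symbol}: 4.2.0={old} current={new}")
```

A value of `None` in the result means the symbol does not appear in that listing at all.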


Re: EXT4: new warnings from 4.3.0-rc2

2015-09-21 Thread Chris Clayton
Thanks Ortwin.

On 09/21/15 14:27, Ortwin Glück wrote:
>> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature 
>> incompatibilities
>> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature 
>> incompatibilities
> 
> As the kernel doesn't know which FS your root is, it tries the whole list of 
> filesystems (init/do_mounts.c
> mount_block_root()). Since the removal of ext3, now the ext4 code is 
> responsible for mounting ext3. Since your FS is
> ext4 and not ext3, the probe for ext3 fails. That's what the message tells 
> you. You get these even in previous kernels
> if you say N to ext3 during config.
> 
No, I do not get the messages from 4.2.0 even though it is configured the same 
as 4.3.0-rc3 as far as EXT{2,3,4} is
concerned:

# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
[chris:~/kernel/linux]$ cd ../linux-4.2.0/
[chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set

That's why I said they are new messages.

I've just booted 4.1.7 and I get the messages from that kernel too. I wonder if 
there's a recent fix that has made it
into 4.1.7, but not into 4.2.0. I'll apply Greg's 4.2.1-rc1 patch and see what 
I get with that.

Chris


> If it bugs you, you can add a hint to your kernel command line: 
> rootfstype=ext4
> 
> Ortwin
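
Ortwin's description of the root-filesystem probe can be pictured with a small model. This is a simplified sketch, not the code in init/do_mounts.c: the function name `probe_root`, the candidate order, and the log strings are assumptions made for illustration. It shows why an ext4 root produces failed ext3 and ext2 attempts first, and why a `rootfstype=ext4` hint avoids them.

```python
def probe_root(fs_list, actual_fs, rootfstype=None):
    """Try each candidate filesystem type in turn.

    fs_list:    candidate types in the order they are tried (assumed order).
    actual_fs:  the type the root partition really has.
    rootfstype: optional hint; when given, only that type is tried.
    Returns (mounted_type_or_None, list_of_log_lines).
    """
    candidates = [rootfstype] if rootfstype else list(fs_list)
    log = []
    for fs in candidates:
        if fs == actual_fs:
            log.append(f"mounted as {fs}")
            return fs, log
        # A failed probe is what produces the "couldn't mount as ..." lines.
        log.append(f"couldn't mount as {fs}")
    return None, log


if __name__ == "__main__":
    # Without a hint: two failed probes, then success.
    print(probe_root(["ext3", "ext2", "ext4"], "ext4"))
    # With the hint from the message: a single successful attempt.
    print(probe_root(["ext3", "ext2", "ext4"], "ext4", rootfstype="ext4"))
```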


EXT4: new warnings from 4.3.0-rc2

2015-09-21 Thread Chris Clayton
Hi,

I've just built and booted 4.3.0-rc2 and I'm seeing the following new messages 
on the console during boot up:

[2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature 
incompatibilities
[2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature 
incompatibilities

They are immediately followed by:

[2.507948] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: 
(null)
[3.549523] EXT4-fs (sda2): re-mounted. Opts: (null)

and they are the messages I normally see (from 4.2.0 and earlier).

sda2 is my root partition and is mounted OK, so my system is operating as 
before, but I thought you would want a heads
up about these (slightly alarming) new console messages.

The output from dmesg is attached, in case it helps.

Chris
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 4.3.0-rc2 (chris@laptop) (gcc version 5.2.1 
20150915 (GCC) ) #251 SMP PREEMPT Mon Sep 21 07:50:42 BST 2015
[0.00] Command line: root=/dev/sda2 ro resume=/dev/sda6
[0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 0x340 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable
[0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable
[0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved
[0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable
[0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved
[0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable
[0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved
[0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable
[0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C write-protect
[0.00]   D-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 7C write-back
[0.00]   1 base 04 mask 7FE000 write-back
[0.00]   2 base 00E000 mask 7FE000 uncachable
[0.00]   3 base 00DE00 mask 7FFE00 uncachable
[0.00]   4 base 00DD00 mask 7FFF00 uncachable
[0.00]   5 base 041FE0 mask 7FFFE0 uncachable
[0.00]   6 disabled
[0.00]   7 disabled
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- WT  
[0.00] e820: update [mem 0xdd00-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at 
[880fd820]
[0.00] Base memory trampoline at [88097000] 97000 size 24576
[0.00] Using GB pages for direct mapping
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 

Re: [PATCH] iommu: prompt for IOMMU_IO_PGTABLE_LPAE on ARM archs only

2015-02-16 Thread Chris Clayton


On 02/16/15 16:32, Will Deacon wrote:
> Hi Chris,
> 
> On Sun, Feb 15, 2015 at 11:17:19AM +0000, Chris Clayton wrote:
>> When running "make oldconfig" for an x86_64 kernel, I was prompted for a
>> setting for IOMMU_IO_PGTABLE_LPAE. From the prompt and the help text it
>> appears that this config item is relevant to ARMv7/v8 only. This patch
>> prevents the prompt on non-ARM architectures. Compile tested building a
>> cross-compiled x86_64 kernel in an x86 user space. The resultant kernel
>> boots fine and I am running it now.
>>
>> Fixes: e1d3c0fd701df831169b116cd5c5d6203ac07f70
>> Cc: will.dea...@arm.com
>> Signed-off-by: Chris Clayton 
>>
>> --- linux/drivers/iommu/Kconfig.orig	2015-02-15 09:44:01.235927248 +
>> +++ linux/drivers/iommu/Kconfig	2015-02-15 09:44:41.131926434 +
>> @@ -22,6 +22,7 @@ config IOMMU_IO_PGTABLE
>>
>>  config IOMMU_IO_PGTABLE_LPAE
>>  	bool "ARMv7/v8 Long Descriptor Format"
>> +	depends on ARM || ARM64
>>  	select IOMMU_IO_PGTABLE
>>  	help
>>  	  Enable support for the ARM long descriptor pagetable format.
> 
> What's the problem with this? The page-table code is intentionally
> decoupled from the CPU architecture and having this boot-tested on x86
> found some real bugs that I'm currently fixing. Sure, you probably don't
> need this on your box, but it's not default y and you don't have to
> select it.
> 

There's no real problem except that, as I said, the prompt and the help text 
suggest that the config is relevant to the ARM
architecture only. When it popped up on x86_64, it was a surprise.

As you say, I can simply answer "N", but the prompt and the help need 
correcting, because for an ordinary Joe User like
me, it's misleading.
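
The effect the proposed `depends on ARM || ARM64` line is meant to have can be sketched as a simple visibility check. This is a deliberately reduced model, not how Kconfig is implemented: real Kconfig expressions also support `&&`, `!`, and tristate values, and the function name `prompt_visible` is hypothetical.

```python
def prompt_visible(depends_on, enabled_symbols):
    """Decide whether a config prompt would be shown.

    depends_on:      symbols OR-ed together; an empty list means no dependency.
    enabled_symbols: set of symbols enabled for the target architecture.
    """
    if not depends_on:
        # No "depends on" line: the prompt appears on every architecture.
        return True
    return any(symbol in enabled_symbols for symbol in depends_on)


if __name__ == "__main__":
    # Before the patch: no dependency, so an x86_64 oldconfig asks the question.
    print(prompt_visible([], {"X86_64"}))
    # After the patch: hidden on x86_64, still shown on arm64.
    print(prompt_visible(["ARM", "ARM64"], {"X86_64"}))
    print(prompt_visible(["ARM", "ARM64"], {"ARM64"}))
```

Will's objection in the quoted text is that hiding the option this way would also remove the x86 build and boot coverage he found useful.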


> Will
> 


[PATCH] iommu: prompt for IOMMU_IO_PGTABLE_LPAE on ARM archs only

2015-02-15 Thread Chris Clayton
When running "make oldconfig" for an x86_64 kernel, I was prompted for a 
setting for IOMMU_IO_PGTABLE_LPAE. From the
prompt and the help text it appears that this config item is relevant to 
ARMv7/v8 only. This patch prevents the prompt
on non-ARM architectures. Compile tested building a cross-compiled x86_64 
kernel in an x86 user space. The resultant
kernel boots fine and I am running it now.

Fixes: e1d3c0fd701df831169b116cd5c5d6203ac07f70
Cc: will.dea...@arm.com
Signed-off-by: Chris Clayton 

--- linux/drivers/iommu/Kconfig.orig	2015-02-15 09:44:01.235927248 +
+++ linux/drivers/iommu/Kconfig	2015-02-15 09:44:41.131926434 +
@@ -22,6 +22,7 @@ config IOMMU_IO_PGTABLE

 config IOMMU_IO_PGTABLE_LPAE
 	bool "ARMv7/v8 Long Descriptor Format"
+	depends on ARM || ARM64
 	select IOMMU_IO_PGTABLE
 	help
 	  Enable support for the ARM long descriptor pagetable format.


Re: BUG in 3.19.0-rc3+

2015-01-11 Thread Chris Clayton
Thanks Konstantin.

[snip]

>>>
>>> Looks like degree (%edx) is 1 on anon-vma destruction.
>>> Probably I've overlooked some weird corner case in vma splitting/merging.
>>>
>>> Could you try this patch? It disables vma merging and eliminates half
>>> of the complicated paths.
>>> As I see it, merging is optional; everything should work fine without it.
>>>
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -1048,7 +1048,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>>  	 * We later require that vma->vm_flags == vm_flags,
>>>  	 * so this tests vma->vm_flags & VM_SPECIAL, too.
>>>  	 */
>>> -	if (vm_flags & VM_SPECIAL)
>>> +	if (1)
>>>  		return NULL;
>>>
>>>  	if (prev)
>>>
>>>
>>>
>>>
>>> Code from your oops.
>>>
>>> Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 8b 45 10 48 8d 55 10 48
>>> 83 e8 10 49 39 d6 74 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e <0f>
>>> 0b 0f 1f 40 00 e8 6b fc ff ff eb 9a 66 0f 1f 84 00 00 00 00
>>> All code
>>> ========
>>>    0: 00 ad de 48 89 43    add    %ch,0x438948de(%rbp)
>>>    6: 18 e8                sbb    %ch,%al
>>>    8: c5 f9 00             (bad)
>>>    b: 00 48 8b             add    %cl,-0x75(%rax)
>>>    e: 45 10 48 8d          adc    %r9b,-0x73(%r8)
>>>   12: 55                   push   %rbp
>>>   13: 10 48 83             adc    %cl,-0x7d(%rax)
>>>   16: e8 10 49 39 d6       callq  0xd639492b
>>>   1b: 74 54                je     0x71
>>>   1d: 48 8b 7d 08          mov    0x8(%rbp),%rdi
>>>   21: 48 89 eb             mov    %rbp,%rbx
>>>   24: 8b 57 34             mov    0x34(%rdi),%edx
>>>   27: 85 d2                test   %edx,%edx
>>>   29: 74 9e                je     0xffc9
>>>   2b:* 0f 0b               ud2    <-- trapping instruction
>>>   2d: 0f 1f 40 00          nopl   0x0(%rax)
>>>   31: e8 6b fc ff ff       callq  0xfca1
>>>   36: eb 9a                jmp    0xffd2
>>>   38: 66                   data16
>>>   39: 0f                   .byte 0xf
>>>   3a: 1f                   (bad)
>>>   3b: 84 00                test   %al,(%rax)
>>>   3d: 00 00                add    %al,(%rax)
>>> ...
>>>
>>> Code starting with the faulting instruction
>>> ===========================================
>>>    0: 0f 0b                ud2
>>>    2: 0f 1f 40 00          nopl   0x0(%rax)
>>>    6: e8 6b fc ff ff       callq  0xfc76
>>>    b: eb 9a                jmp    0xffa7
>>>    d: 66                   data16
>>>    e: 0f                   .byte 0xf
>>>    f: 1f                   (bad)
>>>   10: 84 00                test   %al,(%rax)
>>>   12: 00 00                add    %al,(%rax)
>>>
>>>
>>> +Added Oded Gabbay into cc, he's reported this problem too.
>>>
>> Thanks for the fast reply.
>>
>> I applied the patch and tested it. I wasn't able to reproduce *my* problem,
>> so you are definitely in the right direction :)
> 
> Ok. I've found something. Try patch from attachment.
> 

Your patch has fixed the BUG for me. Thank you.

Tested-by: Chris Clayton 
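
The `Code:` line in the oops quoted above marks the byte at the faulting instruction pointer with angle brackets, which is how the disassembly arrives at offset 0x2b for the trapping `ud2`. The sketch below is only an illustration of that lookup; the function name `trapping_offset` is made up, and the byte string is copied from the quoted oops.

```python
# "Code:" bytes as quoted in the oops (the leading "Code: " label removed).
OOPS_CODE = (
    "00 ad de 48 89 43 18 e8 c5 f9 00 00 48 8b 45 10 48 8d 55 10 48 "
    "83 e8 10 49 39 d6 74 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e <0f> "
    "0b 0f 1f 40 00 e8 6b fc ff ff eb 9a 66 0f 1f 84 00 00 00 00"
)


def trapping_offset(code_line):
    """Return (offset, byte_values) where offset is the index of the <xx> byte.

    byte_values is the full list of byte values with the marker stripped.
    """
    offset = None
    values = []
    for index, token in enumerate(code_line.split()):
        if token.startswith("<") and token.endswith(">"):
            offset = index
            token = token[1:-1]
        values.append(int(token, 16))
    if offset is None:
        raise ValueError("no <xx> marker found in Code: line")
    return offset, values


if __name__ == "__main__":
    offset, values = trapping_offset(OOPS_CODE)
    first_two = " ".join(f"{b:02x}" for b in values[offset:offset + 2])
    print(f"trapping byte at offset {offset:#x}: {first_two}")
```

The two bytes at that offset, `0f 0b`, are the x86 encoding of `ud2`, the instruction a kernel `BUG()` compiles to on x86, which matches the "kernel BUG at mm/rmap.c:399!" line in the report.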

>>
>> Oded
>>
>>>>
>>>> In case it helps, I've attached the xz-compressed related config file.
>>>>
>>>> Chris
>>>>
>>>>>
>>>>> I've attached the full kernel log file for that boot.
>>>>>
>>>>> Chris
>>>>>


Re: BUG in 3.19.0-rc3+

2015-01-11 Thread Chris Clayton


On 01/11/15 09:52, Oded Gabbay wrote:
> 
> 
> On 01/11/2015 11:37 AM, Konstantin Khlebnikov wrote:
>> On Sun, Jan 11, 2015 at 11:16 AM, Chris Clayton wrote:
>>> Hi,
>>>
>>> I've done the bisect and the outcome is below, but, because I almost always 
>>> forget to mention it, I'll say here that I
>>> am running a 32 bit user space on a 64 bit kernel.
>>>
>>> On 01/10/15 20:17, Chris Clayton wrote:
>>>> Hi,
>>>>
>>>> I'm getting a BUG report from a kernel built from a pull (earlier 
>>>> today) of the current development kernel
>>>> (running git describe gives v3.19-rc3-169-geb74926). So that I have 
>>>> useable wireless networking, I have also applied the
>>>> latest seven iwlwifi patches from the wireless-drivers git tree. Prior to 
>>>> today's pull, I was not seeing anything
>>>> unusual in dmesg.
>>>>
>>>> The BUG reported is as follows:
>>>>
>>>> Jan 10 19:41:32 laptop kernel: [ cut here ]
>>>> Jan 10 19:41:32 laptop kernel: kernel BUG at mm/rmap.c:399!
>>>> Jan 10 19:41:32 laptop kernel: invalid opcode:  [#1] PREEMPT SMP
>>>> Jan 10 19:41:32 laptop kernel: Modules linked in: rfcomm snd_hda_codec_via 
>>>> iwlmvm coretemp snd_hda_codec_hdmi
>>>> snd_hda_codec_generic snd_hda_intel mac80211 hwmon snd_hda_controller 
>>>> x86_pkg_temp_thermal acpi_cpufreq iwlwifi cfg80211
>>>> snd_hda_codec snd_hwdep
>>>> Jan 10 19:41:32 laptop kernel: CPU: 1 PID: 353 Comm: fc-cache Not tainted 
>>>> 3.19.0-rc3+ #42
>>>> Jan 10 19:41:32 laptop kernel: Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
>>>> Jan 10 19:41:32 laptop kernel: task: 8800da98c5c0 ti: 880408dd4000 
>>>> task.ti: 880408dd4000
>>>> Jan 10 19:41:32 laptop kernel: RIP: 0010:[]  
>>>> [] unlink_anon_vmas+0x17a/0x200
>>>> Jan 10 19:41:33 laptop kernel: RSP: 0018:880408dd7d88  EFLAGS: 00010286
>>>> Jan 10 19:41:33 laptop kernel: RAX: 88040b79e150 RBX: 88040b79e140 
>>>> RCX: 
>>>> Jan 10 19:41:33 laptop kernel: RDX: 0001 RSI: 880409f04360 
>>>> RDI: 880409f04320
>>>> Jan 10 19:41:33 laptop kernel: RBP: 88040cb13278 R08:  
>>>> R09: 88040d801c00
>>>> Jan 10 19:41:33 laptop kernel: R10: 88041fa546e0 R11: 88040b79e160 
>>>> R12: 880409f04320
>>>> Jan 10 19:41:33 laptop kernel: R13: 88040cb13278 R14: 88040cb13288 
>>>> R15: 88040cb13210
>>>> Jan 10 19:41:33 laptop kernel: FS:  () 
>>>> GS:88041fa4() knlGS:
>>>> Jan 10 19:41:33 laptop kernel: CS:  0010 DS: 002b ES: 002b CR0: 
>>>> 80050033
>>>> Jan 10 19:41:33 laptop kernel: CR2: f722c8d4 CR3: 0004082a8000 
>>>> CR4: 001407e0
>>>> Jan 10 19:41:33 laptop kernel: Stack:
>>>> Jan 10 19:41:33 laptop kernel:  88040d6cfbd8 88040d6cfba0 
>>>> 88040cecd160 88040cb13210
>>>> Jan 10 19:41:33 laptop kernel:  88040cbbb630 f7151000 
>>>> 880408dd7e28 
>>>> Jan 10 19:41:33 laptop kernel:   810e3633 
>>>>  
>>>> Jan 10 19:41:33 laptop kernel: Call Trace:
>>>> Jan 10 19:41:33 laptop kernel:  [] ? 
>>>> free_pgtables+0x83/0xf0
>>>> Jan 10 19:41:34 laptop kernel:  [] ? exit_mmap+0xc3/0x150
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> __do_page_fault+0x17d/0x4b0
>>>> Jan 10 19:41:34 laptop kernel:  [] ? mmput+0x21/0xc0
>>>> Jan 10 19:41:34 laptop kernel:  [] ? do_exit+0x26d/0xa50
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> mntput_no_expire+0x9/0x140
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> task_work_run+0xbc/0xf0
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> do_group_exit+0x34/0xb0
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> SyS_exit_group+0xf/0x10
>>>> Jan 10 19:41:34 laptop kernel:  [] ? 
>>>> sysenter_dispatch+0x7/0x1e
>>>> Jan 10 19:41:34 laptop kernel: Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 
>>>> 48 8b 45 10 48 8d 55 10 48 83 e8 10 49 39

Re: BUG in 3.19.0-rc3+

2015-01-11 Thread Chris Clayton
Hi,

I've done the bisect and the outcome is below, but, because I almost always 
forget to mention it, I'll say here that I
am running a 32 bit user space on a 64 bit kernel.
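
For readers unfamiliar with bisecting, the idea is a binary search over the commits between a known-good and a known-bad kernel. The sketch below is a simplification under stated assumptions: it treats history as a straight line (real `git bisect` works on the commit graph), and the commit names and the `is_bad` test are hypothetical stand-ins for "build, boot, and look for the BUG".

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    commits: list ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    is_bad:  callable that tests one commit (e.g. build and boot it).
    Returns (first_bad_commit, number_of_commits_tested).
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid       # the culprit is at mid or earlier
        else:
            good = mid      # the culprit is after mid
    return commits[bad], tested


if __name__ == "__main__":
    # Hypothetical history of 170 commits where commit c100 introduced the bug.
    history = [f"c{i}" for i in range(170)]
    culprit, tested = first_bad(history, lambda c: int(c[1:]) >= 100)
    print(culprit, "found after testing", tested, "kernels")
```

Each test halves the remaining range, so a range of this size needs only a handful of builds rather than one per commit.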

On 01/10/15 20:17, Chris Clayton wrote:
> Hi,
> 
> I'm getting a BUG report from a kernel built from a pull (earlier 
> today) of the current development kernel
> (running git describe gives v3.19-rc3-169-geb74926). So that I have useable 
> wireless networking, I have also applied the
> latest seven iwlwifi patches from the wireless-drivers git tree. Prior to 
> today's pull, I was not seeing anything
> unusual in dmesg.
> 
> The BUG reported is as follows:
> 
> Jan 10 19:41:32 laptop kernel: [ cut here ]
> Jan 10 19:41:32 laptop kernel: kernel BUG at mm/rmap.c:399!
> Jan 10 19:41:32 laptop kernel: invalid opcode:  [#1] PREEMPT SMP
> Jan 10 19:41:32 laptop kernel: Modules linked in: rfcomm snd_hda_codec_via 
> iwlmvm coretemp snd_hda_codec_hdmi
> snd_hda_codec_generic snd_hda_intel mac80211 hwmon snd_hda_controller 
> x86_pkg_temp_thermal acpi_cpufreq iwlwifi cfg80211
> snd_hda_codec snd_hwdep
> Jan 10 19:41:32 laptop kernel: CPU: 1 PID: 353 Comm: fc-cache Not tainted 
> 3.19.0-rc3+ #42
> Jan 10 19:41:32 laptop kernel: Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
> Jan 10 19:41:32 laptop kernel: task: 8800da98c5c0 ti: 880408dd4000 
> task.ti: 880408dd4000
> Jan 10 19:41:32 laptop kernel: RIP: 0010:[]  
> [] unlink_anon_vmas+0x17a/0x200
> Jan 10 19:41:33 laptop kernel: RSP: 0018:880408dd7d88  EFLAGS: 00010286
> Jan 10 19:41:33 laptop kernel: RAX: 88040b79e150 RBX: 88040b79e140 
> RCX: 
> Jan 10 19:41:33 laptop kernel: RDX: 0001 RSI: 880409f04360 
> RDI: 880409f04320
> Jan 10 19:41:33 laptop kernel: RBP: 88040cb13278 R08:  
> R09: 88040d801c00
> Jan 10 19:41:33 laptop kernel: R10: 88041fa546e0 R11: 88040b79e160 
> R12: 880409f04320
> Jan 10 19:41:33 laptop kernel: R13: 88040cb13278 R14: 88040cb13288 
> R15: 88040cb13210
> Jan 10 19:41:33 laptop kernel: FS:  () 
> GS:88041fa4() knlGS:
> Jan 10 19:41:33 laptop kernel: CS:  0010 DS: 002b ES: 002b CR0: 
> 80050033
> Jan 10 19:41:33 laptop kernel: CR2: f722c8d4 CR3: 0004082a8000 
> CR4: 001407e0
> Jan 10 19:41:33 laptop kernel: Stack:
> Jan 10 19:41:33 laptop kernel:  88040d6cfbd8 88040d6cfba0 
> 88040cecd160 88040cb13210
> Jan 10 19:41:33 laptop kernel:  88040cbbb630 f7151000 
> 880408dd7e28 
> Jan 10 19:41:33 laptop kernel:   810e3633 
>  
> Jan 10 19:41:33 laptop kernel: Call Trace:
> Jan 10 19:41:33 laptop kernel:  [] ? free_pgtables+0x83/0xf0
> Jan 10 19:41:34 laptop kernel:  [] ? exit_mmap+0xc3/0x150
> Jan 10 19:41:34 laptop kernel:  [] ? 
> __do_page_fault+0x17d/0x4b0
> Jan 10 19:41:34 laptop kernel:  [] ? mmput+0x21/0xc0
> Jan 10 19:41:34 laptop kernel:  [] ? do_exit+0x26d/0xa50
> Jan 10 19:41:34 laptop kernel:  [] ? 
> mntput_no_expire+0x9/0x140
> Jan 10 19:41:34 laptop kernel:  [] ? task_work_run+0xbc/0xf0
> Jan 10 19:41:34 laptop kernel:  [] ? do_group_exit+0x34/0xb0
> Jan 10 19:41:34 laptop kernel:  [] ? SyS_exit_group+0xf/0x10
> Jan 10 19:41:34 laptop kernel:  [] ? 
> sysenter_dispatch+0x7/0x1e
> Jan 10 19:41:34 laptop kernel: Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 
> 8b 45 10 48 8d 55 10 48 83 e8 10 49 39 d6 74
> 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e <0f> 0b 0f 1f 40 00 e8 6b fc ff 
> ff eb 9a 66 0f 1f 84 00 00 00 00
> Jan 10 19:41:34 laptop kernel: RIP  [] 
> unlink_anon_vmas+0x17a/0x200
> Jan 10 19:41:34 laptop kernel:  RSP 
> Jan 10 19:41:34 laptop kernel: ---[ end trace 4aa713b2a9aa664b ]---
> Jan 10 19:41:34 laptop kernel: Fixing recursive fault but reboot is needed!
> Jan 10 19:41:34 laptop kernel: nf_conntrack version 0.5.0 (16384 buckets, 
> 65536 max)

[snip]

> 
> I won't get time tonight, but I can bisect it tomorrow, so this is just a 
> heads up in case the problem (and fix) jumps
> out at anyone.  Before I bisect I'll build and run a kernel without the 
> iwlwifi patches.

The bisect ended up at:

7a3ef208e662f4b63d43a23f61a64a129c525bbc is the first bad commit
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc
Author: Konstantin Khlebnikov 
Date:   Thu Jan 8 14:32:15 2015 -0800

mm: prevent endless growth of anon_vma hierarchy

Constantly forking task causes unlimited grow of anon_vma chain.  Each
next child allocates new level of anon_vmas and links vma to all

Re: BUG in 3.19.0-rc3+

2015-01-11 Thread Chris Clayton
Hi,

I've done the bisect and the outcome is below. Because I almost always
forget to mention it, I'll say up front that I am running a 32-bit user
space on a 64-bit kernel.

On 01/10/15 20:17, Chris Clayton wrote:
 Hi,
 
 I'm getting a BUG report from a kernel built from a pull (earlier 
 today) of the current development kernel
 (running git describe gives v3.19-rc3-169-geb74926). So that I have useable 
 wireless networking, I have also applied the
 latest seven iwlwifi patches from the wireless-drivers git tree. Prior to 
 today's pull, I was not seeing anything
 unusual in dmesg.
 
 The BUG reported is as follows:
 
 Jan 10 19:41:32 laptop kernel: ------------[ cut here ]------------
 Jan 10 19:41:32 laptop kernel: kernel BUG at mm/rmap.c:399!
 Jan 10 19:41:32 laptop kernel: invalid opcode:  [#1] PREEMPT SMP
 Jan 10 19:41:32 laptop kernel: Modules linked in: rfcomm snd_hda_codec_via 
 iwlmvm coretemp snd_hda_codec_hdmi
 snd_hda_codec_generic snd_hda_intel mac80211 hwmon snd_hda_controller 
 x86_pkg_temp_thermal acpi_cpufreq iwlwifi cfg80211
 snd_hda_codec snd_hwdep
 Jan 10 19:41:32 laptop kernel: CPU: 1 PID: 353 Comm: fc-cache Not tainted 
 3.19.0-rc3+ #42
 Jan 10 19:41:32 laptop kernel: Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
 Jan 10 19:41:32 laptop kernel: task: 8800da98c5c0 ti: 880408dd4000 
 task.ti: 880408dd4000
 Jan 10 19:41:32 laptop kernel: RIP: 0010:[810ef7ea]  
 [810ef7ea] unlink_anon_vmas+0x17a/0x200
 Jan 10 19:41:33 laptop kernel: RSP: 0018:880408dd7d88  EFLAGS: 00010286
 Jan 10 19:41:33 laptop kernel: RAX: 88040b79e150 RBX: 88040b79e140 
 RCX: 
 Jan 10 19:41:33 laptop kernel: RDX: 0001 RSI: 880409f04360 
 RDI: 880409f04320
 Jan 10 19:41:33 laptop kernel: RBP: 88040cb13278 R08:  
 R09: 88040d801c00
 Jan 10 19:41:33 laptop kernel: R10: 88041fa546e0 R11: 88040b79e160 
 R12: 880409f04320
 Jan 10 19:41:33 laptop kernel: R13: 88040cb13278 R14: 88040cb13288 
 R15: 88040cb13210
 Jan 10 19:41:33 laptop kernel: FS:  () 
 GS:88041fa4() knlGS:
 Jan 10 19:41:33 laptop kernel: CS:  0010 DS: 002b ES: 002b CR0: 
 80050033
 Jan 10 19:41:33 laptop kernel: CR2: f722c8d4 CR3: 0004082a8000 
 CR4: 001407e0
 Jan 10 19:41:33 laptop kernel: Stack:
 Jan 10 19:41:33 laptop kernel:  88040d6cfbd8 88040d6cfba0 
 88040cecd160 88040cb13210
 Jan 10 19:41:33 laptop kernel:  88040cbbb630 f7151000 
 880408dd7e28 
 Jan 10 19:41:33 laptop kernel:   810e3633 
  
 Jan 10 19:41:33 laptop kernel: Call Trace:
 Jan 10 19:41:33 laptop kernel:  [810e3633] ? free_pgtables+0x83/0xf0
 Jan 10 19:41:34 laptop kernel:  [810ec3c3] ? exit_mmap+0xc3/0x150
 Jan 10 19:41:34 laptop kernel:  [8103980d] ? 
 __do_page_fault+0x17d/0x4b0
 Jan 10 19:41:34 laptop kernel:  [81042a21] ? mmput+0x21/0xc0
 Jan 10 19:41:34 laptop kernel:  [8104673d] ? do_exit+0x26d/0xa50
 Jan 10 19:41:34 laptop kernel:  [8111fe89] ? 
 mntput_no_expire+0x9/0x140
 Jan 10 19:41:34 laptop kernel:  [8105ca1c] ? task_work_run+0xbc/0xf0
 Jan 10 19:41:34 laptop kernel:  [81047d44] ? do_group_exit+0x34/0xb0
 Jan 10 19:41:34 laptop kernel:  [81047dcf] ? SyS_exit_group+0xf/0x10
 Jan 10 19:41:34 laptop kernel:  [815e0f9f] ? 
 sysenter_dispatch+0x7/0x1e
 Jan 10 19:41:34 laptop kernel: Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 
 8b 45 10 48 8d 55 10 48 83 e8 10 49 39 d6 74
 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e 0f 0b 0f 1f 40 00 e8 6b fc ff 
 ff eb 9a 66 0f 1f 84 00 00 00 00
 Jan 10 19:41:34 laptop kernel: RIP  [810ef7ea] 
 unlink_anon_vmas+0x17a/0x200
 Jan 10 19:41:34 laptop kernel:  RSP 880408dd7d88
 Jan 10 19:41:34 laptop kernel: ---[ end trace 4aa713b2a9aa664b ]---
 Jan 10 19:41:34 laptop kernel: Fixing recursive fault but reboot is needed!
 Jan 10 19:41:34 laptop kernel: nf_conntrack version 0.5.0 (16384 buckets, 
 65536 max)

[snip]

 
 I won't get time tonight, but I can bisect it tomorrow, so this is just a 
 heads up in case the problem (and fix) jumps
 out at anyone.  Before I bisect I'll build and run a kernel without the 
 iwlwifi patches.

The bisect ended up at:

7a3ef208e662f4b63d43a23f61a64a129c525bbc is the first bad commit
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc
Author: Konstantin Khlebnikov koc...@gmail.com
Date:   Thu Jan 8 14:32:15 2015 -0800

mm: prevent endless growth of anon_vma hierarchy

Constantly forking task causes unlimited grow of anon_vma chain.  Each
next child allocates new level of anon_vmas and links vma to all
previous levels because pages might be inherited from any level.
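
As a rough illustration of the growth described above, here is a toy model in Python. It is not kernel code. It assumes each fork copies the parent's list of anon_vma levels and then appends one new level for the child; the name fork_chain and the list representation are made up for this sketch.

```python
def fork_chain(generations):
    """Toy model of anon_vma chain growth in a chain of forks.

    Hypothetical helper, not a kernel API. Returns the number of
    anon_vma levels the vma is linked to in each generation,
    starting with the original task (generation 0).
    """
    # The original task's vma is linked only to its own anon_vma (level 0).
    levels = [0]
    lengths = [len(levels)]
    for gen in range(1, generations + 1):
        # The child stays linked to every previous level, because pages
        # might be inherited from any of them, and adds a level of its own.
        levels = levels + [gen]
        lengths.append(len(levels))
    return lengths


if __name__ == "__main__":
    print(fork_chain(5))
```

Under these assumptions fork_chain(5) returns [1, 2, 3, 4, 5, 6]: the chain gains one level per fork and never shrinks, which is the unlimited growth the commit message refers to.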

This patch adds heuristic which decides
