Re: linux-5.10.11 build failure
Hi Greg,

On 29/01/2021 15:14, Josh Poimboeuf wrote:
> On Fri, Jan 29, 2021 at 12:09:53PM +0100, Greg Kroah-Hartman wrote:
>> On Fri, Jan 29, 2021 at 11:03:26AM +0000, Chris Clayton wrote:
>>>
>>>
>>> On 29/01/2021 10:11, Greg Kroah-Hartman wrote:
>>>> On Thu, Jan 28, 2021 at 10:00:15AM -0600, Josh Poimboeuf wrote:
...
>>>>
>>>> It is in Linus's tree now :)
>>>>
>>>> Now grabbed.
>>>>
>>>
>>> Are you sure, Greg? I don't see the patch in Linus' tree at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. Nor do
>>> I see it in your stable queue at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/.
>>> For clarity, I've attached the patch which fixes the problem I reported
>>> and is currently sat in
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git. As I
>>> understand it, the patch is scheduled to be included in a pull request to
>>> Linus this weekend in time for -rc6.
>>>
>>> In fact, I did a pull from Linus' tree a few minutes ago and the build
>>> failed in the way I reported in this thread. I added the patch and the
>>> build now succeeds.
>>
>> Ok, sorry, no, I grabbed 1d489151e9f9 ("objtool: Don't fail on missing
>> symbol table") which is what Josh asked me to take. I got that confused
>> here.
>
> I'm probably responsible for that confusion, I got mixed up myself.
> It'll be a good idea to take both anyway.
>

The patch is now in Linus' tree at 5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae

Thanks.

Chris
Re: linux-5.10.11 build failure
On 28/01/2021 15:52, Josh Poimboeuf wrote:
> On Thu, Jan 28, 2021 at 11:24:47AM +0000, Thomas Backlund wrote:
>> On 28.1.2021 at 12:05, Chris Clayton wrote:
>>>
>>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>>> Hi,
>>>>>
>>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>>
>>>>> ..
>>>>>
>>>>>   AS      arch/x86/entry/thunk_64.o
>>>>>   CC      arch/x86/entry/vsyscall/vsyscall_64.o
>>>>>   AS      arch/x86/realmode/rm/header.o
>>>>>   CC      arch/x86/mm/pat/set_memory.o
>>>>>   CC      arch/x86/events/amd/core.o
>>>>>   CC      arch/x86/kernel/fpu/init.o
>>>>>   CC      arch/x86/entry/vdso/vma.o
>>>>>   CC      kernel/sched/core.o
>>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>>
>>>>>   AS      arch/x86/realmode/rm/trampoline_64.o
>>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>>> make[2]: *** Waiting for unfinished jobs
>>>>>
>>>>> ..
>>>>>
>>>>> Compiler is latest snapshot of gcc-10.
>>>>>
>>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>>
>>>> Can you do 'git bisect' to track down the offending commit?
>>>>
>>>
>>> Sure, but I'll hold that request for a while. I updated to binutils-2.36 on
>>> Monday and I'm pretty sure that is a feature of this build fail. I've
>>> reverted binutils to 2.35.1, and the build succeeds. Updated to 2.36 again
>>> and, surprise, surprise, the kernel build fails again.
>>>
>>> I've had a glance at the binutils ML and there are all sorts of issues
>>> being reported, but it's beyond my knowledge to assess if this build error
>>> is related to any of them.
>>>
>>> I'll stick with binutils-2.35.1 for the time being.
>>>
>>>> And what exact gcc version are you using?
>>>>
>>>
>>> It's built from the 10-20210123 snapshot tarball.
>>>
>>> I can report this to the binutils folks, but might it be better if the
>>> objtool maintainer looks at it first? The binutils change might just have
>>> opened the gate to a bug in objtool.
>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>>
>>>
>>
>>
>> AFAIK you need this in stable trees:
>>
>> From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
>> From: Josh Poimboeuf
>> Date: Thu, 14 Jan 2021 16:14:01 -0600
>> Subject: [PATCH] objtool: Don't fail on missing symbol table
>
> Actually I think you need:
>
>   5e6dca82bcaa ("x86/entry: Emit a symbol for register restoring thunk")
>
> I submitted a patch to stable list a few days ago.
>

Yes, that's what I concluded, Josh. 5.10.11 builds with that patch added but it's not in Linus's tree yet, so, as I understand it, it is not yet a candidate for stable.

> (Though it's possible you need both commits, I'm not sure if binutils
> 2.36 has the symbol stripping stuff)
>
Re: linux-5.10.11 build failure
On 28/01/2021 14:41, Greg Kroah-Hartman wrote:
> On Thu, Jan 28, 2021 at 01:38:25PM +0000, Chris Clayton wrote:
>> Thanks, Thomas.
>>
>> On 28/01/2021 11:24, Thomas Backlund wrote:
>>> On 28.1.2021 at 12:05, Chris Clayton wrote:
>>>>
>>>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>>>
>>>>>> ..
>>>>>>
>>>>>>   AS      arch/x86/entry/thunk_64.o
>>>>>>   CC      arch/x86/entry/vsyscall/vsyscall_64.o
>>>>>>   AS      arch/x86/realmode/rm/header.o
>>>>>>   CC      arch/x86/mm/pat/set_memory.o
>>>>>>   CC      arch/x86/events/amd/core.o
>>>>>>   CC      arch/x86/kernel/fpu/init.o
>>>>>>   CC      arch/x86/entry/vdso/vma.o
>>>>>>   CC      kernel/sched/core.o
>>>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>>>
>>>>>>   AS      arch/x86/realmode/rm/trampoline_64.o
>>>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>>>> make[2]: *** Waiting for unfinished jobs
>>>>>>
>>>>>> ..
>>>>>>
>>>>>> Compiler is latest snapshot of gcc-10.
>>>>>>
>>>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>>>
>>>>> Can you do 'git bisect' to track down the offending commit?
>>>>>
>>>>
>>>> Sure, but I'll hold that request for a while. I updated to binutils-2.36
>>>> on Monday and I'm pretty sure that is a feature of this build fail. I've
>>>> reverted binutils to 2.35.1, and the build succeeds. Updated to 2.36
>>>> again and, surprise, surprise, the kernel build fails again.
>>>>
>>>> I've had a glance at the binutils ML and there are all sorts of issues
>>>> being reported, but it's beyond my knowledge to assess if this build
>>>> error is related to any of them.
>>>>
>>>> I'll stick with binutils-2.35.1 for the time being.
>>>>
>>>>> And what exact gcc version are you using?
>>>>>
>>>>
>>>> It's built from the 10-20210123 snapshot tarball.
>>>>
>>>> I can report this to the binutils folks, but might it be better if the
>>>> objtool maintainer looks at it first? The binutils change might just
>>>> have opened the gate to a bug in objtool.
>>>>
>>>>> thanks,
>>>>>
>>>>> greg k-h
>>>>>
>>>>
>>>
>>>
>>> AFAIK you need this in stable trees:
>>>
>>> From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
>>> From: Josh Poimboeuf
>>> Date: Thu, 14 Jan 2021 16:14:01 -0600
>>> Subject: [PATCH] objtool: Don't fail on missing symbol table
>>>
>>>
>>
>> That may be the case, but it doesn't fix the build failure I've reported in
>> this thread. However, as suggested by Tor,
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae
>> does fix it.
>>
>> That hasn't made Linus' tree yet and I don't see a pull request, but it is
>> in linux-next so I guess it could make it in -rc6.
>
> Ok, thanks, so this is not a new regression for 5.10.y.
>

That seems to be the case, Greg. Neither 5.10.10 nor 5.10.9 builds either.

> greg k-h
>
Re: linux-5.10.11 build failure
Thanks, Thomas.

On 28/01/2021 11:24, Thomas Backlund wrote:
> On 28.1.2021 at 12:05, Chris Clayton wrote:
>>
>> On 28/01/2021 09:34, Greg Kroah-Hartman wrote:
>>> On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote:
>>>> Hi,
>>>>
>>>> Building 5.10.11 fails on my (x86-64) laptop thusly:
>>>>
>>>> ..
>>>>
>>>>   AS      arch/x86/entry/thunk_64.o
>>>>   CC      arch/x86/entry/vsyscall/vsyscall_64.o
>>>>   AS      arch/x86/realmode/rm/header.o
>>>>   CC      arch/x86/mm/pat/set_memory.o
>>>>   CC      arch/x86/events/amd/core.o
>>>>   CC      arch/x86/kernel/fpu/init.o
>>>>   CC      arch/x86/entry/vdso/vma.o
>>>>   CC      kernel/sched/core.o
>>>> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e
>>>>
>>>>   AS      arch/x86/realmode/rm/trampoline_64.o
>>>> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
>>>> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
>>>> make[2]: *** Waiting for unfinished jobs
>>>>
>>>> ..
>>>>
>>>> Compiler is latest snapshot of gcc-10.
>>>>
>>>> Happy to test the fix but please cc me as I'm not subscribed
>>>
>>> Can you do 'git bisect' to track down the offending commit?
>>>
>>
>> Sure, but I'll hold that request for a while. I updated to binutils-2.36 on
>> Monday and I'm pretty sure that is a feature of this build fail. I've
>> reverted binutils to 2.35.1, and the build succeeds. Updated to 2.36 again
>> and, surprise, surprise, the kernel build fails again.
>>
>> I've had a glance at the binutils ML and there are all sorts of issues being
>> reported, but it's beyond my knowledge to assess if this build error is
>> related to any of them.
>>
>> I'll stick with binutils-2.35.1 for the time being.
>>
>>> And what exact gcc version are you using?
>>>
>>
>> It's built from the 10-20210123 snapshot tarball.
>>
>> I can report this to the binutils folks, but might it be better if the
>> objtool maintainer looks at it first? The binutils change might just have
>> opened the gate to a bug in objtool.
>>
>>> thanks,
>>>
>>> greg k-h
>>>
>>
>
>
> AFAIK you need this in stable trees:
>
> From 1d489151e9f9d1647110277ff77282fe4d96d09b Mon Sep 17 00:00:00 2001
> From: Josh Poimboeuf
> Date: Thu, 14 Jan 2021 16:14:01 -0600
> Subject: [PATCH] objtool: Don't fail on missing symbol table
>
>

That may be the case, but it doesn't fix the build failure I've reported in this thread. However, as suggested by Tor,
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=5e6dca82bcaa49348f9e5fcb48df4881f6d6c4ae
does fix it.

That hasn't made Linus' tree yet and I don't see a pull request, but it is in linux-next so I guess it could make it in -rc6.

Chris

> --
> Thomas
>
Re: linux-5.10.11 build failure
On 28/01/2021 09:34, Greg Kroah-Hartman wrote: > On Thu, Jan 28, 2021 at 09:17:10AM +0000, Chris Clayton wrote: >> Hi, >> >> Building 5.10.11 fails on my (x86-64) laptop thusly: >> >> .. >> >> AS arch/x86/entry/thunk_64.o >> CC arch/x86/entry/vsyscall/vsyscall_64.o >> AS arch/x86/realmode/rm/header.o >> CC arch/x86/mm/pat/set_memory.o >> CC arch/x86/events/amd/core.o >> CC arch/x86/kernel/fpu/init.o >> CC arch/x86/entry/vdso/vma.o >> CC kernel/sched/core.o >> arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at >> offset 0x3e >> >> AS arch/x86/realmode/rm/trampoline_64.o >> make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error >> 255 >> make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o' >> make[2]: *** Waiting for unfinished jobs >> >> .. >> >> Compiler is latest snapshot of gcc-10. >> >> Happy to test the fix but please cc me as I'm not subscribed > > Can you do 'git bisect' to track down the offending commit? > Sure, but I'll hold that request for a while. I updated to binutils-2.36 on Monday and I'm pretty sure that is a feature of this build fail. I've reverted binutils to 2.35.1, and the build succeeds. Updated to 2.36 again and, surprise, surprise, the kernel build fails again. I've had a glance at the binutils ML and there are all sorts of issues being reported, but it's beyond my knowledge to assess if this build error is related to any of them. I'll stick with binutils-2.35.1 for the time being. > And what exact gcc version are you using? > It's built from the 10-20210123 snapshot tarball. I can report this to the binutils folks, but might it be better if the objtool maintainer looks at it first? The binutils change might just have opened the gate to a bug in objtool. > thanks, > > greg k-h > Thanks. Chris
linux-5.10.11 build failure
Hi,

Building 5.10.11 fails on my (x86-64) laptop thusly:

..

  AS      arch/x86/entry/thunk_64.o
  CC      arch/x86/entry/vsyscall/vsyscall_64.o
  AS      arch/x86/realmode/rm/header.o
  CC      arch/x86/mm/pat/set_memory.o
  CC      arch/x86/events/amd/core.o
  CC      arch/x86/kernel/fpu/init.o
  CC      arch/x86/entry/vdso/vma.o
  CC      kernel/sched/core.o
arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e

  AS      arch/x86/realmode/rm/trampoline_64.o
make[2]: *** [scripts/Makefile.build:360: arch/x86/entry/thunk_64.o] Error 255
make[2]: *** Deleting file 'arch/x86/entry/thunk_64.o'
make[2]: *** Waiting for unfinished jobs

..

Compiler is latest snapshot of gcc-10.

Happy to test the fix but please cc me as I'm not subscribed

Thanks,

Chris
Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()
Hi Greg, On 18/09/2020 15:35, Chris Clayton wrote: > Mmm, gmail on android seems to have snuck some html into my reply, so here > goes again... > > On 14/09/2020 16:58, Greg KH wrote: >> On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote: >>> Hi Greg and Arnd, >>> >>> On 24/08/2020 04:00, ricky...@realtek.com wrote: >>>> From: Ricky Wu >>>> >>>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform >>>> missing card reader >>>> >>> >>> In his changelog above, Ricky didn't mention that this patch fixes a >>> regression that was introduced (in 5.1) by commit >>> bede03a579b3. >>> >>> The patch that I posted to LKML contained the appropriate Fixes, etc tags. >>> After discussion, the patch was changed to >>> remove the code that effectively disables the RTS5229 cardreader on (at >>> least some) Intel NUC boxes. I prepared the >>> patch that Ricky submitted but he didn't include my Signed-off-by or the >>> Fixes tag. I think the following needs to be >>> added to the changelog. >>> >>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a >>> rts5260") >>> Link: https://marc.info/?l=linux-kernel=159105912832257 >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003 >>> Signed-off-by: Chris Clayton >>> >>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card >>> reader on the Intel NUC6CAYH box. >>> >>> My main point, however, is that the patch is also needed in the 5.4 >>> (longterm) and 5.8 (stable) series kernels. >> >> It's too late to change the commit log now that it is in my tree, but >> once it hits Linus's tree for 5.9-rc1, I can backport it to those stable >> trees if someone reminds me :) >> This is the reminder you suggested. The patch is now in Linus's tree and the commit id is 551b6729578a8981c46af964c10bf7d5d9ddca83. Chris > > Thanks, Greg. I'll send the reminder. > > Chris >> thanks, >> >> greg k-h >>
Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()
Mmm, gmail on android seems to have snuck some html into my reply, so here goes again... On 14/09/2020 16:58, Greg KH wrote: > On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote: >> Hi Greg and Arnd, >> >> On 24/08/2020 04:00, ricky...@realtek.com wrote: >>> From: Ricky Wu >>> >>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform >>> missing card reader >>> >> >> In his changelog above, Ricky didn't mention that this patch fixes a >> regression that was introduced (in 5.1) by commit >> bede03a579b3. >> >> The patch that I posted to LKML contained the appropriate Fixes, etc tags. >> After discussion, the patch was changed to >> remove the code that effectively disables the RTS5229 cardreader on (at >> least some) Intel NUC boxes. I prepared the >> patch that Ricky submitted but he didn't include my Signed-off-by or the >> Fixes tag. I think the following needs to be >> added to the changelog. >> >> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a >> rts5260") >> Link: https://marc.info/?l=linux-kernel=159105912832257 >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003 >> Signed-off-by: Chris Clayton >> >> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card >> reader on the Intel NUC6CAYH box. >> >> My main point, however, is that the patch is also needed in the 5.4 >> (longterm) and 5.8 (stable) series kernels. > > It's too late to change the commit log now that it is in my tree, but > once it hits Linus's tree for 5.9-rc1, I can backport it to those stable > trees if someone reminds me :) > Thanks, Greg. I'll send the reminder. Chris > thanks, > > greg k-h >
Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()
Thanks Bjorn. On 13/09/2020 17:49, Bjorn Helgaas wrote: > On Sun, Sep 13, 2020 at 09:40:56AM +0100, Chris Clayton wrote: >> Hi Greg and Arnd, >> >> On 24/08/2020 04:00, ricky...@realtek.com wrote: >>> From: Ricky Wu >>> >>> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform >>> missing card reader >> >> In his changelog above, Ricky didn't mention that this patch fixes a >> regression that was introduced (in 5.1) by commit bede03a579b3. >> >> The patch that I posted to LKML contained the appropriate Fixes, etc >> tags. After discussion, the patch was changed to remove the code >> that effectively disables the RTS5229 cardreader on (at least some) >> Intel NUC boxes. I prepared the patch that Ricky submitted but he >> didn't include my Signed-off-by or the Fixes tag. I think the >> following needs to be added to the changelog. >> >> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a >> rts5260") >> Link: https://marc.info/?l=linux-kernel=159105912832257 > > Better lore link: > > Link: > https://lore.kernel.org/r/CACYmiSer8FA+qjh8NHZJ2maxSd-=rwddz2f7_-e4um1nxuz...@mail.gmail.com/ > > But I'm not sure the above is the most relevant. Seems like the one > below is more to the point since it has the exact patch below and is > part of a thread developing it: > > Link: > https://lore.kernel.org/r/20da8b4b-8426-9568-c0f1-4d1c2006c...@googlemail.com/ > Yes, I meant to change the quote to that thread but ... more haste less speed. >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003 >> Signed-off-by: Chris Clayton >> >> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express >> card reader on the Intel NUC6CAYH box. >> >> My main point, however, is that the patch is also needed in the 5.4 >> (longterm) and 5.8 (stable) series kernels. > > This would be accomplished by: > > Cc: sta...@vger.kernel.org# v5.1+ > Thanks for the tip. 
I'm about to set off on a 4-day journey, so I'll send an updated patch when I return on Friday, >>> Signed-off-by: Ricky Wu >>> --- >>> drivers/misc/cardreader/rtsx_pcr.c | 4 >>> 1 file changed, 4 deletions(-) >>> >>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c >>> b/drivers/misc/cardreader/rtsx_pcr.c >>> index 37ccc67f4914..3a4a7b0cc098 100644 >>> --- a/drivers/misc/cardreader/rtsx_pcr.c >>> +++ b/drivers/misc/cardreader/rtsx_pcr.c >>> @@ -1155,10 +1155,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *pcr) >>> rtsx_pci_write_register(pcr, REG_OCPGLITCH, >>> SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch); >>> rtsx_pci_enable_ocp(pcr); >>> - } else { >>> - /* OC power down */ >>> - rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, >>> - OC_POWER_DOWN); >>> } >>> } >>> } >>>
Re: [PATCH] misc: rtsx: do not setting OC_POWER_DOWN reg in rtsx_pci_init_ocp()
Hi Greg and Arnd,

On 24/08/2020 04:00, ricky...@realtek.com wrote:
> From: Ricky Wu
>
> this power saving action in rtsx_pci_init_ocp() cause INTEL-NUC6 platform
> missing card reader
>

In his changelog above, Ricky didn't mention that this patch fixes a regression that was introduced (in 5.1) by commit bede03a579b3.

The patch that I posted to LKML contained the appropriate Fixes, etc tags. After discussion, the patch was changed to remove the code that effectively disables the RTS5229 cardreader on (at least some) Intel NUC boxes. I prepared the patch that Ricky submitted but he didn't include my Signed-off-by or the Fixes tag. I think the following needs to be added to the changelog.

Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
Link: https://marc.info/?l=linux-kernel&m=159105912832257
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
Signed-off-by: Chris Clayton

bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card reader on the Intel NUC6CAYH box disabled.

My main point, however, is that the patch is also needed in the 5.4 (longterm) and 5.8 (stable) series kernels.

Thanks.

> Signed-off-by: Ricky Wu
> ---
>  drivers/misc/cardreader/rtsx_pcr.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
> index 37ccc67f4914..3a4a7b0cc098 100644
> --- a/drivers/misc/cardreader/rtsx_pcr.c
> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> @@ -1155,10 +1155,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *pcr)
>  			rtsx_pci_write_register(pcr, REG_OCPGLITCH,
>  				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
>  			rtsx_pci_enable_ocp(pcr);
> -		} else {
> -			/* OC power down */
> -			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
> -				OC_POWER_DOWN);
>  		}
>  	}
>  }
>
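As a rough illustration of what the diff above changes, here is a minimal userspace sketch. It is not driver code: every `mock_*` name, the one-entry register file and the `MOCK_OC_POWER_DOWN` bit value are assumptions invented for the example, and the masked-write behaviour of the register helper is only assumed from the (address, mask, data) shape of the call shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real FPDCTL address and OC_POWER_DOWN bit
 * live in the kernel headers and are NOT reproduced here. */
enum { MOCK_FPDCTL = 0, MOCK_NUM_REGS = 1 };
#define MOCK_OC_POWER_DOWN 0x02u

struct mock_pcr {
	bool ocp_en;                   /* mirrors option->ocp_en from the thread */
	bool ocp_enabled;              /* stand-in for rtsx_pci_enable_ocp() having run */
	uint8_t regs[MOCK_NUM_REGS];   /* mock register file */
};

/* Assumed semantics: only the bits selected by mask are updated. */
static void mock_write_register(struct mock_pcr *pcr, unsigned int reg,
				uint8_t mask, uint8_t data)
{
	assert(reg < (unsigned int)MOCK_NUM_REGS);
	pcr->regs[reg] = (uint8_t)((pcr->regs[reg] & ~mask) | (data & mask));
}

/* Shape of the function before the patch: the else branch powers OC down. */
static void mock_init_ocp_old(struct mock_pcr *pcr)
{
	if (pcr->ocp_en) {
		pcr->ocp_enabled = true;
	} else {
		/* OC power down */
		mock_write_register(pcr, MOCK_FPDCTL, MOCK_OC_POWER_DOWN,
				    MOCK_OC_POWER_DOWN);
	}
}

/* Shape of the function after the patch: the else branch is gone. */
static void mock_init_ocp_new(struct mock_pcr *pcr)
{
	if (pcr->ocp_en)
		pcr->ocp_enabled = true;
}

static bool mock_oc_powered_down(const struct mock_pcr *pcr)
{
	return (pcr->regs[MOCK_FPDCTL] & MOCK_OC_POWER_DOWN) != 0;
}
```

With `ocp_en` unset, `mock_init_ocp_old()` sets the power-down bit while `mock_init_ocp_new()` leaves the register untouched, which is the behavioural difference this thread attributes to the patch.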
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
Hi Ricky.

On 05/08/2020 13:48, Chris Clayton wrote:
> Hi Ricky
>>
>> Ah, OK. I'll prepare the patch and send it to you once I've tested it.
>>
>
> Please see the patch included below. As you suggested, it removes the code
> that does the OC power down on card readers that are not members of your A
> series. The patch is against a fresh pull of Linus's tree this morning
> (v5.8-2768-g4da9f3302615).
>
> I've tested the resultant kernel on my Intel NUC6CAYH box (which contains an
> NUC66AYB board) and the rts5229 works fine. I've also tested it on my laptop
> which also has a card reader supported by the rtsx_pci driver (an RTL8411B)
> and that also works fine.
>
> As I mentioned yesterday, I think it's a candidate for the 5.4 and 5.7 stable
> series.
>
> Thanks for your patience!
>
> Chris
>
> Signed-off-by: Chris Clayton
>
> --- a/drivers/misc/cardreader/rtsx_pcr.c	2020-08-05 07:10:21.752072515 +0100
> +++ b/drivers/misc/cardreader/rtsx_pcr.c	2020-08-05 07:11:05.568074278 +0100
> @@ -1172,10 +1172,6 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *
>  			rtsx_pci_write_register(pcr, REG_OCPGLITCH,
>  				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
>  			rtsx_pci_enable_ocp(pcr);
> -		} else {
> -			/* OC power down */
> -			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
> -				OC_POWER_DOWN);
>  		}
>  	}
>  }
>
>

Is there some problem with my patch? If you are too busy to deal with it, perhaps I can submit it directly to Greg/Arnd.

Thanks

Chris
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
Hi Ricky On 05/08/2020 06:51, Chris Clayton wrote: > Thanks, Ricky. > > On 05/08/2020 03:35, 吳昊澄 Ricky wrote: >> >> >>> -Original Message- >>> From: Chris Clayton [mailto:chris2...@googlemail.com] >>> Sent: Tuesday, August 04, 2020 7:52 PM >>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org >>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann >>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader >>> on >>> Intel NUC boxes >>> >>> >>> >>> On 04/08/2020 11:46, 吳昊澄 Ricky wrote: >>>>> -Original Message- >>>>> From: Chris Clayton [mailto:chris2...@googlemail.com] >>>>> Sent: Tuesday, August 04, 2020 4:51 PM >>>>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org >>>>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann >>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card >>>>> reader on >>>>> Intel NUC boxes >>>>> >>>>> >>>>> >>>>> On 04/08/2020 09:08, 吳昊澄 Ricky wrote: >>>>>>> -Original Message- >>>>>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org] >>>>>>> Sent: Tuesday, August 04, 2020 3:49 PM >>>>>>> To: 吳昊澄 Ricky >>>>>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com; >>>>> Arnd >>>>>>> Bergmann >>>>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card >>>>>>> reader >>> on >>>>>>> Intel NUC boxes >>>>>>> >>>>>>> On Tue, Aug 04, 2020 at 02:44:41AM +, 吳昊澄 Ricky wrote: >>>>>>>> Hi Chris, >>>>>>>> >>>>>>>> rtsx_pci_write_register(pcr, FPDTL, OC_POWER_DOWN, >>>>> OC_POWER_DOWN); >>>>>>>> This register operation saved power under 1mA, so if do not care the >>>>>>>> 1mA >>>>>>> power we can have a patch to remove it, make compatible with NUC6 >>>>>>>> We tested others our card reader that remove it, we did not see any >>>>>>>> side >>>>> effect >>>>>>>> >>>>>>>> Hi Greg k-h, >>>>>>>> >>>>>>>> Do you have any comments? >>>>>>> >>>>>>> comments on what? I don't know what you are responding to here, sorry. 
>>>>>>> >>>>>> Can we have a patch to kernel for NUC6? It may cause more power(1mA) but >>> it >>>>> will have more compatibility >>>>>> >>>>> >>>>> Ricky, >>>>> >>>>> I don't understand why you want to completely remove the code that sets up >>> the >>>>> 1mA power saving. That code was there >>>>> even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I >>>>> assume it benefits some of the Realtek card >>>>> readers. Before your patch however, rtsx_pci_init_ocp() was not called >>>>> for the >>>>> rts5229 reader, but the patch introduced >>>>> an unconditional call to that function into rtsx_pci_init_hw(), which is >>>>> run for >>> the >>>>> rts5229. That is what now disables >>>>> the card reader. >>>>> >>>>> Now, I don't know whether other cards are affected, although I don't >>>>> recall >>>>> seeing any reported as I searched the kernel >>>>> and ubuntu bugzillas for any analysis of the problem. I know this is not >>>>> what >>> the >>>>> patch I sent does, but having thought >>>>> about it more, seems to me that the simplest fix is to skip the new call >>>>> to >>>>> rtsx_pci_init_ocp() if the reader is an rts5229. >>>>> >>>> >>>> Because we are thinking about if others our card reader that not belong A >>> series(my ocp patch coverage) also on NUC6 platform maybe have the same >>> problem... >>>> >>> >>> OK. What if we do make the new call but only for the card readers that are >>> in the >>> A series? Are they the ones that have >>> PID_5n
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
Thanks, Ricky. On 05/08/2020 03:35, 吳昊澄 Ricky wrote: > > >> -Original Message----- >> From: Chris Clayton [mailto:chris2...@googlemail.com] >> Sent: Tuesday, August 04, 2020 7:52 PM >> To: 吳昊澄 Ricky; gre...@linuxfoundation.org >> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann >> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader >> on >> Intel NUC boxes >> >> >> >> On 04/08/2020 11:46, 吳昊澄 Ricky wrote: >>>> -Original Message- >>>> From: Chris Clayton [mailto:chris2...@googlemail.com] >>>> Sent: Tuesday, August 04, 2020 4:51 PM >>>> To: 吳昊澄 Ricky; gre...@linuxfoundation.org >>>> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann >>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card >>>> reader on >>>> Intel NUC boxes >>>> >>>> >>>> >>>> On 04/08/2020 09:08, 吳昊澄 Ricky wrote: >>>>>> -Original Message- >>>>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org] >>>>>> Sent: Tuesday, August 04, 2020 3:49 PM >>>>>> To: 吳昊澄 Ricky >>>>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com; >>>> Arnd >>>>>> Bergmann >>>>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card >>>>>> reader >> on >>>>>> Intel NUC boxes >>>>>> >>>>>> On Tue, Aug 04, 2020 at 02:44:41AM +, 吳昊澄 Ricky wrote: >>>>>>> Hi Chris, >>>>>>> >>>>>>> rtsx_pci_write_register(pcr, FPDTL, OC_POWER_DOWN, >>>> OC_POWER_DOWN); >>>>>>> This register operation saved power under 1mA, so if do not care the 1mA >>>>>> power we can have a patch to remove it, make compatible with NUC6 >>>>>>> We tested others our card reader that remove it, we did not see any side >>>> effect >>>>>>> >>>>>>> Hi Greg k-h, >>>>>>> >>>>>>> Do you have any comments? >>>>>> >>>>>> comments on what? I don't know what you are responding to here, sorry. >>>>>> >>>>> Can we have a patch to kernel for NUC6? 
It may cause more power(1mA) but >> it >>>> will have more compatibility >>>>> >>>> >>>> Ricky, >>>> >>>> I don't understand why you want to completely remove the code that sets up >> the >>>> 1mA power saving. That code was there >>>> even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I >>>> assume it benefits some of the Realtek card >>>> readers. Before your patch however, rtsx_pci_init_ocp() was not called for >>>> the >>>> rts5229 reader, but the patch introduced >>>> an unconditional call to that function into rtsx_pci_init_hw(), which is >>>> run for >> the >>>> rts5229. That is what now disables >>>> the card reader. >>>> >>>> Now, I don't know whether other cards are affected, although I don't recall >>>> seeing any reported as I searched the kernel >>>> and ubuntu bugzillas for any analysis of the problem. I know this is not >>>> what >> the >>>> patch I sent does, but having thought >>>> about it more, seems to me that the simplest fix is to skip the new call to >>>> rtsx_pci_init_ocp() if the reader is an rts5229. >>>> >>> >>> Because we are thinking about if others our card reader that not belong A >> series(my ocp patch coverage) also on NUC6 platform maybe have the same >> problem... >>> >> >> OK. What if we do make the new call but only for the card readers that are >> in the >> A series? Are they the ones that have >> PID_5nnn defines in include/linux/rtsx_pci.h? Or is there another simple way >> of >> identifying that a reader is a member of >> the A series? >> >> I'm thinking of something like: >> static bool rtsx_pci_is_series_A(pcr) >> { >> unsigned short device = pcr->pci->device; >> >> return device == PID524A || device == PID_5249 || device == PID_5250 || >> device == PID_525A >> || device == PID_525A || device == PID_5260 || device == >> PID_5261; >> } >> >> then in rtsx_pci_init_hw() change the unconditional call to: >> >> if rtsx_pci_is_series_A(pcr) >> rtsx_pci_init_ocp(); >> >> Does that seem OK? 
>> > Previously, I want to remove > else { > /* OC power down */ > rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN, > OC_POWER_DOWN); > } > Because in our A-series card Reader we already assigned option->ocp_en to 1 > in self init_params() , this is an easy way to fix this problem > Ah, OK. I'll prepare the patch and send it to you once I've tested it. Chris
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
On 04/08/2020 11:46, 吳昊澄 Ricky wrote: >> -Original Message- >> From: Chris Clayton [mailto:chris2...@googlemail.com] >> Sent: Tuesday, August 04, 2020 4:51 PM >> To: 吳昊澄 Ricky; gre...@linuxfoundation.org >> Cc: LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd Bergmann >> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader >> on >> Intel NUC boxes >> >> >> >> On 04/08/2020 09:08, 吳昊澄 Ricky wrote: >>>> -Original Message- >>>> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org] >>>> Sent: Tuesday, August 04, 2020 3:49 PM >>>> To: 吳昊澄 Ricky >>>> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com; >> Arnd >>>> Bergmann >>>> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card >>>> reader on >>>> Intel NUC boxes >>>> >>>> On Tue, Aug 04, 2020 at 02:44:41AM +, 吳昊澄 Ricky wrote: >>>>> Hi Chris, >>>>> >>>>> rtsx_pci_write_register(pcr, FPDTL, OC_POWER_DOWN, >> OC_POWER_DOWN); >>>>> This register operation saved power under 1mA, so if do not care the 1mA >>>> power we can have a patch to remove it, make compatible with NUC6 >>>>> We tested others our card reader that remove it, we did not see any side >> effect >>>>> >>>>> Hi Greg k-h, >>>>> >>>>> Do you have any comments? >>>> >>>> comments on what? I don't know what you are responding to here, sorry. >>>> >>> Can we have a patch to kernel for NUC6? It may cause more power(1mA) but it >> will have more compatibility >>> >> >> Ricky, >> >> I don't understand why you want to completely remove the code that sets up >> the >> 1mA power saving. That code was there >> even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I >> assume it benefits some of the Realtek card >> readers. Before your patch however, rtsx_pci_init_ocp() was not called for >> the >> rts5229 reader, but the patch introduced >> an unconditional call to that function into rtsx_pci_init_hw(), which is run >> for the >> rts5229. 
>> That is what now disables the card reader.
>>
>> Now, I don't know whether other cards are affected, although I don't recall
>> seeing any reported when I searched the kernel and Ubuntu bugzillas for any
>> analysis of the problem. I know this is not what the patch I sent does, but
>> having thought about it more, it seems to me that the simplest fix is to
>> skip the new call to rtsx_pci_init_ocp() if the reader is an rts5229.
>>
>
> Because we are wondering whether our other card readers that do not belong
> to the A series (my OCP patch coverage) may have the same problem on the
> NUC6 platform...
>

OK. What if we do make the new call, but only for the card readers that are in
the A series? Are they the ones that have PID_5nnn defines in
include/linux/rtsx_pci.h? Or is there another simple way of identifying that a
reader is a member of the A series?

I'm thinking of something like:

static bool rtsx_pci_is_series_A(struct rtsx_pcr *pcr)
{
	unsigned short device = pcr->pci->device;

	return device == PID_524A || device == PID_5249 ||
	       device == PID_5250 || device == PID_525A ||
	       device == PID_5260 || device == PID_5261;
}

then in rtsx_pci_init_hw() change the unconditional call to:

	if (rtsx_pci_is_series_A(pcr))
		rtsx_pci_init_ocp(pcr);

Does that seem OK?

>> If you agree, I can prepare a patch and send it to you. Whatever the
>> solution is, it will also be needed in the stable kernels later than 5.0.
>>
>
> OK, I agree with your opinion. For now we can patch only the rts5229 first,
> so that NUC6 users can work well.
>
> Thank you
>
>> Chris
>>>> greg k-h
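A minimal, user-space sketch of the series-A check proposed above, in case it helps the discussion. Everything here is a stand-in: the mock_* structs and functions are hypothetical replacements for struct pci_dev, struct rtsx_pcr, rtsx_pci_init_hw() and rtsx_pci_init_ocp(), and the PID values are assumed to match the names. Whether these IDs really make up the A series is exactly the open question, so treat the list as a placeholder.

```c
#include <stdbool.h>

/* Assumed device IDs; check them against include/linux/rtsx_pci.h. */
#define PID_5249 0x5249
#define PID_524A 0x524A
#define PID_5250 0x5250
#define PID_525A 0x525A
#define PID_5260 0x5260
#define PID_5261 0x5261

/* Hypothetical stand-ins for struct pci_dev and struct rtsx_pcr. */
struct mock_pci_dev {
	unsigned short device;
};

struct mock_pcr {
	struct mock_pci_dev *pci;
	bool ocp_initialised;	/* records whether the OCP setup ran */
};

/* True if the reader is one of the IDs assumed to form the A series. */
static bool mock_is_series_a(const struct mock_pcr *pcr)
{
	switch (pcr->pci->device) {
	case PID_5249:
	case PID_524A:
	case PID_5250:
	case PID_525A:
	case PID_5260:
	case PID_5261:
		return true;
	default:
		return false;
	}
}

/* Stand-in for rtsx_pci_init_ocp(): only records that it was called. */
static void mock_init_ocp(struct mock_pcr *pcr)
{
	pcr->ocp_initialised = true;
}

/* Stand-in for the relevant part of rtsx_pci_init_hw(). */
static void mock_init_hw(struct mock_pcr *pcr)
{
	if (mock_is_series_a(pcr))
		mock_init_ocp(pcr);
}
```

With this shape an rts5229 (device ID 0x5229) never reaches the OCP setup, while the listed readers still do. A switch is used instead of a chain of comparisons only so that a duplicated ID would be caught by the compiler.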
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
On 04/08/2020 09:08, 吳昊澄 Ricky wrote: >> -Original Message- >> From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org] >> Sent: Tuesday, August 04, 2020 3:49 PM >> To: 吳昊澄 Ricky >> Cc: Chris Clayton; LKML; rdun...@infradead.org; philqua...@gmail.com; Arnd >> Bergmann >> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader >> on >> Intel NUC boxes >> >> On Tue, Aug 04, 2020 at 02:44:41AM +, 吳昊澄 Ricky wrote: >>> Hi Chris, >>> >>> rtsx_pci_write_register(pcr, FPDTL, OC_POWER_DOWN, OC_POWER_DOWN); >>> This register operation saved power under 1mA, so if do not care the 1mA >> power we can have a patch to remove it, make compatible with NUC6 >>> We tested others our card reader that remove it, we did not see any side >>> effect >>> >>> Hi Greg k-h, >>> >>> Do you have any comments? >> >> comments on what? I don't know what you are responding to here, sorry. >> > Can we have a patch to kernel for NUC6? It may cause more power(1mA) but it > will have more compatibility > Ricky, I don't understand why you want to completely remove the code that sets up the 1mA power saving. That code was there even before your patch (bede03a579b3b4a036003c4862cc1baa4ddc351f), so I assume it benefits some of the Realtek card readers. Before your patch however, rtsx_pci_init_ocp() was not called for the rts5229 reader, but the patch introduced an unconditional call to that function into rtsx_pci_init_hw(), which is run for the rts5229. That is what now disables the card reader. Now, I don't know whether other cards are affected, although I don't recall seeing any reported as I searched the kernel and ubuntu bugzillas for any analysis of the problem. I know this is not what the patch I sent does, but having thought about it more, seems to me that the simplest fix is to skip the new call to rtsx_pci_init_ocp() if the reader is an rts5229. If you agree, I can prepare a patch and send it to you. 
Whatever the solution is, it will also be needed in the stable kernels later than 5.0. Chris >> greg k-h >> >> --Please consider the environment before printing this e-mail.
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
Hi, Ricky On 03/08/2020 04:01, 吳昊澄 Ricky wrote: > Hi Chris, > > We don’t think this is our bug... > This register(FPDCTL) write to OC_POWER_DOWN is for our power saving feature, > not to disable the reader > In your case, we cannot reproduce this on our side that we mention before, we > don’t have the platform(Intel NUC Tall Arches Canyon NUC6CAYH Celeron J345) > to see this issue > But we think this issue maybe only on this platform, our RTS5229 works well > on the new kernel all platform that we have > > Ricky Perhaps I should have used the word regression rather than bug. I didn't buy the machine until earlier this year, but other people who have reported this problem have indicated that until bede03a579b3 was applied (during the 5.1 merge window), the driver supported the card reader on this on the Intel NUC boxes. I know that is true because I built and tested a 5.0 kernel and the card reader worked fine. I've also built and tested an 5.1-rc1 kernel and the card reader no longer works. Whether by design or by accident, the card reader worked and now it doesn't. That's a regression in my book! Since you signed off the patch that caused the regression, I believe it is your bug. Thanks. Chris > >> -Original Message- >> From: Chris Clayton [mailto:chris2...@googlemail.com] >> Sent: Monday, August 03, 2020 3:59 AM >> To: LKML; 吳昊澄 Ricky; gre...@linuxfoundation.org; rdun...@infradead.org; >> philqua...@gmail.com; Arnd Bergmann >> Subject: Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader >> on >> Intel NUC boxes >> >> Sorry, I should have said that the patch is against 5.7.12. It applies to >> upstream, >> but with offsets. >> >> On 02/08/2020 20:48, Chris Clayton wrote: >>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card >> reader on my Intel NUC6CAYH box. >>> >>> The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to >>> rtsx_pci_init_ocp() >> was added to rtsx_pci_init_hw(). 
>>> At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, >>> so in >> rtsx_pci_init_ocp() the cardreader >>> gets disabled. >>> >>> I've avoided this by making excution code that results in the reader being >> disabled conditional on the device >>> not being an RTS5229. Of course, other rtsxxx card readers may also be >> disabled by this bug. I don't have the >>> knowledge to address that, so I'll leave to the driver maintainers. >>> >>> The patch to avoid the bug is attached. >>> >>> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a >> rts5260") >>> Link: https://marc.info/?l=linux-kernel=159105912832257 >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003 >>> Signed-off-by: Chris Clayton >>> >>> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card >> reader on my Intel NUC6CAYH box. >>> >>> The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to >>> rtsx_pci_init_ocp() >> was added to rtsx_pci_init_hw(). >>> At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, >>> so in >> rtsx_pci_init_ocp() the cardreader >>> gets disabled. >>> >>> I've avoided this by making excution code that results in the reader being >> disabled conditional on the device >>> not being an RTS5229. Of course, other rtsxxx card readers may also be >> disabled by this bug. I don't have the >>> knowledge to address that, so I'll leave to the driver maintainers. >>> >>> The patch to avoid the bug is attached. >>> >>> Chris >>> >> >> --Please consider the environment before printing this e-mail.
Re: PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
Sorry, I should have said that the patch is against 5.7.12. It applies to upstream, but with offsets.

On 02/08/2020 20:48, Chris Clayton wrote:
> bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card
> reader disabled on my Intel NUC6CAYH box.
>
> The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to
> rtsx_pci_init_ocp() was added to rtsx_pci_init_hw(). At the call point,
> pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, so in
> rtsx_pci_init_ocp() the card reader gets disabled.
>
> I've avoided this by making execution of the code that results in the
> reader being disabled conditional on the device not being an RTS5229. Of
> course, other rtsxxx card readers may also be disabled by this bug. I don't
> have the knowledge to address that, so I'll leave that to the driver
> maintainers.
>
> The patch to avoid the bug is attached.
>
> Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
> Link: https://marc.info/?l=linux-kernel=159105912832257
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
> Signed-off-by: Chris Clayton
>
> Chris
>
PATCH: rtsx_pci driver - don't disable the rts5229 card reader on Intel NUC boxes
bede03a579b3 introduced a bug which leaves the rts5229 PCI Express card reader disabled on my Intel NUC6CAYH box.

The bug is in drivers/misc/cardreader/rtsx_pcr.c. A call to rtsx_pci_init_ocp() was added to rtsx_pci_init_hw(). At the call point, pcr->ops->init_ocp is NULL and pcr->option.ocp_en is 0, so in rtsx_pci_init_ocp() the card reader gets disabled.

I've avoided this by making execution of the code that results in the reader being disabled conditional on the device not being an RTS5229. Of course, other rtsxxx card readers may also be disabled by this bug. I don't have the knowledge to address that, so I'll leave that to the driver maintainers.

The patch to avoid the bug is attached.

Fixes: bede03a579b3 ("misc: rtsx: Enable OCP for rts522a rts524a rts525a rts5260")
Link: https://marc.info/?l=linux-kernel=159105912832257
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204003
Signed-off-by: Chris Clayton

Chris

--- linux-5.7.12/drivers/misc/cardreader/rtsx_pcr.c.orig	2020-08-02 13:36:50.216947944 +0100
+++ linux-5.7.12/drivers/misc/cardreader/rtsx_pcr.c	2020-08-02 18:37:30.456610731 +0100
@@ -1200,9 +1200,13 @@ void rtsx_pci_init_ocp(struct rtsx_pcr *
 				SD_OCP_GLITCH_MASK, pcr->hw_param.ocp_glitch);
 			rtsx_pci_enable_ocp(pcr);
 		} else {
-			/* OC power down */
-			rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
-				OC_POWER_DOWN);
+			/* On (some?) Intel NUC platforms, this disables
+			 * the rts5229 cardreader, so don't do it
+			 */
+			if(!CHK_PCI_PID(pcr, 0x5229))
+				/* OC power down */
+				rtsx_pci_write_register(pcr, FPDCTL, OC_POWER_DOWN,
+					OC_POWER_DOWN);
 		}
 	}
 }
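For anyone following without the driver source to hand, here is a rough user-space model of the control flow the patch changes. It is only a sketch: the mock_* types and fields are hypothetical, the register write is reduced to setting a flag, and the three-way branch is inferred from the description above (init_ocp hook, then ocp_en, then the power-down fallback) rather than copied from the driver.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_pcr;

/* Hypothetical stand-in for the per-chip ops table. */
struct mock_ops {
	void (*init_ocp)(struct mock_pcr *pcr);
};

/* Hypothetical stand-in for struct rtsx_pcr, reduced to what the branch needs. */
struct mock_pcr {
	unsigned short pid;		/* PCI device ID, e.g. 0x5229 */
	const struct mock_ops *ops;	/* may lack an init_ocp hook */
	bool ocp_en;			/* models pcr->option.ocp_en */
	bool ocp_enabled;		/* set when the generic OCP path runs */
	bool oc_powered_down;		/* set when the OC power-down write would happen */
};

/* Models rtsx_pci_init_ocp() with the patch applied. */
static void mock_init_ocp(struct mock_pcr *pcr)
{
	if (pcr->ops && pcr->ops->init_ocp) {
		/* Chip-specific OCP setup takes precedence. */
		pcr->ops->init_ocp(pcr);
	} else if (pcr->ocp_en) {
		/* Generic OCP setup. */
		pcr->ocp_enabled = true;
	} else if (pcr->pid != 0x5229) {
		/* OC power down, skipped for the rts5229 as in the patch. */
		pcr->oc_powered_down = true;
	}
}
```

On the NUC6CAYH the reader has no init_ocp hook and ocp_en is 0, so before the patch it always fell through to the power-down write; in this model the rts5229 is now left alone while every other reader behaves as before.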
Re: Linux 5.3.6
On 12/10/2019 21:55, Gabriel C wrote:
> On Sat, 12 Oct 2019 at 21:16, Chris Clayton wrote:
>>
>>
>>> I'm announcing the release of the 5.3.6 kernel.
>>
>>
>> 5.3.6 build fails here with:
>>
>> arch/x86/entry/vdso/vdso64.so.dbg: undefined symbols found
>>   CC      arch/x86/kernel/cpu/mce/threshold.o
>> make[3]: *** [arch/x86/entry/vdso/Makefile:59: arch/x86/entry/vdso/vdso64.so.dbg] Error 1
>> make[3]: *** Deleting file 'arch/x86/entry/vdso/vdso64.so.dbg'
>> make[2]: *** [scripts/Makefile.build:497: arch/x86/entry/vdso] Error 2
>> make[1]: *** [scripts/Makefile.build:497: arch/x86/entry] Error 2
>> make[1]: *** Waiting for unfinished jobs
>>
>
> What is your default linker?
>
> Also, does make LD=ld.bfd fix that for you?
>

Thanks, Gabriel.

The default linker is gold, but your suggestion above fixed the build. I think I'll set the default to ld.bfd.

> See https://bugzilla.kernel.org/show_bug.cgi?id=204951
>
> BR,
>
> Gabriel C.
>
Re: Linux 5.3.6
> I'm announcing the release of the 5.3.6 kernel. 5.3.6 build fails here with: arch/x86/entry/vdso/vdso64.so.dbg: undefined symbols found CC arch/x86/kernel/cpu/mce/threshold.o make[3]: *** [arch/x86/entry/vdso/Makefile:59: arch/x86/entry/vdso/vdso64.so.dbg] Error 1 make[3]: *** Deleting file 'arch/x86/entry/vdso/vdso64.so.dbg' make[2]: *** [scripts/Makefile.build:497: arch/x86/entry/vdso] Error 2 make[1]: *** [scripts/Makefile.build:497: arch/x86/entry] Error 2 make[1]: *** Waiting for unfinished jobs.... Chris Clayton
Re: [PATCH] timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
Thanks Thomas. On 22/08/2019 12:00, Thomas Gleixner wrote: > The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the > nanoseconds based boot time offset left by the clocksource shift. That > overflows once the boot time offset becomes large enough. As a consequence > CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to > misbehave. > > Fix it by storing a timespec64 representation of the offset when boot time > is adjusted and add that to the MONOTONIC base time value in the vdso data > page. Using the timespec64 representation avoids a 64bit division in the > update code. > I've tested resume from both suspend and hibernate and this patch fixes the problem I reported. Tested-by: Chris Clayton > Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() > implementation") > Reported-by: Chris Clayton > Signed-off-by: Thomas Gleixner > --- > include/linux/timekeeper_internal.h |5 + > kernel/time/timekeeping.c |5 + > kernel/time/vsyscall.c | 22 +- > 3 files changed, 23 insertions(+), 9 deletions(-) > > --- a/include/linux/timekeeper_internal.h > +++ b/include/linux/timekeeper_internal.h > @@ -57,6 +57,7 @@ struct tk_read_base { > * @cs_was_changed_seq: The sequence number of clocksource change events > * @next_leap_ktime: CLOCK_MONOTONIC time value of a pending leap-second > * @raw_sec: CLOCK_MONOTONIC_RAW time in seconds > + * @monotonic_to_boot: CLOCK_MONOTONIC to CLOCK_BOOTTIME offset > * @cycle_interval: Number of clock cycles in one NTP interval > * @xtime_interval: Number of clock shifted nano seconds in one NTP > * interval. > @@ -84,6 +85,9 @@ struct tk_read_base { > * > * wall_to_monotonic is no longer the boot time, getboottime must be > * used instead. > + * > + * @monotonic_to_boottime is a timespec64 representation of @offs_boot to > + * accelerate the VDSO update for CLOCK_BOOTTIME. 
> */ > struct timekeeper { > struct tk_read_base tkr_mono; > @@ -99,6 +103,7 @@ struct timekeeper { > u8 cs_was_changed_seq; > ktime_t next_leap_ktime; > u64 raw_sec; > + struct timespec64 monotonic_to_boot; > > /* The following members are for timekeeping internal use */ > u64 cycle_interval; > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -146,6 +146,11 @@ static void tk_set_wall_to_mono(struct t > static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta) > { > tk->offs_boot = ktime_add(tk->offs_boot, delta); > + /* > + * Timespec representation for VDSO update to avoid 64bit division > + * on every update. > + */ > + tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot); > } > > /* > --- a/kernel/time/vsyscall.c > +++ b/kernel/time/vsyscall.c > @@ -17,7 +17,7 @@ static inline void update_vdso_data(stru > struct timekeeper *tk) > { > struct vdso_timestamp *vdso_ts; > - u64 nsec; > + u64 nsec, sec; > > vdata[CS_HRES_COARSE].cycle_last= tk->tkr_mono.cycle_last; > vdata[CS_HRES_COARSE].mask = tk->tkr_mono.mask; > @@ -45,23 +45,27 @@ static inline void update_vdso_data(stru > } > vdso_ts->nsec = nsec; > > - /* CLOCK_MONOTONIC_RAW */ > - vdso_ts = [CS_RAW].basetime[CLOCK_MONOTONIC_RAW]; > - vdso_ts->sec= tk->raw_sec; > - vdso_ts->nsec = tk->tkr_raw.xtime_nsec; > + /* Copy MONOTONIC time for BOOTTIME */ > + sec = vdso_ts->sec; > + /* Add the boot offset */ > + sec += tk->monotonic_to_boot.tv_sec; > + nsec+= (u64)tk->monotonic_to_boot.tv_nsec << tk->tkr_mono.shift; > > /* CLOCK_BOOTTIME */ > vdso_ts = [CS_HRES_COARSE].basetime[CLOCK_BOOTTIME]; > - vdso_ts->sec= tk->xtime_sec + tk->wall_to_monotonic.tv_sec; > - nsec = tk->tkr_mono.xtime_nsec; > - nsec += ((u64)(tk->wall_to_monotonic.tv_nsec + > -ktime_to_ns(tk->offs_boot)) << tk->tkr_mono.shift); > + vdso_ts->sec= sec; > + > while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { > nsec -= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift); > vdso_ts->sec++; > } > 
vdso_ts->nsec = nsec; > > + /* CLOCK_MONOTONIC_RAW */ > + vdso_ts = [CS_RAW].basetime[CLOCK_MONOTONIC_RAW]; > + vdso_ts->sec= tk->raw_sec; > + vdso_ts->nsec = tk->tkr_raw.xtime_nsec; > + > /* CLOCK_TAI */ > vdso_ts = [CS_HRES_COARSE].basetime[CLOCK_TAI]; > vdso_ts->sec= tk->xtime_sec + (s64)tk->tai_offset; >
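To make the overflow in the changelog concrete, here is a small user-space sketch. It is not the kernel code: the struct and function names are made up, the shift of 24 used in the discussion below is just an illustrative clocksource shift, and the division is done inline for brevity, whereas the patch caches a timespec64 precisely so that the update path does not divide.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* True if (ns << shift) no longer fits in 64 bits. */
static bool shift_overflows(uint64_t ns, unsigned int shift)
{
	return shift != 0 && (ns >> (64 - shift)) != 0;
}

/* Seconds plus shifted nanoseconds, the form stored in the vdso data page. */
struct shifted_time {
	uint64_t sec;
	uint64_t shifted_nsec;
};

/*
 * Add a boot offset given in nanoseconds to a base time. Only the
 * sub-second remainder (below NSEC_PER_SEC) is ever shifted, so the shift
 * cannot overflow for any realistic clocksource shift.
 */
static struct shifted_time add_boot_offset(struct shifted_time base,
					   uint64_t offs_boot_ns,
					   unsigned int shift)
{
	const uint64_t one_sec = NSEC_PER_SEC << shift;

	base.sec += offs_boot_ns / NSEC_PER_SEC;
	base.shifted_nsec += (offs_boot_ns % NSEC_PER_SEC) << shift;

	/* Carry whole seconds out of the shifted nanoseconds. */
	while (base.shifted_nsec >= one_sec) {
		base.shifted_nsec -= one_sec;
		base.sec++;
	}
	return base;
}
```

With a shift of 24, shifting the whole offset overflows once it exceeds 2^40 ns, which is a little over 18 minutes of accumulated sleep time; the real threshold depends on the shift of the clocksource actually in use. Splitting the offset into seconds and nanoseconds first keeps the shifted quantity below one second's worth of nanoseconds.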
Re: PROBLEM: 5.3.0-rc* causes iwlwifi failure
Thanks, Stuart.

On 18/08/2019 11:55, Stuart Little wrote:
> On Sun, Aug 18, 2019 at 09:17:59AM +0100, Chris Clayton wrote:
>>
>>
>> On 17/08/2019 22:44, Stuart Little wrote:
>>> After some private coaching from Serge Belyshev on git-revert I can
>>> confirm that reverting that commit atop the current tree resolves the
>>> issue (the wifi card scans for and finds networks just fine, no dmesg
>>> errors reported, etc.).
>>>
>>
>> I've reported the "Microcode SW error detected" issue too, but, wrongly,
>> only to LKML. I'll point that thread to this one. I've also been
>> experiencing my network stopping working after suspend/resume, but haven't
>> got round to reporting that yet.
>>
>> What was the git magic that you acquired to revert the patch, please?
>>

By following the advice below, I reverted 4fd445a2c855bbcab81fbe06d110e78dbd974a5b, and with the resulting kernel I haven't seen the "Microcode SW error detected" message again. I am, however, still experiencing the problem of my network not working after resume from suspend. I've reported it to LKML, so it can be followed there should anyone need or want to.

>
> $ git revert
>
> This will fail as noted, but will place you in a revert mode where you can
> fix the errors.
>
> $ git status
>
> will show (it did in my case, for the latest Linux tree at the time I did
> this) a modified file
>
> drivers/net/wireless/intel/iwlwifi/mvm/fw.c
>
> to be committed without issue and a conflicted file
>
> drivers/net/wireless/intel/iwlwifi/mvm/nvm.c
>
> whose conflicts you have to resolve first.
>
> I then opened that conflicted file in a text editor and simply removed
> everything between the lines
>
> <<<<<<< HEAD
>
> and
>
> >>>>>>> parent of 4fd445a2c855... iwlwifi: mvm: Add log information about SAR status
>
> (inclusive). This resolved the conflict, whereupon
>
> $ git revert --continue
>
> and
>
> $ git commit -a
>
> will finish the reversion.
> >>> On Sat, Aug 17, 2019 at 11:59:59AM +0300, Serge Belyshev wrote: >>>> >>>>> I am on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Linux >>>>> x86_64 (Slackware), with a custom-compiled 5.3.0-rc4 (.config >>>>> attached). >>>>> >>>>> I am using the Intel wifi adapter on this machine: >>>>> >>>>> 02:00.0 Network controller: Intel Corporation Device 24fb (rev 10) >>>>> >>>>> with the iwlwifi driver. I am attaching the output to 'lspci -vv -s >>>>> 02:00.0' as the file device-info. >>>>> >>>>> All 5.3.0-rc* versions I have tried (including rc4) cause multiple >>>>> dmesg iwlwifi-related errors (dmesg attached). Examples: >>>>> >>>>> iwlwifi :02:00.0: Failed to get geographic profile info -5 >>>>> iwlwifi :02:00.0: Microcode SW error detected. Restarting 0x8200 >>>>> iwlwifi :02:00.0: 0x0038 | BAD_COMMAND >>>>> >>>> >>>> I have my logs filled with similar garbage throughout 5.3-rc*. Also >>>> since 5.3-rcsomething not only it WARNS in dmesg about firmware failure, >>>> but completely stops working after suspend/resume cycle. >>>> >>>> It looks like that: >>>> >>>> commit 4fd445a2c855bbcab81fbe06d110e78dbd974a5b >>>> Author: Haim Dreyfuss >>>> Date: Thu May 2 11:45:02 2019 +0300 >>>> >>>> iwlwifi: mvm: Add log information about SAR status >>>> >>>> Inform users when SAR status is changing. >>>> >>>> Signed-off-by: Haim Dreyfuss >>>> Signed-off-by: Luca Coelho >>>> >>>> >>>> is the culprit. (manually) reverting it on top of 5.3-rc4 makes >>>> everything work again. >>>
Regression in 5.3-rc1 and later
Hi everyone, Firstly, apologies to anyone on the long cc list that turns out not to be particularly interested in the following, but you were all marked as cc'd in the commit message below. I've found a problem that isn't present in 5.2 series or 4.19 series kernels, and seems to have arrived in 5.3-rc1. The problem is that if I suspend (to ram) my laptop, on resume 14 minutes or more after suspending, I have no networking functionality. If I resume the laptop after 13 minutes or less, networking works fine. I haven't tried to get finer grained timings between 13 and 14 minutes, but can do if it would help. ifconfig shows that wlan0 is still up and still has its assigned ip address but, for instance, a ping of any other device on my network, fails as does pinging, say, kernel.org. I've tried "downing" the network with (/sbin/ifdown) and unloading the iwlmvm module and then reloading the module and "upping" (/sbin/ifup) the network, but my network is still unusable. I should add that the problem also manifests if I hibernate the laptop, although my testing of this has been minimal. I can do more if required. As I say, the problem first appears in 5.3-rc1, so I've bisected between 5.2.0 and 5.3-rc1 and that concluded with: [chris:~/kernel/linux]$ git bisect good 7ac8707479886c75f353bfb6a8273f423cfccb23 is the first bad commit commit 7ac8707479886c75f353bfb6a8273f423cfccb23 Author: Vincenzo Frascino Date: Fri Jun 21 10:52:49 2019 +0100 x86/vdso: Switch to generic vDSO implementation The x86 vDSO library requires some adaptations to take advantage of the newly introduced generic vDSO library. 
Introduce the following changes: - Modification of vdso.c to be compliant with the common vdso datapage - Use of lib/vdso for gettimeofday [ tglx: Massaged changelog and cleaned up the function signature formatting ] Signed-off-by: Vincenzo Frascino Signed-off-by: Thomas Gleixner Cc: linux-a...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-m...@vger.kernel.org Cc: linux-kselft...@vger.kernel.org Cc: Catalin Marinas Cc: Will Deacon Cc: Arnd Bergmann Cc: Russell King Cc: Ralf Baechle Cc: Paul Burton Cc: Daniel Lezcano Cc: Mark Salyzyn Cc: Peter Collingbourne Cc: Shuah Khan Cc: Dmitry Safonov <0x7f454...@gmail.com> Cc: Rasmus Villemoes Cc: Huw Davies Cc: Shijith Thotton Cc: Andre Przywara Link: https://lkml.kernel.org/r/20190621095252.32307-23-vincenzo.frasc...@arm.com arch/x86/Kconfig | 3 + arch/x86/entry/vdso/Makefile | 9 ++ arch/x86/entry/vdso/vclock_gettime.c | 245 --- arch/x86/entry/vdso/vdsox32.lds.S| 1 + arch/x86/entry/vsyscall/Makefile | 2 - arch/x86/entry/vsyscall/vsyscall_gtod.c | 83 --- arch/x86/include/asm/pvclock.h | 2 +- arch/x86/include/asm/vdso/gettimeofday.h | 191 arch/x86/include/asm/vdso/vsyscall.h | 44 ++ arch/x86/include/asm/vgtod.h | 75 +- arch/x86/include/asm/vvar.h | 7 +- arch/x86/kernel/pvclock.c| 1 + 12 files changed, 284 insertions(+), 379 deletions(-) delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h create mode 100644 arch/x86/include/asm/vdso/vsyscall.h To confirm my bisection was correct, I did a git checkout of 7ac8707479886c75f353bfb6a8273f423cfccb2. As expected, the kernel exhibited the problem I've described. However, a kernel built at the immediately preceding (parent?) commit (bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a (>= 14minute) suspend/resume cycle. As the module name implies, I'm using wireless networking. The hardware is detected as "Intel(R) Wireless-AC 9260 160MHz, REV=0x324" by iwlwifi. 
I'm more than happy to provide additional diagnostics (but may need a little hand-holding) and to apply diagnostic or fix patches, but please cc me on any reply as I'm not subscribed to any of the kernel-related mailing lists. Chris
Re: iwlwifi: microcode SW error detected
On 18/08/2019 09:21, Chris Clayton wrote:
>
>
> On 17/08/2019 08:19, Chris Clayton wrote:
>> Hi.
>>
>> I just found the following error in the output from dmesg.
>>
>> [ 4023.460058] iwlwifi :02:00.0: Microcode SW error detected. Restarting 0x0.
>
> Since reporting, I've found that this problem is being explored in the
> thread that starts at https://marc.info/?l=linux-kernel=15660151913.

Mmm, that's a dead link. I don't know what happened there, but the real link is https://marc.info/?l=linux-kernel=156265244614126

>
> Chris
>
linux-5.3.0-rc5: new build warning
Hi, I've just built 5.3.0-rc5 and a warning that I do not recall having seen before was emitted: ... HOSTCC scripts/extract-cert HOSTCC /mnt/kernel/linux/tools/objtool/fixdep.o HOSTLD arch/x86/tools/relocs HOSTLD /mnt/kernel/linux/tools/objtool/fixdep-in.o LINK /mnt/kernel/linux/tools/objtool/fixdep CC /mnt/kernel/linux/tools/objtool/builtin-check.o CC /mnt/kernel/linux/tools/objtool/builtin-orc.o GEN /mnt/kernel/linux/tools/objtool/arch/x86/lib/inat-tables.c awk: arch/x86/tools/gen-insn-attr-x86.awk:260: warning: regexp escape sequence `\:' is not a known regexp operator awk: arch/x86/tools/gen-insn-attr-x86.awk:350: (FILENAME=arch/x86/lib/x86-opcode-map.txt FNR=41) warning: regexp escape sequence `\&' is not a known regexp operator CC /mnt/kernel/linux/tools/objtool/exec-cmd.o CC /mnt/kernel/linux/tools/objtool/check.o CC /mnt/kernel/linux/tools/objtool/arch/x86/decode.o CC /mnt/kernel/linux/tools/objtool/orc_gen.o CC /mnt/kernel/linux/tools/objtool/help.o CC /mnt/kernel/linux/tools/objtool/orc_dump.o CC /mnt/kernel/linux/tools/objtool/pager.o ... Happy to test the fix, but please cc me as I'm not subscribed Chris
Re: iwlwifi: microcode SW error detected
On 17/08/2019 08:19, Chris Clayton wrote: > Hi. > > I just found the following error in the output from dmesg. > > [ 4023.460058] iwlwifi :02:00.0: Microcode SW error detected. Restarting > 0x0. Since reporting, I've found that this problem is being explored in the thread that starts at https://marc.info/?l=linux-kernel=15660151913. Chris
Re: PROBLEM: 5.3.0-rc* causes iwlwifi failure
On 17/08/2019 22:44, Stuart Little wrote: > After some private coaching from Serge Belyshev on git-revert I can confirm > that reverting that commit atop the current tree resolves the issue (the wifi > card scans for and finds networks just fine, no dmesg errors reported, etc.). > I've reported the "Microcode SW error detected" issue too, but, wrongly, only to LKML. I'll point that thread to this one. I've also been experiencing my network stopping working after suspend resume, but haven't got round to reporting that yet. What was the git magic that you acquired to revert the patch, please? > On Sat, Aug 17, 2019 at 11:59:59AM +0300, Serge Belyshev wrote: >> >>> I am on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Linux >>> x86_64 (Slackware), with a custom-compiled 5.3.0-rc4 (.config >>> attached). >>> >>> I am using the Intel wifi adapter on this machine: >>> >>> 02:00.0 Network controller: Intel Corporation Device 24fb (rev 10) >>> >>> with the iwlwifi driver. I am attaching the output to 'lspci -vv -s >>> 02:00.0' as the file device-info. >>> >>> All 5.3.0-rc* versions I have tried (including rc4) cause multiple >>> dmesg iwlwifi-related errors (dmesg attached). Examples: >>> >>> iwlwifi :02:00.0: Failed to get geographic profile info -5 >>> iwlwifi :02:00.0: Microcode SW error detected. Restarting 0x8200 >>> iwlwifi :02:00.0: 0x0038 | BAD_COMMAND >>> >> >> I have my logs filled with similar garbage throughout 5.3-rc*. Also >> since 5.3-rcsomething not only it WARNS in dmesg about firmware failure, >> but completely stops working after suspend/resume cycle. >> >> It looks like that: >> >> commit 4fd445a2c855bbcab81fbe06d110e78dbd974a5b >> Author: Haim Dreyfuss >> Date: Thu May 2 11:45:02 2019 +0300 >> >> iwlwifi: mvm: Add log information about SAR status >> >> Inform users when SAR status is changing. >> >> Signed-off-by: Haim Dreyfuss >> Signed-off-by: Luca Coelho >> >> >> is the culprit. 
(manually) reverting it on top of 5.3-rc4 makes >> everything work again. >
iwlwifi: microcode SW error detected
Hi.

I just found the following error in the output from dmesg.

[ 4023.460058] iwlwifi :02:00.0: Microcode SW error detected. Restarting 0x0.
[ 4023.460178] iwlwifi :02:00.0: Start IWL Error Log Dump:
[ 4023.460179] iwlwifi :02:00.0: Status: 0x0080, count: 6
[ 4023.460180] iwlwifi :02:00.0: Loaded firmware version: 46.93e59cf4.0
[ 4023.460181] iwlwifi :02:00.0: 0x22CE | ADVANCED_SYSASSERT
[ 4023.460182] iwlwifi :02:00.0: 0x0590A2F0 | trm_hw_status0
[ 4023.460182] iwlwifi :02:00.0: 0x | trm_hw_status1
[ 4023.460183] iwlwifi :02:00.0: 0x00488472 | branchlink2
[ 4023.460183] iwlwifi :02:00.0: 0x00479392 | interruptlink1
[ 4023.460184] iwlwifi :02:00.0: 0x | interruptlink2
[ 4023.460184] iwlwifi :02:00.0: 0x012C | data1
[ 4023.460185] iwlwifi :02:00.0: 0x | data2
[ 4023.460186] iwlwifi :02:00.0: 0x0400 | data3
[ 4023.460186] iwlwifi :02:00.0: 0x42001A44 | beacon time
[ 4023.460187] iwlwifi :02:00.0: 0x4E9F05CD | tsf low
[ 4023.460187] iwlwifi :02:00.0: 0x00D8 | tsf hi
[ 4023.460188] iwlwifi :02:00.0: 0x | time gp1
[ 4023.460188] iwlwifi :02:00.0: 0xEF55F6D0 | time gp2
[ 4023.460189] iwlwifi :02:00.0: 0x0001 | uCode revision type
[ 4023.460190] iwlwifi :02:00.0: 0x002E | uCode version major
[ 4023.460190] iwlwifi :02:00.0: 0x93E59CF4 | uCode version minor
[ 4023.460191] iwlwifi :02:00.0: 0x0321 | hw version
[ 4023.460191] iwlwifi :02:00.0: 0x00C89004 | board version
[ 4023.460192] iwlwifi :02:00.0: 0x0A05001C | hcmd
[ 4023.460192] iwlwifi :02:00.0: 0xA2F93802 | isr0
[ 4023.460193] iwlwifi :02:00.0: 0x0004 | isr1
[ 4023.460193] iwlwifi :02:00.0: 0x1802 | isr2
[ 4023.460194] iwlwifi :02:00.0: 0x40417DCD | isr3
[ 4023.460195] iwlwifi :02:00.0: 0x | isr4
[ 4023.460195] iwlwifi :02:00.0: 0x0A04001C | last cmd Id
[ 4023.460196] iwlwifi :02:00.0: 0x00018802 | wait_event
[ 4023.460196] iwlwifi :02:00.0: 0x4A88 | l2p_control
[ 4023.460197] iwlwifi :02:00.0: 0x0020 | l2p_duration
[ 4023.460197] iwlwifi :02:00.0: 0x03BF | l2p_mhvalid
[ 4023.460198] iwlwifi :02:00.0: 0x00EF | l2p_addr_match
[ 4023.460198] iwlwifi :02:00.0: 0x000D | lmpm_pmg_sel
[ 4023.460199] iwlwifi :02:00.0: 0x19071250 | timestamp
[ 4023.460199] iwlwifi :02:00.0: 0x14C0E8E8 | flow_handler
[ 4023.460257] iwlwifi :02:00.0: 0x | ADVANCED_SYSASSERT
[ 4023.460257] iwlwifi :02:00.0: 0x | umac branchlink1
[ 4023.460258] iwlwifi :02:00.0: 0x | umac branchlink2
[ 4023.460258] iwlwifi :02:00.0: 0x | umac interruptlink1
[ 4023.460259] iwlwifi :02:00.0: 0x | umac interruptlink2
[ 4023.460260] iwlwifi :02:00.0: 0x | umac data1
[ 4023.460260] iwlwifi :02:00.0: 0x | umac data2
[ 4023.460261] iwlwifi :02:00.0: 0x | umac data3
[ 4023.460261] iwlwifi :02:00.0: 0x | umac major
[ 4023.460262] iwlwifi :02:00.0: 0x | umac minor
[ 4023.460262] iwlwifi :02:00.0: 0x | frame pointer
[ 4023.460263] iwlwifi :02:00.0: 0x | stack pointer
[ 4023.460263] iwlwifi :02:00.0: 0x | last host cmd
[ 4023.460264] iwlwifi :02:00.0: 0x | isr status reg
[ 4023.460278] iwlwifi :02:00.0: Fseq Registers:
[ 4023.460282] iwlwifi :02:00.0: 0x0568FC22 | FSEQ_ERROR_CODE
[ 4023.460289] iwlwifi :02:00.0: 0x | FSEQ_TOP_INIT_VERSION
[ 4023.460297] iwlwifi :02:00.0: 0xDFFC324F | FSEQ_CNVIO_INIT_VERSION
[ 4023.460304] iwlwifi :02:00.0: 0xA371 | FSEQ_OTP_VERSION
[ 4023.460312] iwlwifi :02:00.0: 0xC338B29A | FSEQ_TOP_CONTENT_VERSION
[ 4023.460319] iwlwifi :02:00.0: 0xD9E91E16 | FSEQ_ALIVE_TOKEN
[ 4023.460327] iwlwifi :02:00.0: 0xAC99E6BF | FSEQ_CNVI_ID
[ 4023.460334] iwlwifi :02:00.0: 0x07665623 | FSEQ_CNVR_ID
[ 4023.460342] iwlwifi :02:00.0: 0x01000200 | CNVI_AUX_MISC_CHIP
[ 4023.460349] iwlwifi :02:00.0: 0x01300202 | CNVR_AUX_MISC_CHIP
[ 4023.460357] iwlwifi :02:00.0: 0x485B | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
[ 4023.460413] iwlwifi :02:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
[ 4023.460421] iwlwifi :02:00.0: Collecting data: trigger 2 fired.
[ 4023.460424] ieee80211 phy0: Hardware restart was requested
[ 4024.639366] iwlwifi :02:00.0: Applying debug destination EXTERNAL_DRAM
[ 4024.753171] iwlwifi :02:00.0: Applying debug destination EXTERNAL_DRAM
[ 4024.817999] iwlwifi :02:00.0: FW already configured (0) - re-configuring
[ 4024.829374] iwlwifi :02:00.0: BIOS contains WGDS but no WRDS

The output messages from the driver when the system starts are:

[3.667365] iwlwifi :02:00.0: enabling device ( -> 0002)
[3.670357] iwlwifi :02:00.0: Found debug destination: EXTERNAL_DRAM
[3.670360] iwlwifi :02:00.0: Found debug configuration: 0
[3.670525] iwlwifi
Re: [PATCH v2] x86/boot: save fields explicitly, zero out everything else
On 31/07/2019 06:46, john.hubb...@gmail.com wrote:
> From: John Hubbard
>
> Recent gcc compilers (gcc 9.1) generate warnings about an
> out of bounds memset, if you try to memset across several fields
> of a struct. This generated a couple of warnings on x86_64 builds.
>
> Fix this by explicitly saving the fields in struct boot_params
> that are intended to be preserved, and zeroing all the rest.
>

I applied John's patch below to v5.3-rc3-285-gecb095bff5d4 and have been
running the resultant kernel for two days now, including 7 or 8 cold starts
and reboots. The warnings that were produced by gcc9 are no longer emitted
and, other than a pre-existing problem (no network after resume from suspend
or hibernate, which I will investigate and, if necessary, report later today),
the kernel has supported my typical day-to-day activities (building software,
email, browsing, listening to music, watching video) without problem.

Tested-by: Chris Clayton

> Suggested-by: Thomas Gleixner
> Suggested-by: H. Peter Anvin
> Signed-off-by: John Hubbard
> ---
>  arch/x86/include/asm/bootparam_utils.h | 62 +++---
>  1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/bootparam_utils.h
> b/arch/x86/include/asm/bootparam_utils.h
> index 101eb944f13c..514aee24b8de 100644
> --- a/arch/x86/include/asm/bootparam_utils.h
> +++ b/arch/x86/include/asm/bootparam_utils.h
> @@ -18,6 +18,20 @@
>   * Note: efi_info is commonly left uninitialized, but that field has a
>   * private magic, so it is better to leave it unchanged.
>   */
> +
> +#define sizeof_mbr(type, member) ({ sizeof(((type *)0)->member); })
> +
> +#define BOOT_PARAM_PRESERVE(struct_member)				\
> +	{								\
> +		.start = offsetof(struct boot_params, struct_member),	\
> +		.len = sizeof_mbr(struct boot_params, struct_member),	\
> +	}
> +
> +struct boot_params_to_save {
> +	unsigned int start;
> +	unsigned int len;
> +};
> +
>  static void sanitize_boot_params(struct boot_params *boot_params)
>  {
>  	/*
> @@ -35,21 +49,39 @@ static void sanitize_boot_params(struct boot_params
> *boot_params)
>  	 * problems again.
>  	 */
>  	if (boot_params->sentinel) {
> -		/* fields in boot_params are left uninitialized, clear them */
> -		boot_params->acpi_rsdp_addr = 0;
> -		memset(&boot_params->ext_ramdisk_image, 0,
> -		       (char *)&boot_params->efi_info -
> -			(char *)&boot_params->ext_ramdisk_image);
> -		memset(&boot_params->kbd_status, 0,
> -		       (char *)&boot_params->hdr -
> -		       (char *)&boot_params->kbd_status);
> -		memset(&boot_params->_pad7[0], 0,
> -		       (char *)&boot_params->edd_mbr_sig_buffer[0] -
> -			(char *)&boot_params->_pad7[0]);
> -		memset(&boot_params->_pad8[0], 0,
> -		       (char *)&boot_params->eddbuf[0] -
> -			(char *)&boot_params->_pad8[0]);
> -		memset(&boot_params->_pad9[0], 0, sizeof(boot_params->_pad9));
> +		static struct boot_params scratch;
> +		char *bp_base = (char *)boot_params;
> +		char *save_base = (char *)&scratch;
> +		int i;
> +
> +		const struct boot_params_to_save to_save[] = {
> +			BOOT_PARAM_PRESERVE(screen_info),
> +			BOOT_PARAM_PRESERVE(apm_bios_info),
> +			BOOT_PARAM_PRESERVE(tboot_addr),
> +			BOOT_PARAM_PRESERVE(ist_info),
> +			BOOT_PARAM_PRESERVE(acpi_rsdp_addr),
> +			BOOT_PARAM_PRESERVE(hd0_info),
> +			BOOT_PARAM_PRESERVE(hd1_info),
> +			BOOT_PARAM_PRESERVE(sys_desc_table),
> +			BOOT_PARAM_PRESERVE(olpc_ofw_header),
> +			BOOT_PARAM_PRESERVE(efi_info),
> +			BOOT_PARAM_PRESERVE(alt_mem_k),
> +			BOOT_PARAM_PRESERVE(scratch),
> +			BOOT_PARAM_PRESERVE(e820_entries),
> +			BOOT_PARAM_PRESERVE(eddbuf_entries),
> +			BOOT_PARAM_PRESERVE(edd_mbr_sig_buf_entries),
> +			BOOT_PARAM_PRESERVE(edd_mbr_sig_buffer),
> +			BOOT_PARAM_PRESERVE(e820_table),
> +			BOOT_PARAM_PRESERVE(eddbuf),
> +		};
> +
> +		memset(&scratch, 0, sizeof(scratch));
> +
> +		for (i = 0; i < ARRAY_SIZE(to_save); i++)
> +			memcpy(save_base + to_save[i].start,
> +			       bp_base + to_save[i].start, to_save[i].len);
> +
> +		memcpy(boot_params, save_base, sizeof(*boot_params));
>  	}
>  }
>
>
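For readers who want to try the idea outside the kernel, here is a minimal user-space sketch of the same "save the listed fields, zero everything else" approach. It is an illustration under stated assumptions, not John's patch: struct params, its field names, the PRESERVE() macro and sanitize_params() are all hypothetical stand-ins for struct boot_params and the kernel helpers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct boot_params; the fields are illustrative. */
struct params {
	unsigned int keep_a;
	unsigned char junk1[12];
	unsigned short keep_b;
	unsigned char junk2[6];
	unsigned long long keep_c;
	unsigned char junk3[16];
};

/* One (offset, length) pair per field that must survive sanitizing. */
struct preserve_range {
	size_t start;
	size_t len;
};

#define PRESERVE(member)						\
	{								\
		.start = offsetof(struct params, member),		\
		.len = sizeof(((struct params *)0)->member),		\
	}

static const struct preserve_range to_save[] = {
	PRESERVE(keep_a),
	PRESERVE(keep_b),
	PRESERVE(keep_c),
};

/* Zero every byte of *p except the fields listed in to_save[]. */
static void sanitize_params(struct params *p)
{
	struct params scratch;
	size_t i;

	/* Start from an all-zero copy ... */
	memset(&scratch, 0, sizeof(scratch));

	/* ... copy only the preserved fields into it ... */
	for (i = 0; i < sizeof(to_save) / sizeof(to_save[0]); i++)
		memcpy((char *)&scratch + to_save[i].start,
		       (const char *)p + to_save[i].start,
		       to_save[i].len);

	/* ... and write the result back over the original. */
	memcpy(p, &scratch, sizeof(*p));
}
```

Every memset and memcpy here addresses the whole object through a char pointer plus an offset, so no call spans from one member into the next. Calling sanitize_params() on a struct filled with garbage should leave keep_a, keep_b and keep_c intact and every other byte zero.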
Re: Warnings whilst building 5.2.0+
On 09/07/2019 12:39, Chris Clayton wrote:
>
>
> On 09/07/2019 11:37, Enrico Weigelt, metux IT consult wrote:
>> On 09.07.19 08:06, Chris Clayton wrote:
>>
>> Hi,
>>
>>> I've pulled Linus' tree this morning and, after running 'make oldconfig',
>>> tried a build. During that build I got the
>>> following warnings, which look to me like they should be fixed. 'git
>>> describe' shows v5.2-915-g5ad18b2e60b7 and my
>>> compiler is the 20190706 snapshot of gcc 9.
>>
>> Thanks for the report. I'm rebuilding right now anyways, so I'll look
>> out for it.
>
> Thanks for the reply.
>
>>> In file included from arch/x86/kernel/head64.c:35:
>>> In function 'sanitize_boot_params',
>>>     inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
>>> ./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset
>>> [197, 448] from the object at 'boot_params' is
>>> out of the bounds of referenced subobject 'ext_ramdisk_image' with type
>>> 'unsigned int' at offset 192 [-Warray-bounds]
>>>    40 |   memset(&boot_params->ext_ramdisk_image, 0,
>>>       |   ^~
>>>    41 |          (char *)&boot_params->efi_info -
>>>       |
>>>    42 |    (char *)&boot_params->ext_ramdisk_image);
>>>       |
>>> ./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset
>>> [493, 497] from the object at 'boot_params' is
>>> out of the bounds of referenced subobject 'kbd_status' with type 'unsigned
>>> char' at offset 491 [-Warray-bounds]
>>>    43 |   memset(&boot_params->kbd_status, 0,
>>>       |   ^~~
>>>    44 |          (char *)&boot_params->hdr -
>>>       |          ~~~
>>>    45 |          (char *)&boot_params->kbd_status);
>>>       |          ~
>>
>> Can you check older versions, too ? Maybe also trying older gcc ?
>>
>
> I see the same warnings building linux-5.2.0 with gcc9. However, I don't see
> the warnings building linux-5.2.0 with
> the 20190705 snapshot of gcc8. So the warnings could result from an
> improvement (i.e. the problem was in the kernel, but
> undiscovered by gcc8) or from a regression in gcc9.
>

From the discussion starting at
https://marc.info/?l=linux-kernel&m=156401014023908, it would appear that the
problem is undiscovered by gcc8.

Building a fresh pull of Linus' tree this morning
(v5.3-rc3-282-g33920f1ec5bf), I see that the warnings are still being emitted.

Adding the participants in the other discussion to this one.

>>
>> --mtx
>>
Re: Warnings whilst building 5.2.0+
On 09/07/2019 11:37, Enrico Weigelt, metux IT consult wrote:
> On 09.07.19 08:06, Chris Clayton wrote:
>
> Hi,
>
>> I've pulled Linus' tree this morning and, after running 'make oldconfig',
>> tried a build. During that build I got the
>> following warnings, which look to me like they should be fixed. 'git
>> describe' shows v5.2-915-g5ad18b2e60b7 and my
>> compiler is the 20190706 snapshot of gcc 9.
>
> Thanks for the report. I'm rebuilding right now anyways, so I'll look
> out for it.

Thanks for the reply.

>> In file included from arch/x86/kernel/head64.c:35:
>> In function 'sanitize_boot_params',
>>     inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
>> ./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset
>> [197, 448] from the object at 'boot_params' is
>> out of the bounds of referenced subobject 'ext_ramdisk_image' with type
>> 'unsigned int' at offset 192 [-Warray-bounds]
>>    40 |   memset(&boot_params->ext_ramdisk_image, 0,
>>       |   ^~
>>    41 |          (char *)&boot_params->efi_info -
>>       |
>>    42 |    (char *)&boot_params->ext_ramdisk_image);
>>       |
>> ./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset
>> [493, 497] from the object at 'boot_params' is
>> out of the bounds of referenced subobject 'kbd_status' with type 'unsigned
>> char' at offset 491 [-Warray-bounds]
>>    43 |   memset(&boot_params->kbd_status, 0,
>>       |   ^~~
>>    44 |          (char *)&boot_params->hdr -
>>       |          ~~~
>>    45 |          (char *)&boot_params->kbd_status);
>>       |          ~
>
> Can you check older versions, too ? Maybe also trying older gcc ?
>

I see the same warnings building linux-5.2.0 with gcc9. However, I don't see
the warnings building linux-5.2.0 with the 20190705 snapshot of gcc8. So the
warnings could result from an improvement (i.e. the problem was in the kernel,
but undiscovered by gcc8) or from a regression in gcc9.

>
> --mtx
>
Warnings whilst building 5.2.0+
Hi,

I've pulled Linus' tree this morning and, after running 'make oldconfig',
tried a build. During that build I got the following warnings, which look to
me like they should be fixed. 'git describe' shows v5.2-915-g5ad18b2e60b7 and
my compiler is the 20190706 snapshot of gcc 9.

In file included from arch/x86/kernel/head64.c:35:
In function 'sanitize_boot_params',
    inlined from 'copy_bootdata' at arch/x86/kernel/head64.c:391:2:
./arch/x86/include/asm/bootparam_utils.h:40:3: warning: 'memset' offset [197, 448] from the object at 'boot_params' is
out of the bounds of referenced subobject 'ext_ramdisk_image' with type 'unsigned int' at offset 192 [-Warray-bounds]
   40 |   memset(&boot_params->ext_ramdisk_image, 0,
      |   ^~
   41 |          (char *)&boot_params->efi_info -
      |
   42 |    (char *)&boot_params->ext_ramdisk_image);
      |
./arch/x86/include/asm/bootparam_utils.h:43:3: warning: 'memset' offset [493, 497] from the object at 'boot_params' is
out of the bounds of referenced subobject 'kbd_status' with type 'unsigned char' at offset 491 [-Warray-bounds]
   43 |   memset(&boot_params->kbd_status, 0,
      |   ^~~
   44 |          (char *)&boot_params->hdr -
      |          ~~~
   45 |          (char *)&boot_params->kbd_status);
      |          ~

Happy to test any patches, but please cc me as I'm not subscribed to LKML.

Chris
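The warnings above come from a memset() whose destination is the address of one small member but whose length runs on into the members that follow it. One way to express the same intent is to compute the range from the start of the whole object, as in the sketch below. This is only an illustration: struct demo and clear_first_to_last() are hypothetical, the layout is not that of struct boot_params, and I am assuming (not asserting) that gcc's subobject bounds check is not triggered when the pointer is derived from the enclosing struct.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, not the real struct boot_params. */
struct demo {
	unsigned int first;		/* start of the range to clear */
	unsigned char middle[8];
	unsigned short last;		/* last field to clear */
	unsigned int untouched;		/* must keep its value */
};

/*
 * Clear every byte from the start of 'first' up to, but not including,
 * 'untouched'. The destination pointer is derived from the whole object,
 * so the range is described relative to struct demo rather than to the
 * 4-byte member 'first'.
 */
static void clear_first_to_last(struct demo *d)
{
	const size_t start = offsetof(struct demo, first);
	const size_t end = offsetof(struct demo, untouched);

	memset((char *)d + start, 0, end - start);
}
```

The range also covers any padding the compiler inserts between 'last' and 'untouched', which is harmless here because padding carries no data.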
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11/10/2018 13:23, Maciej S. Szmigiero wrote:
> On 11.10.2018 10:24, Chris Clayton wrote:
>> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>>> Now, knowing the "right" value you can experiment with what
>>>>> rtl_init_rxcfg()
>>>>> writes (under the "default:" label for your NIC model).
>>>>>
>>>>
>>>> This might be more interesting. Through a combination of viewing the
>>>> output from pr_notice() and the output from
>>>> "ethtool -d", I can see RxConfig with the following values
>>>>
>>>>     During boot:    0x00028700
>>>>     Before suspend: 0x0002870e
>>>>     During resume:  0x00024000
>>>>     Post resume:    0x0002870e
>>>>
>>>> As I did with 4.18.10 early on in the process, I removed the call to
>>>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>>>> installed and rebooted. Now I see the following values:
>>>>
>>>>     During boot:    0x00028700
>>>>     Before suspend: 0x0002870e
>>>>     During resume:  0x00024000
>>>>     Post resume:    0x0002400e
>>>>
>>>
>>> Now we can finally see some difference...
>>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>>> is kind of expected - one can see that the working configuration
>>> post-resume has bit 14 (or 0x4000) set, too.
>>>
>>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>>
>>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>>> change:
>>> --- r8169.c
>>> +++ r8169.c
>>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>>  	case RTL_GIGA_MAC_VER_34:
>>>  	case RTL_GIGA_MAC_VER_35:
>>> +	case RTL_GIGA_MAC_VER_38:
>>>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN |
>>> RX_DMA_BURST);
>>>  		break;
>>>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>>
>>> This will add RX_MULTI_EN also for your chip model (you need to add back
>>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>>
>>
>> That's done the trick. With the above change applied, my network runs
>> fine after a suspend/resume cycle and the
>> ping times are back in the 14-15ms range.
>
> Nice!
>
> I will submit a patch, it would be great if you could test it and then
> add a "Tested-by:" tag.
>

Will do, Maciej. Thanks for solving this.

>> Chris
>
> Maciej
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
> On 11.10.2018 00:49, Chris Clayton wrote:
>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>> writes (under the "default:" label for your NIC model).
>>>
>>
>> This might be more interesting. Through a combination of viewing the output
>> from pr_notice() and the output from
>> "ethtool -d", I can see RxConfig with the following values
>>
>>     During boot:    0x00028700
>>     Before suspend: 0x0002870e
>>     During resume:  0x00024000
>>     Post resume:    0x0002870e
>>
>> As I did with 4.18.10 early on in the process, I removed the call to
>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>> installed and rebooted. Now I see the following values:
>>
>>     During boot:    0x00028700
>>     Before suspend: 0x0002870e
>>     During resume:  0x00024000
>>     Post resume:    0x0002400e
>>
>
> Now we can finally see some difference...
> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
> is kind of expected - one can see that the working configuration
> post-resume has bit 14 (or 0x4000) set, too.
>
> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>
> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
> change:
> --- r8169.c
> +++ r8169.c
> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>  	case RTL_GIGA_MAC_VER_34:
>  	case RTL_GIGA_MAC_VER_35:
> +	case RTL_GIGA_MAC_VER_38:
>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN |
> RX_DMA_BURST);
>  		break;
>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>
> This will add RX_MULTI_EN also for your chip model (you need to add back
> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>

That's done the trick. With the above change applied, my network runs fine
after a suspend/resume cycle and the ping times are back in the 14-15ms range.

Chris

> If this does not help then I would try other values in the above write:
> 1) RTL_W32(tp, RxConfig, 0x00024000);
> 2) RTL_W32(tp, RxConfig, 0x4000);
> 3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
> 4) RTL_W32(tp, RxConfig, RX128_INT_EN);
>
>> Chris
>
> Maciej
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but
>>>>> tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a
>>>>> resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost
>>>>> immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so
>>>>> quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre
>>> and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is
>>> 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from
>> rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
>

This change made no difference. Networking still dies if I open a browser or
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		  (unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
>

This might be more interesting. Through a combination of viewing the output
from pr_notice() and the output from "ethtool -d", I can see RxConfig with the
following values

    During boot:    0x00028700
    Before suspend: 0x0002870e
    During resume:  0x00024000
    Post resume:    0x0002870e

As I did with 4.18.10 early on in the process, I removed the call to
rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now
I see the following values:

    During boot:    0x00028700
    Before suspend: 0x0002870e
    During resume:  0x00024000
    Post resume:    0x0002400e

As with 4.18.10, networking now appears to be stable after the resume.
Starting a browser results in my homepage being displayed and I've spent a few
minutes surfing with no interruptions. Similarly, ping runs without stopping.

I simply don't know enough to know what might now be enabled or disabled by
this change in value, but hopefully it will provide a clue to someone as to
what is going on.

Chris

> Hope this helps,
> Maciej
>
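The two post-resume readings (0x0002870e when networking fails, 0x0002400e when it works) differ in only a handful of bits, and XOR-ing them makes that visible. The sketch below is a small user-space illustration, not driver code: the bit names and positions are the ones Maciej quotes from the driver elsewhere in this thread, and rxconfig_diff() is a hypothetical helper.

```c
#include <stdint.h>

/* Bit names and positions as quoted from the r8169 driver in this thread. */
#define RX128_INT_EN	(1u << 15)	/* 0x8000 */
#define RX_MULTI_EN	(1u << 14)	/* 0x4000 */
#define RX_DMA_BURST	(7u << 8)	/* bits 8-10, 0x700 */

/* Return the RxConfig bits that differ between two register snapshots. */
static uint32_t rxconfig_diff(uint32_t a, uint32_t b)
{
	return a ^ b;
}
```

For the values reported above, rxconfig_diff(0x0002870e, 0x0002400e) is 0xc700, that is RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST. The failing value has RX128_INT_EN and RX_DMA_BURST set and RX_MULTI_EN clear; the working value has only RX_MULTI_EN of the three.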
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Too late at night to be doing this stuff. Clicked send instead of saving a draft. Sorry, please ignore. On 10/10/2018 23:30, Chris Clayton wrote: > OK, right kernel/module used this time. Please see findings below. > > On 10/10/2018 01:24, Maciej S. Szmigiero wrote: >> On 09.10.2018 22:36, Heiner Kallweit wrote: >>> On 09.10.2018 16:40, Chris Clayton wrote: >>>> Thanks to Maciej and Heiner for their replies. >>>> >>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>>>> On 07.10.2018 21:36, Chris Clayton wrote: >>>>>> Hi again, >>>>>> >>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, >>>>>> but tried it anyway. I can confirm that the >>>>>> regression is still present and my network still fails when, after a >>>>>> resume from suspend (to ram or disk), I open my >>>>>> browser or my mail client. In both those cases the failure is almost >>>>>> immediate - e.g. my home page doesn't get displayed >>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite >>>>>> so quickly but the reported time increases from >>>>>> 14-15ms to more than 1000ms. >>>>> >>>>> You can try comparing chip registers (ethtool -d eth0) in the working >>>>> state (before a suspend) and in the broken state (after a resume). >>>>> Maybe there will be some obvious in the difference. >>>>> >>>>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>>>> >>>> Maciej suggested comparing the output from lspci -vv for the ethernet >>>> device. They are identical. >>>> >>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" >>>> pre and post suspend. Again, they are identical. >>>> Heiner specifically suggested looking at the RxConfig. The value of that >>>> is 0x0002870e both pre and post suspend. >>>> >>> Hmm, this is very weird, especially taking into account that in your >>> original >>> report you state that removing the call to rtl_init_rxcfg() from >>> rtl_hw_start() >>> fixes the issue. 
rtl_init_rxcfg() deals with the RxConfig register only and >>> register values seem to be the same before and after resume. So how can the >>> chip behave differently? >>> So far my best guess is that some chip quirk causes it to accept writes to >>> register RxConfig, but to misinterpret or ignore the written value. >>> So far your report is the only one (affecting RTL8411), but we don't know >>> whether other chip versions are affected too. >> >> Also, it is interesting that even if one removes a call to >> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get >> written to moments later by rtl_set_rx_mode(). >> >> The only chip accesses in the meantime seems to be a write to TxConfig by >> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes >> to MAR0 earlier in rtl_set_rx_mode(). >> >> My proposals are: >> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);" >> in rtl_hw_start(). >> Maybe the chip does not like sometimes that RxConfig is written before >> TxConfig. >> > > This change made no difference. Networking still dies if I open a browser or > leave ping running long enough. > >> 2) Check the original value of RxConfig (after a resume) before >> rtl_init_rxcfg() overwrites it (compile tested only): >> --- r8169.c.ori >> +++ r8169.c >> @@ -5155,6 +5155,9 @@ >> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */ >> RTL_R8(tp, IntrMask); >> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb); >> + >> +pr_notice("RxConfig before init was %.8x\n", >> +(unsigned int)RTL_R32(tp, RxConfig)); >> rtl_init_rxcfg(tp); >> rtl_set_tx_config_registers(tp); >> >> >> This should be the value that you got when you removed the call to >> rtl_init_rxcfg() for testing. >> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg() >> writes (under the "default:" label for your NIC model). > > This might be more interesting. 
> Through a combination of viewing the output from pr_notice() and the output
> from "ethtool -d", I can see RxConfig with the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002870e
>
> I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
> installed and rebooted. Now I see the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002870e
>
>>
>> Hope this helps,
>> Maciej
>>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
OK, right kernel/module used this time. Please see findings below. On 10/10/2018 01:24, Maciej S. Szmigiero wrote: > On 09.10.2018 22:36, Heiner Kallweit wrote: >> On 09.10.2018 16:40, Chris Clayton wrote: >>> Thanks to Maciej and Heiner for their replies. >>> >>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>>> On 07.10.2018 21:36, Chris Clayton wrote: >>>>> Hi again, >>>>> >>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but >>>>> tried it anyway. I can confirm that the >>>>> regression is still present and my network still fails when, after a >>>>> resume from suspend (to ram or disk), I open my >>>>> browser or my mail client. In both those cases the failure is almost >>>>> immediate - e.g. my home page doesn't get displayed >>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so >>>>> quickly but the reported time increases from >>>>> 14-15ms to more than 1000ms. >>>> >>>> You can try comparing chip registers (ethtool -d eth0) in the working >>>> state (before a suspend) and in the broken state (after a resume). >>>> Maybe there will be some obvious in the difference. >>>> >>>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>>> >>> Maciej suggested comparing the output from lspci -vv for the ethernet >>> device. They are identical. >>> >>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre >>> and post suspend. Again, they are identical. >>> Heiner specifically suggested looking at the RxConfig. The value of that is >>> 0x0002870e both pre and post suspend. >>> >> Hmm, this is very weird, especially taking into account that in your original >> report you state that removing the call to rtl_init_rxcfg() from >> rtl_hw_start() >> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >> register values seem to be the same before and after resume. So how can the >> chip behave differently? 
>> So far my best guess is that some chip quirk causes it to accept writes to >> register RxConfig, but to misinterpret or ignore the written value. >> So far your report is the only one (affecting RTL8411), but we don't know >> whether other chip versions are affected too. > > Also, it is interesting that even if one removes a call to > rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get > written to moments later by rtl_set_rx_mode(). > > The only chip accesses in the meantime seems to be a write to TxConfig by > rtl_set_tx_config_registers() and then a read of RxConfig plus two writes > to MAR0 earlier in rtl_set_rx_mode(). > > My proposals are: > 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);" > in rtl_hw_start(). > Maybe the chip does not like sometimes that RxConfig is written before > TxConfig. > This change made no difference. Networking still dies if I open a browser or leave ping running long enough. > 2) Check the original value of RxConfig (after a resume) before > rtl_init_rxcfg() overwrites it (compile tested only): > --- r8169.c.ori > +++ r8169.c > @@ -5155,6 +5155,9 @@ > /* Initially a 10 us delay. Turned it into a PCI commit. - FR */ > RTL_R8(tp, IntrMask); > RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb); > + > + pr_notice("RxConfig before init was %.8x\n", > + (unsigned int)RTL_R32(tp, RxConfig)); > rtl_init_rxcfg(tp); > rtl_set_tx_config_registers(tp); > > > This should be the value that you got when you removed the call to > rtl_init_rxcfg() for testing. > Now, knowing the "right" value you can experiment with what rtl_init_rxcfg() > writes (under the "default:" label for your NIC model). This might be more interesting. 
Through a combination of viewing the output from pr_notice() and the output from "ethtool -d", I can see RxConfig with the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now I see the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

>
> Hope this helps,
> Maciej
>
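As a rough aid for reading the RxConfig values reported above, the following Python sketch XORs each reading against the pre-suspend value and lists the bit positions that differ. It is not part of the thread: the helper name differing_bits and the readings dictionary are made up for illustration, and the sketch deliberately assigns no meaning to individual bits, since that would need the register definitions in r8169.c.

```python
# RxConfig readings as reported in the message above.
readings = {
    "During boot": 0x00028700,
    "Before suspend": 0x0002870E,
    "During resume": 0x00024000,
    "Post resume": 0x0002870E,
}


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(32) if diff & (1 << bit)]


baseline = readings["Before suspend"]
for label, value in readings.items():
    bits = differing_bits(baseline, value)
    print(f"{label:15s} 0x{value:08x}  bits differing from pre-suspend: {bits}")
```

Since the same four values were seen with and without the call to rtl_init_rxcfg(), the output is identical for both builds; the only reading that differs widely from the working state is the one taken during resume.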
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I tested the wrong kernel/module to get the results I provided below. That, however, may make the results more interesting because they happened with a virgin rc7 kernel/module. I'll test your proposals properly later. Chris On 10/10/2018 09:09, Chris Clayton wrote: > > > On 10/10/2018 01:24, Maciej S. Szmigiero wrote: >> On 09.10.2018 22:36, Heiner Kallweit wrote: >>> On 09.10.2018 16:40, Chris Clayton wrote: >>>> Thanks to Maciej and Heiner for their replies. >>>> >>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>>>> On 07.10.2018 21:36, Chris Clayton wrote: >>>>>> Hi again, >>>>>> >>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, >>>>>> but tried it anyway. I can confirm that the >>>>>> regression is still present and my network still fails when, after a >>>>>> resume from suspend (to ram or disk), I open my >>>>>> browser or my mail client. In both those cases the failure is almost >>>>>> immediate - e.g. my home page doesn't get displayed >>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite >>>>>> so quickly but the reported time increases from >>>>>> 14-15ms to more than 1000ms. >>>>> >>>>> You can try comparing chip registers (ethtool -d eth0) in the working >>>>> state (before a suspend) and in the broken state (after a resume). >>>>> Maybe there will be some obvious in the difference. >>>>> >>>>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>>>> >>>> Maciej suggested comparing the output from lspci -vv for the ethernet >>>> device. They are identical. >>>> >>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" >>>> pre and post suspend. Again, they are identical. >>>> Heiner specifically suggested looking at the RxConfig. The value of that >>>> is 0x0002870e both pre and post suspend. 
>>>> >>> Hmm, this is very weird, especially taking into account that in your >>> original >>> report you state that removing the call to rtl_init_rxcfg() from >>> rtl_hw_start() >>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >>> register values seem to be the same before and after resume. So how can the >>> chip behave differently? >>> So far my best guess is that some chip quirk causes it to accept writes to >>> register RxConfig, but to misinterpret or ignore the written value. >>> So far your report is the only one (affecting RTL8411), but we don't know >>> whether other chip versions are affected too. >> >> Also, it is interesting that even if one removes a call to >> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get >> written to moments later by rtl_set_rx_mode(). >> >> The only chip accesses in the meantime seems to be a write to TxConfig by >> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes >> to MAR0 earlier in rtl_set_rx_mode(). >> >> My proposals are: >> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);" >> in rtl_hw_start(). >> Maybe the chip does not like sometimes that RxConfig is written before >> TxConfig. 
>>
> After testing your first proposal, which made no difference, I found the
> following in the output from dmesg:
>
> [ 761.999468] [ cut here ]
> [ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> [ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
> [ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
> [ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
> [ 761.999504] Hardware name: Notebook W65_67SZ /W65_67SZ , BIOS 1.03.05 02/26/2014
> [ 761.999508] Workqu
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 10/10/2018 01:24, Maciej S. Szmigiero wrote: > On 09.10.2018 22:36, Heiner Kallweit wrote: >> On 09.10.2018 16:40, Chris Clayton wrote: >>> Thanks to Maciej and Heiner for their replies. >>> >>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>>> On 07.10.2018 21:36, Chris Clayton wrote: >>>>> Hi again, >>>>> >>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but >>>>> tried it anyway. I can confirm that the >>>>> regression is still present and my network still fails when, after a >>>>> resume from suspend (to ram or disk), I open my >>>>> browser or my mail client. In both those cases the failure is almost >>>>> immediate - e.g. my home page doesn't get displayed >>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so >>>>> quickly but the reported time increases from >>>>> 14-15ms to more than 1000ms. >>>> >>>> You can try comparing chip registers (ethtool -d eth0) in the working >>>> state (before a suspend) and in the broken state (after a resume). >>>> Maybe there will be some obvious in the difference. >>>> >>>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>>> >>> Maciej suggested comparing the output from lspci -vv for the ethernet >>> device. They are identical. >>> >>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre >>> and post suspend. Again, they are identical. >>> Heiner specifically suggested looking at the RxConfig. The value of that is >>> 0x0002870e both pre and post suspend. >>> >> Hmm, this is very weird, especially taking into account that in your original >> report you state that removing the call to rtl_init_rxcfg() from >> rtl_hw_start() >> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >> register values seem to be the same before and after resume. So how can the >> chip behave differently? 
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seems to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
>

After testing your first proposal, which made no difference, I found the following in the output from dmesg:

[ 761.999468] [ cut here ]
[ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
[ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
[ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
[ 761.999504] Hardware name: Notebook W65_67SZ /W65_67SZ , BIOS 1.03.05 02/26/2014
[ 761.999508] Workqueue: events rtl_task [r8169]
[ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
[ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
[ 761.999513] RSP: 0018:88040f803e98 EFLAGS: 00010282
[ 761.999514] RAX: RBX: RCX: 0006
[ 761.999516] RDX: 0007 RSI: 0096 RDI: 88040f8153d0
[ 761.999517] RBP: 88040ca9a3b8 R08: 813565f0 R09: 034e
[ 761.999517] R10: 0007 R11: R12: 88040ca9a39c
[ 761.999518] R13: 880
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 09/10/2018 22:39, Heiner Kallweit wrote: > On 09.10.2018 16:40, Chris Clayton wrote: >> Thanks to Maciej and Heiner for their replies. >> >> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>> On 07.10.2018 21:36, Chris Clayton wrote: >>>> Hi again, >>>> >>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but >>>> tried it anyway. I can confirm that the >>>> regression is still present and my network still fails when, after a >>>> resume from suspend (to ram or disk), I open my >>>> browser or my mail client. In both those cases the failure is almost >>>> immediate - e.g. my home page doesn't get displayed >>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so >>>> quickly but the reported time increases from >>>> 14-15ms to more than 1000ms. >>> >>> You can try comparing chip registers (ethtool -d eth0) in the working >>> state (before a suspend) and in the broken state (after a resume). >>> Maybe there will be some obvious in the difference. >>> >>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>> >> Maciej suggested comparing the output from lspci -vv for the ethernet >> device. They are identical. >> >> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre >> and post suspend. Again, they are identical. >> Heiner specifically suggested looking at the RxConfig. The value of that is >> 0x0002870e both pre and post suspend. >> >> I've attached files I redirected the outputs to. >> >> Please don't hesitate to ask for any other information needed to solve this >> problem. In the meantime, I've now got >> scripts that stop the network during suspend and restart it during resume. >> (Those scripts were removed whilst I gathered >> the diagnostics shown in the attachments.) >> > I'd like to check whether it may be a timing issue. The following > experimental patch > adds a PCI commit after writing register ChipCmd. Could you please check > whether > it changes anything? 
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 7d3f671e1..f3c359492 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +	RTL_R8(tp, ChipCmd);
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>
>

Sorry, this patch doesn't make any difference - my network still fails.

After a suspend/resume my browsers (chromium and firefox) both fail to open my home page (https://www.google.co.uk). The ping time for one of my ISP's name servers increases from 14-15ms to more than 1000ms, although after a few pings it does reduce. As the screen grab below shows, the network does eventually fail:

$ ping NS1
PING ns1 (90.207.238.97): 56 data bytes
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
64 bytes from 90.207.238.97: icmp_seq=8 ttl=251 time=1015.612 ms
64 bytes from 90.207.238.97: icmp_seq=10 ttl=251 time=1015.551 ms
64 bytes from 90.207.238.97: icmp_seq=12 ttl=251 time=1015.446 ms
64 bytes from 90.207.238.97: icmp_seq=13 ttl=251 time=1015.657 ms
64 bytes from 90.207.238.97: icmp_seq=14 ttl=251 time=1015.614 ms
64 bytes from 90.207.238.97: icmp_seq=15 ttl=251 time=1015.651 ms
64 bytes from 90.207.238.97: icmp_seq=17 ttl=251 time=1015.459 ms
64 bytes from 90.207.238.97: icmp_seq=18 ttl=251 time=1015.443 ms
64 bytes from 90.207.238.97: icmp_seq=19 ttl=251 time=1015.936 ms
64 bytes from 90.207.238.97: icmp_seq=20 ttl=251 time=1015.681 ms
64 bytes from 90.207.238.97: icmp_seq=22 ttl=251 time=1015.410 ms
64 bytes from 90.207.238.97: icmp_seq=23 ttl=251 time=1015.487 ms
64 bytes from 90.207.238.97: icmp_seq=24 ttl=251 time=1016.169 ms
64 bytes from 90.207.238.97: icmp_seq=25 ttl=251 time=1015.659 ms
64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
64 bytes from 90.207.238.97: icmp_seq=30 ttl=251 time=32.765 ms
64 bytes from 90.207.238.97: icmp_seq=31 ttl=251 time=115.052 ms
64 bytes from 90.207.238.97: icmp_seq=33 ttl=251 time=757.115 ms
64 bytes from 90.207.238.97
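For anyone wanting to quantify the ping behaviour above rather than eyeball it, the following is a minimal sketch, not something used in this thread. It assumes the reply-line format shown in the capture (`icmp_seq=N ttl=T time=X ms`); the function name `summarize_ping` and the 1000 ms threshold are illustrative choices only.

```python
import re

# Matches reply lines such as:
#   64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
REPLY_RE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def summarize_ping(output, slow_ms=1000.0):
    """Return (missing_seqs, slow_seqs) for a block of ping output.

    missing_seqs: sequence numbers with no reply, counted from 0 up to
                  the highest sequence number that did get a reply.
    slow_seqs:    sequence numbers whose round-trip time is >= slow_ms.
    """
    replies = [(int(seq), float(rtt)) for seq, rtt in REPLY_RE.findall(output)]
    if not replies:
        return [], []
    seen = {seq for seq, _ in replies}
    last = max(seen)
    missing = [seq for seq in range(last + 1) if seq not in seen]
    slow = [seq for seq, rtt in replies if rtt >= slow_ms]
    return missing, slow
```

Feeding it the saved output of `ping` before and after a resume would give two comparable pairs of lists; truncated or non-reply lines are simply ignored by the regular expression.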
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Thanks to Maciej and Heiner for their replies. On 09/10/2018 13:32, Maciej S. Szmigiero wrote: > On 07.10.2018 21:36, Chris Clayton wrote: >> Hi again, >> >> I didn't think there was anything in 4.19-rc7 to fix this regression, but >> tried it anyway. I can confirm that the >> regression is still present and my network still fails when, after a resume >> from suspend (to ram or disk), I open my >> browser or my mail client. In both those cases the failure is almost >> immediate - e.g. my home page doesn't get displayed >> in the browser. Pinging one of my ISPs name servers doesn't fail quite so >> quickly but the reported time increases from >> 14-15ms to more than 1000ms. > > You can try comparing chip registers (ethtool -d eth0) in the working > state (before a suspend) and in the broken state (after a resume). > Maybe there will be some obvious in the difference. > > The same goes for the PCI configuration (lspci -d :8168 -vv). > Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical. Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical. Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend. I've attached files I redirected the outputs to. Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered the diagnostics shown in the attachments.) 
Chris

>> Chris
>
> Maciej
>

ethtool -d eth0
===
RealTek RTL8411 registers:
0x00: MAC Address 80:fa:5b:08:d0:3d
0x08: Multicast Address Filter 0x 0x0080
0x10: Dump Tally Counter Command 0x0c2ec000 0x0004
0x20: Tx Normal Priority Ring Addr 0x07a0a000 0x0004
0x28: Tx High Priority Ring Addr 0x 0x
0x30: Flash memory read/write 0x
0x34: Early Rx Byte Count 0
0x36: Early Rx Status 0x00
0x37: Command 0x0c
      Rx on, Tx on
0x3C: Interrupt Mask 0x803f
      SERR LinkChg RxNoBuf TxErr TxOK RxErr RxOK
0x3E: Interrupt Status 0x
0x40: Tx Configuration 0x4b800f80
0x44: Rx Configuration 0x0002870e
0x48: Timer count 0x
0x4C: Missed packet counter 0x00
0x50: EEPROM Command 0x10
0x51: Config 0 0x00
0x52: Config 1 0xcf
0x53: Config 2 0x3c
0x54: Config 3 0x60
0x55: Config 4 0x10
0x56: Config 5 0x02
0x58: Timer interrupt 0x
0x5C: Multiple Interrupt Select 0x
0x60: PHY access 0x80040de1
0x64: TBI control and status 0x2701
0x68: TBI Autonegotiation advertisement (ANAR) 0xf70c
0x6A: TBI Link partner ability (LPAR) 0x0002
0x6C: PHY status 0xeb
0x84: PM wakeup frame 0 0x 0x
0x8C: PM wakeup frame 1 0x 0x
0x94: PM wakeup frame 2 (low) 0x 0x
0x9C: PM wakeup frame 2 (high) 0x 0x
0xA4: PM wakeup frame 3 (low) 0x 0x
0xAC: PM wakeup frame 3 (high) 0x 0x
0xB4: PM wakeup frame 4 (low) 0x 0x
0xBC: PM wakeup frame 4 (high) 0x 0x
0xC4: Wakeup frame 0 CRC 0x
0xC6: Wakeup frame 1 CRC 0x
0xC8: Wakeup frame 2 CRC 0x
0xCA: Wakeup frame 3 CRC 0x
0xCC: Wakeup frame 4 CRC 0x
0xDA: RX packet maximum size 0x4000
0xE0: C+ Command 0x20e1
      VLAN de-tagging RX checksumming
0xE2: Interrupt Mitigation 0x5151
      TxTimer: 5 TxPackets: 1 RxTimer: 5 RxPackets: 1
0xE4: Rx Ring Addr 0x07935000 0x0004
0xEC: Early Tx threshold 0x27
0xF0: Func Event 0x0040003f
0xF4: Func Event Mask 0x
0xF8: Func Preset State 0x00031eff
0xFC: Func Force Event
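The pre/post-suspend comparison of `ethtool -d` output described above can also be done mechanically. The following is only a sketch under assumptions: it treats each dump as text in which every register entry starts with a two-digit hex offset followed by a colon (as in the attachment), and the helper names `parse_dump` and `diff_dumps` are made up for illustration. Indented continuation lines (flag decodes and the like) are ignored.

```python
import re

# Start of a register entry, e.g. "0x44: Rx Configuration 0x0002870e"
ENTRY_RE = re.compile(r"^(0x[0-9A-Fa-f]{2}):\s*(.*)$")


def parse_dump(text):
    """Map register offset -> remainder of the line, whitespace-normalized."""
    regs = {}
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            regs[m.group(1).lower()] = " ".join(m.group(2).split())
    return regs


def diff_dumps(before, after):
    """Return {offset: (before_entry, after_entry)} for entries that differ.

    An offset present in only one dump is reported with None on the
    other side.
    """
    a, b = parse_dump(before), parse_dump(after)
    return {
        off: (a.get(off), b.get(off))
        for off in sorted(set(a) | set(b))
        if a.get(off) != b.get(off)
    }
```

With the two dumps saved to files (say `before.txt` and `after.txt`, names chosen here only as an example), an empty result from `diff_dumps` corresponds to the "they are identical" finding reported in this message. Plain `diff` on the two files does the same job; the sketch merely keys the comparison by register offset.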
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Hi again, I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from 14-15ms to more than 1000ms. Chris On 04/10/2018 09:41, Chris Clayton wrote: > Hi Heiner, > > Here's the reply to your questions. Sorry for the delay. > > On 28/09/2018 23:13, Heiner Kallweit wrote: >> On 29.09.2018 00:00, Chris Clayton wrote: >>> Thanks Maciej. >>> >>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote: >>>> Hi, >>>> >>>>> Hi, >>>>> >>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing >>>>> network problems after resuming from a >>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK. >>>>> >>>>> The pattern of the problem is that when I first boot, the network is >>>>> fine. But, after resume from suspend I find that >>>>> the time taken for a ping of one of my ISP's nameservers increases from >>>>> 14-15ms to more than 1000ms. Moreover, when I >>>>> open a browser (chromium or firefox), it fails to retrieve my home page >>>>> (https://www.google.co.uk) and pings of the >>>>> nameserver fail with the message "Destination Host Unreachable". Often, I >>>>> can revive the network by stopping it with >>>>> /sbin/if(down,up} but sometimes it is necessary to also remove the r8169 >>>>> module and load it again. >>>> >>>> Please have a look at the following thread: >>>> https://lkml.org/lkml/2018/9/25/1118 >>>> >>> >>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the >>> problem is not solved by it. Similarly, I applied >>> Heiner's patch to the 4.19, but again the problem is not solved. 
>>> >> I think we talk about two different issues here. The one the fix is for has >> no link to suspend/resume. >> >> Chris, the lspci output doesn't provide enough detail to determine the exact >> chip version. >> Can you provide the dmesg part with the XID? > > $ dmesg | grep r8169 > [5.274938] libphy: r8169: probed > [5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID > 48800800, IRQ 29 > [5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, > tx checksumming: ko] > [9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver > [RTL8211E Gigabit Ethernet] > (mii_bus:phy_addr=r8169-502:00, irq=IGNORE) > [9.460876] r8169 :05:00.2 eth0: No native access to PCI extended > config space, falling back to CSI > [ 11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow > control rx/tx > >> According to your lspci output neither MSI nor MSI-X is active. >> Do you have to use nomsi for whatever reason? >> > > No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% > sure that it used to be - I've no idea how > it got dropped. If I'm not sure about an option, I start by taking the > recommendation in the kconfig help. Help on MSI > has a very clear "say Y". I've re-enabled it now. > > Chris > >> Heiner >> >>>> Maciej >>>> >>> Chris >>> >> >>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Hi Heiner, Here's the reply to your questions. Sorry for the delay. On 28/09/2018 23:13, Heiner Kallweit wrote: > On 29.09.2018 00:00, Chris Clayton wrote: >> Thanks Maciej. >> >> On 28/09/2018 16:54, Maciej S. Szmigiero wrote: >>> Hi, >>> >>>> Hi, >>>> >>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing >>>> network problems after resuming from a >>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK. >>>> >>>> The pattern of the problem is that when I first boot, the network is fine. >>>> But, after resume from suspend I find that >>>> the time taken for a ping of one of my ISP's nameservers increases from >>>> 14-15ms to more than 1000ms. Moreover, when I >>>> open a browser (chromium or firefox), it fails to retrieve my home page >>>> (https://www.google.co.uk) and pings of the >>>> nameserver fail with the message "Destination Host Unreachable". Often, I >>>> can revive the network by stopping it with >>>> /sbin/if(down,up} but sometimes it is necessary to also remove the r8169 >>>> module and load it again. >>> >>> Please have a look at the following thread: >>> https://lkml.org/lkml/2018/9/25/1118 >>> >> >> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem >> is not solved by it. Similarly, I applied >> Heiner's patch to the 4.19, but again the problem is not solved. >> > I think we talk about two different issues here. The one the fix is for has > no link to suspend/resume. > > Chris, the lspci output doesn't provide enough detail to determine the exact > chip version. > Can you provide the dmesg part with the XID? 
$ dmesg | grep r8169
[5.274938] libphy: r8169: probed
[5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
>

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI has a very clear "say Y". I've re-enabled it now.

Chris

> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Sorry, sent by accident. Note to self - don't attempt email until after second cup of coffee. On 29/09/2018 08:25, Chris Clayton wrote: > > > On 28/09/2018 23:13, Heiner Kallweit wrote: >> On 29.09.2018 00:00, Chris Clayton wrote: >>> Thanks Maciej. >>> >>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote: >>>> Hi, >>>> >>>>> Hi, >>>>> >>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing >>>>> network problems after resuming from a >>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK. >>>>> >>>>> The pattern of the problem is that when I first boot, the network is >>>>> fine. But, after resume from suspend I find that >>>>> the time taken for a ping of one of my ISP's nameservers increases from >>>>> 14-15ms to more than 1000ms. Moreover, when I >>>>> open a browser (chromium or firefox), it fails to retrieve my home page >>>>> (https://www.google.co.uk) and pings of the >>>>> nameserver fail with the message "Destination Host Unreachable". Often, I >>>>> can revive the network by stopping it with >>>>> /sbin/if(down,up} but sometimes it is necessary to also remove the r8169 >>>>> module and load it again. >>>> >>>> Please have a look at the following thread: >>>> https://lkml.org/lkml/2018/9/25/1118 >>>> >>> >>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the >>> problem is not solved by it. Similarly, I applied >>> Heiner's patch to the 4.19, but again the problem is not solved. >>> >> I think we talk about two different issues here. The one the fix is for has >> no link to suspend/resume. >> >> Chris, the lspci output doesn't provide enough detail to determine the exact >> chip version. >> Can you provide the dmesg part with the XID? I meant to say that I have now re-enabled MSI in 4.18.7 - the latest stable series kernel in which eth0 continues to function reliably after a suspend/resume cycle. The second dmesg output below is taken from that kernel. 
The first one was from an up-to-date 4.19 kernel > > $ dmesg | grep -i r8169 > [5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded > [5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM > control > [5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID > 48800800, IRQ 19 > [5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, > tx checksumming: ko] > [ 10.232077] r8169 :05:00.2 eth0: No native access to PCI extended > config space, falling back to CSI > [ 10.235218] r8169 :05:00.2 eth0: link down > [ 11.717460] r8169 :05:00.2 eth0: link up > > $ dmesg | grep -i r8169 > [5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded > [5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM > control > [5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID > 48800800, IRQ 29 > [5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, > tx checksumming: ko] > [ 10.456081] r8169 :05:00.2 eth0: No native access to PCI extended > config space, falling back to CSI > [ 10.459217] r8169 :05:00.2 eth0: link down > [ 10.459880] r8169 :05:00.2 eth0: link down > [ 12.015158] r8169 :05:00.2 eth0: link up > > >> According to your lspci output neither MSI nor MSI-X is active. >> Do you have to use nomsi for whatever reason? > > No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% > sure that it used to be - I've no idea how > it got dropped. If I'm not sure about an option, I start by taking the > recommendation in the kconfig help. Help on MSI > has a very clear "say Y". As I said above I have re-enabled MSI. > >> >> Heiner >> >>>> Maciej >>>> >>> Chris >>> >>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>
> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep -i r8169
[5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
[5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.232077] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.235218] r8169 :05:00.2 eth0: link down
[ 11.717460] r8169 :05:00.2 eth0: link up

$ dmesg | grep -i r8169
[5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.456081] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.459217] r8169 :05:00.2 eth0: link down
[ 10.459880] r8169 :05:00.2 eth0: link down
[ 12.015158] r8169 :05:00.2 eth0: link up

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y".

>
> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
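Since the request was specifically for the XID, here is a small sketch that pulls just that field out of r8169 probe lines like the ones above. The function name extract_xid is hypothetical, and the pattern assumes the "XID <hex>" layout seen in this log; other driver versions may print it differently.

```shell
# Hypothetical helper: print the chip XID from r8169 probe lines on stdin.
# Lines without an "XID <hex>" field are ignored.
extract_xid() {
    sed -n 's/.*XID \([0-9A-Fa-f]\{1,\}\).*/\1/p'
}

# Possible usage on a live system:
#   dmesg | grep -i r8169 | extract_xid
```

Fed the probe line from either log above, this would print 48800800.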
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Thanks Maciej.

On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
> Hi,
>
>> Hi,
>>
>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>
>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>
> Please have a look at the following thread:
> https://lkml.org/lkml/2018/9/25/1118
>

I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
Heiner's patch to the 4.19, but again the problem is not solved.

> Maciej
>

Chris
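The manual recovery described in the quoted report (ifdown/ifup, and sometimes a reload of the r8169 module) could be scripted roughly as below. This is only a sketch: the function names and the DRY_RUN and RELOAD_MODULE variables are invented here, eth0 is the interface name from the dmesg output elsewhere in the thread, and ifdown/ifup behave differently across distributions.

```shell
# Hypothetical helper: echo the command when DRY_RUN=1, otherwise run it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: take the interface down, optionally reload the
# r8169 module (RELOAD_MODULE=1), then bring the interface back up.
revive_network() {
    iface=${1:-eth0}
    run_cmd /sbin/ifdown "$iface"
    if [ "${RELOAD_MODULE:-0}" = 1 ]; then
        run_cmd modprobe -r r8169
        run_cmd modprobe r8169
    fi
    run_cmd /sbin/ifup "$iface"
}

# Possible usage (needs root unless DRY_RUN=1):
#   DRY_RUN=1 RELOAD_MODULE=1 revive_network eth0
```

It is a workaround for getting the link back after resume, not a fix for whatever regressed after 4.18.7.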
Re: [PATCH V2] mfd: rtsx: release IRQ during shutdown
On 03/01/18 12:32, Sinan Kaya wrote:
> 'Commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during
> shutdown")' revealed a resource leak in rtsx_pci driver during shutdown.
>
> Issue shows up as a warning during shutdown as follows:
>
> remove_proc_entry: removing non-empty directory 'irq/17', leaking at least 'rtsx_pci'
> WARNING: CPU: 0 PID: 1578 at fs/proc/generic.c:572 remove_proc_entry+0x11d/0x130
> Modules linked in
> ...
> Call Trace:
> unregister_irq_proc
> free_desc
> irq_free_descs
> mp_unmap_irq
> acpi_unregister_gsi_apic
> acpi_pci_irq_disable
> do_pci_disable_device
> pci_disable_device
> device_shutdown
> kernel_restart
> Sys_reboot
>
> Even though rtsx_pci driver implements a shutdown callback, it is not
> releasing the interrupt that it registered during probe. This is causing
> the ACPI layer to complain that the shared IRQ is in use while freeing
> IRQ.
>
> This code releases the IRQ to prevent resource leak and eliminate the
> warning.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=198141
> Reported-by: Chris Clayton <chris2...@googlemail.com>
> Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
> Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
> ---
> drivers/mfd/rtsx_pcr.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/mfd/rtsx_pcr.c b/drivers/mfd/rtsx_pcr.c
> index 590fb9a..c3ed885 100644
> --- a/drivers/mfd/rtsx_pcr.c
> +++ b/drivers/mfd/rtsx_pcr.c
> @@ -1543,6 +1543,9 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>  		rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>
>  	pci_disable_device(pcidev);
> +	free_irq(pcr->irq, (void *)pcr);
> +	if (pcr->msi_en)
> +		pci_disable_msi(pcr->pci);
>  }
>
>  #else /* CONFIG_PM */

I've applied v2 of the patch and built and installed the kernel (-rc6). All I can say is that my system still closes
down without the warning and call trace that the unpatched kernel produces.

It's the best I can do by way of a test because I have no idea what the code added in v2 is supposed to achieve and,
because my system shuts down (or reboots) moments later, there is no opportunity to check. If that constitutes a valid
test:

Tested-by: Chris Clayton <chris2...@googlemail.com>

>
Re: Oops on 4.15-rc[123] on shutdown/reboot
On 11/12/17 17:17, Bjorn Helgaas wrote:
> [+cc linux-pci]
>
> On Mon, Dec 11, 2017 at 11:29:50AM -0500, Sinan Kaya wrote:
>> Hi Chris,
>>
>>>
>>> I'm more than happy to provide additional diagnostics and test proposed fixes. As a starter for ten, I've attached the
>>> output from 'lspci -v'. If, however, you need to see the backtrace, I'll need some advice on how to capture that.
>>>
>>
>> Can you open a bugzilla and also share the boot log?
>>
>> There must be something unique about your system.
>
> Can you attach "lspci -vv" output (as root) to the bugzilla, too?
>

I've opened the bugzilla report (Bug 198141) and attached the dmesg and lspci -vv outputs to it.
Re: Oops on 4.15-rc[123] on shutdown/reboot
On 11/12/17 17:24, Sinan Kaya wrote:
> On 12/11/2017 12:06 PM, Chris Clayton wrote:
>> Here's the output of dmesg for 4.15.0-rc3. I'll open a bugzilla later and add this and the lspci output that I sent with
>> my original report.
>
> This was helpful. I don't see any AER/DPC in your log. It looks like the only PCIe
> portdrv service you have is PME.
>
> Can we do a quick hack and return immediately from
>
> static int pcie_pme_probe(struct pcie_device *srv)
>
> by putting return 0; at the top.
>
> Same thing in
>
> static void pcie_pme_remove(struct pcie_device *srv)
>
> just place a return at the top.
>

I made those changes (to drivers/pci/pcie/pme.c) and built and installed the kernel. Sorry, but I still get the oops
when I reboot.

> I'm hoping your problem will go away after this. Then, we can start peeling the onion.
>
Re: Oops on 4.15-rc[123] on shutdown/reboot
On 11/12/17 16:29, Sinan Kaya wrote:
> Hi Chris,
>
>>
>> I'm more than happy to provide additional diagnostics and test proposed fixes. As a starter for ten, I've attached the
>> output from 'lspci -v'. If, however, you need to see the backtrace, I'll need some advice on how to capture that.
>>
>
> Can you open a bugzilla and also share the boot log?
>

Here's the output of dmesg for 4.15.0-rc3. I'll open a bugzilla later and add this and the lspci output that I sent
with my original report.

> There must be something unique about your system.
>
> Sinan
>

[0.00] Linux version 4.15.0-rc3 (chris@laptop) (gcc version 7.2.1 20171207 (GCC)) #398 SMP PREEMPT Mon Dec 11 07:46:20 GMT 2017
[0.00] Command line: ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x0100-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable [0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS [0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable [0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved [0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable [0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved [0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable [0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS [0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved [0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable [0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] random: fast init done [0.00] SMBIOS 2.7 present. 
[0.00] DMI: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014 [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-C write-protect [0.00] D-E7FFF uncachable [0.00] E8000-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask 7C write-back [0.00] 1 base 04 mask 7FE000 write-back [0.00] 2 base 00E000 mask 7FE000 uncachable [0.00] 3 base 00DE00 mask 7FFE00 uncachable [0.00] 4 base 00DD00 mask 7FFF00 uncachable [0.00] 5 base 041FE0 mask 7FFFE0 uncachable [0.00] 6 disabled [0.00] 7 disabled [0.00] 8 disabled [0.00] 9 disabled [0.00] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT [0.00] e820: update [mem 0xdd00-0x] usable ==> reserved [0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [(ptrval)] [0.00] Base memory trampoline at [(ptrval)] 97000 size 24576 [0.00] Using GB pages for direct mapping [0.00] BRK [0x02098000, 0x02098fff] PGTABLE [0.00] BRK [0x02099000, 0x02099fff] PGTABLE [0.00] BRK [0x0209a000, 0x0209afff] PGTABLE [0.00] BRK [0x0209b000, 0x0209bfff] PGTABLE [0.00] BRK [0x0209c000, 0x0209cfff] PGTABLE [0.00] BRK [0x0209d000, 0x0209dfff] PGTABLE [0.00] ACPI: Early
Oops on 4.15-rc[123] on shutdown/reboot
I've been getting an oops when shutting down my laptop (with /sbin/halt) or rebooting it (/sbin/reboot or
/usr/sbin/kexec). Unfortunately, I can't provide the backtrace because it is on the screen for only a moment before the
system shuts down/reboots. I have, however, bisected it and the outcome is:

cc27b735ad3a75574a6ab1a66ed6b09385e77e5e is the first bad commit
commit cc27b735ad3a75574a6ab1a66ed6b09385e77e5e
Author: Sinan Kaya
Date:   Wed Oct 25 15:01:02 2017 -0400

    PCI/portdrv: Turn off PCIe services during shutdown

    Some of the PCIe services such as AER are being left enabled during
    shutdown. This might cause spurious AER errors while SOC is being
    powered down.

    Clean up the PCIe services gracefully during shutdown to clear these
    false positives.

    Signed-off-by: Sinan Kaya
    Signed-off-by: Bjorn Helgaas

:040000 040000 5a827d6956c581344a0bf392e30155c337673c1d 76c6a39b53604a0a0a370383c3503f80aa7cbc1e M drivers

I'm confident that this is the correct outcome because a kernel built with the preceding commit
(6018182d3158505f11103adaee8ffb53424df986) does not oops. Nor does -rc3 with the patch reversed.

I'm more than happy to provide additional diagnostics and test proposed fixes. As a starter for ten, I've attached the
output from 'lspci -v'. If, however, you need to see the backtrace, I'll need some advice on how to capture that.
Chris 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06) Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller Flags: bus master, fast devsel, latency 0 Capabilities: [e0] Vendor Specific Information: Len=0c 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 16 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: None Memory behind bridge: None Prefetchable memory behind bridge: None Capabilities: [88] Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller Capabilities: [80] Power Management version 3 Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [a0] Express Root Port (Slot+), MSI 00 Kernel driver in use: pcieport 00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06) (prog-if 00 [VGA controller]) Subsystem: CLEVO/KAPOK Computer 4th Gen Core Processor Integrated Graphics Controller Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at f780 (64-bit, non-prefetchable) [size=4M] Memory at e000 (64-bit, prefetchable) [size=256M] I/O ports at f000 [size=64] [virtual] Expansion ROM at 000c [disabled] [size=128K] Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [d0] Power Management version 2 Capabilities: [a4] PCI Advanced Features Kernel driver in use: i915 00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06) Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at f7f14000 (64-bit, non-prefetchable) [size=16K] Capabilities: [50] Power Management version 2 Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit- 
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Kernel driver in use: snd_hda_intel Kernel modules: snd_hda_intel 00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05) (prog-if 30 [XHCI]) Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series Chipset Family USB xHCI Flags: bus master, medium devsel, latency 0, IRQ 16 Memory at f7f0 (64-bit, non-prefetchable) [size=64K] Capabilities: [70] Power Management version 2 Capabilities: [80] MSI: Enable- Count=1/8 Maskable- 64bit+ Kernel driver in use: xhci_hcd 00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04) Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series Chipset Family MEI Controller Flags: bus master, fast devsel, latency 0, IRQ 11 Memory at f7f1e000 (64-bit, non-prefetchable) [size=16] Capabilities: [50] Power Management version 3 Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+ 00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05) (prog-if 20 [EHCI]) Subsystem: CLEVO/KAPOK Computer 8 Series/C220 Series
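For readers unfamiliar with how a "first bad commit" result like the one reported in this message is arrived at, the sketch below runs git bisect on a throwaway repository. Everything in it is made up for illustration (the eight commits, the state file, the function name bisect_demo); it is not the procedure used for the kernel bisection, which needed a manual build, boot and shutdown at every step because the symptom only shows at reboot.

```shell
# Toy demonstration of "git bisect run" on a temporary repository.
# Prints the subject line of the first bad commit.
bisect_demo() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name "Bisect Demo"
        git config user.email "demo@example.invalid"
        git config commit.gpgsign false
        # Eight commits; the made-up regression appears in commit 5.
        for i in 1 2 3 4 5 6 7 8; do
            if [ "$i" -ge 5 ]; then echo bad >state; else echo good >state; fi
            echo "$i" >counter
            git add state counter
            git commit -q -m "commit $i"
        done
        first=$(git rev-list --max-parents=0 HEAD)
        # HEAD is known bad, the root commit is known good.
        git bisect start HEAD "$first" >/dev/null 2>&1
        # The test command exits 0 for "good" and 1 for "bad".
        git bisect run sh -c 'grep -q good state' >/dev/null 2>&1
        git log -1 --format=%s refs/bisect/bad
    )
    status=$?
    rm -rf "$repo"
    return $status
}
```

In the manual case the same search is driven by marking each tested kernel with git bisect good or git bisect bad by hand. Building the commit just before the reported one, and the later kernel with the commit reverted, as described in this message, is a sensible cross-check of the result.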
"PM / QoS: Fix device resume latency PM QoS" breaks sound
Hi, I pulled the latest changes from Linus' tree this evening and have found that with the new kernel, sound is not working on my laptop. More precisely, the built-in speakers don't produce any sound. Sound does work when I use ear-plugs in the headphone socket. It also works via a bluetooth speaker. I've bisected the problem and ended up at: 0cc2b4e5a020fc7f4d1795741c116c983e9467d7 is the first bad commit commit 0cc2b4e5a020fc7f4d1795741c116c983e9467d7 Author: Rafael J. Wysocki Date: Tue Oct 24 15:20:45 2017 +0200 PM / QoS: Fix device resume latency PM QoS The special value of 0 for device resume latency PM QoS means "no restriction", but there are two problems with that. First, device resume latency PM QoS requests with 0 as the value are always put in front of requests with positive values in the priority lists used internally by the PM QoS framework, causing 0 to be chosen as an effective constraint value. However, that 0 is then interpreted as "no restriction" effectively overriding the other requests with specific restrictions which is incorrect. Second, the users of device resume latency PM QoS have no way to specify that *any* resume latency at all should be avoided, which is an artificial limitation in general. To address these issues, modify device resume latency PM QoS to use S32_MAX as the "no constraint" value and 0 as the "no latency at all" one and rework its users (the cpuidle menu governor, the genpd QoS governor and the runtime PM framework) to follow these changes. Also add a special "n/a" value to the corresponding user space I/F to allow user space to indicate that it cannot accept any resume latencies at all for the given device. Fixes: 85dc0b8a4019 (PM / QoS: Make it possible to expose PM QoS latency constraints) Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323 Reported-by: Reinette Chatre Tested-by: Reinette Chatre Signed-off-by: Rafael J.
Wysocki Acked-by: Alex Shi Cc: All applicable :04 04 f0c128ec799bb9894cfc5c341f88ad7bdfb15bac 9a2e8171ca47f864bd534cd9c160cce58449a889 M Documentation :04 04 0028ffec81675e686bdd621c0445d3e814d7980c 29db53c6356a6fed9c8bdbc2d6bc7bd56a96e529 M drivers :04 04 2e66b79bd2ffb4fcb00f04a69a0afe5c80d1d3f3 dd6d8e90b59389cd2bd8a0c92716d79d2eeb8268 M include With that change reverted, the speakers emit sound again. The audio devices identified by "lspci -vv" are as follows: 00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06) Subsystem: CLEVO/KAPOK Computer Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR-
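The first problem described in the quoted commit message can be illustrated with a small stand-alone sketch. This is not the kernel's PM QoS code: the names effective_constraint, latency_allowed_old, latency_allowed_new and NO_CONSTRAINT are made up for illustration, the microsecond unit is an assumption, and the only behaviour taken from the commit message is that the smallest request ends up as the effective constraint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's S32_MAX; not the kernel's own macro. */
#define NO_CONSTRAINT INT32_MAX

/* Assumed aggregation rule: the smallest request becomes the effective value. */
static int32_t effective_constraint(const int32_t *req, size_t n)
{
	int32_t min = NO_CONSTRAINT;

	assert(req != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (req[i] < min)
			min = req[i];
	return min;
}

/* Old meaning: an effective value of 0 is read as "no restriction". */
static int latency_allowed_old(const int32_t *req, size_t n, int32_t latency_us)
{
	int32_t eff = effective_constraint(req, n);

	return eff == 0 || latency_us <= eff;
}

/* New meaning: 0 is "no latency at all", NO_CONSTRAINT is "no constraint". */
static int latency_allowed_new(const int32_t *req, size_t n, int32_t latency_us)
{
	return latency_us <= effective_constraint(req, n);
}
```

With the requests {0, 100}, the old reading accepts a 500 us resume latency even though one requester asked for at most 100. With {NO_CONSTRAINT, 100} under the new reading, the same latency is refused, and a lone request of 0 refuses any non-zero latency.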
Re: [PATCH 4.12 004/106] scsi: sg: fix SG_DXFER_FROM_DEV transfers
On 09/08/17 17:51, Greg Kroah-Hartman wrote: > 4.12-stable review patch. If anyone has any objections, please let me know. > > --- I repeat my comments when the patch was queued for stable: 1. Johannes' commit message says that the transfer must have a length bigger than 0, so the code should return false if the length is less than or equal to 0, but the test is for less than 0. 2. But in any case, there's another patch that removes all this sg_is_valid_dxfer() jiggery-pokery and replaces it with a simpler test. It hasn't reached Linus' tree yet but is, I believe, cc'd to stable. As Johannes said in response to the second of my comments, the patch that replaces sg_is_valid_dxfer() with a simpler test is now in Linus' tree - commit f930c7043663188429cd9b254e9d761edfc101ce. Without that change, I think there is still some breakage in sg. Chris --- > > From: Johannes Thumshirn <jthumsh...@suse.de> > > commit 68c59fcea1f2c6a54c62aa896cc623c1b5bc9b47 upstream. > > SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set > it to NULL for the old sg_io read/write interface, but must have a > length bigger than 0. This fixes a regression introduced by commit > 28676d869bbb ("scsi: sg: check for valid direction before starting the > request") > > Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de> > Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the > request") > Reported-by: Chris Clayton <chris2...@googlemail.com> > Tested-by: Chris Clayton <chris2...@googlemail.com> > Cc: Douglas Gilbert <dgilb...@interlog.com> > Reviewed-by: Hannes Reinecke <h...@suse.com> > Tested-by: Chris Clayton <chris2...@googlemail.com> > Acked-by: Douglas Gilbert <dgilb...@interlog.com> > Signed-off-by: Martin K. 
Petersen <martin.peter...@oracle.com> > Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> > > --- > drivers/scsi/sg.c |5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > --- a/drivers/scsi/sg.c > +++ b/drivers/scsi/sg.c > @@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_ > if (hp->dxferp || hp->dxfer_len > 0) > return false; > return true; > - case SG_DXFER_TO_DEV: > case SG_DXFER_FROM_DEV: > + if (hp->dxfer_len < 0) > + return false; > + return true; > + case SG_DXFER_TO_DEV: > case SG_DXFER_TO_FROM_DEV: > if (!hp->dxferp || hp->dxfer_len == 0) > return false; > >
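The first comment above (the mismatch between the commit message and the test in the hunk) can be made concrete with a minimal sketch. These two helpers are hypothetical and exist only to compare the two conditions; they are not part of sg.c, and the parameter is a plain int purely for illustration, since the type of the real dxfer_len field is not shown in the quoted hunk.

```c
#include <stdbool.h>

/* Condition as written in the quoted hunk: only a negative length is rejected. */
static bool from_dev_len_ok_as_patched(int dxfer_len)
{
	if (dxfer_len < 0)
		return false;
	return true;
}

/* Condition as worded in the commit message: the length must be bigger than 0. */
static bool from_dev_len_ok_as_described(int dxfer_len)
{
	return dxfer_len > 0;
}
```

The two helpers agree for every value except 0, which the patched test accepts and the commit message's wording would reject.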
Re: [PATCH v2] scsi: sg: fix SG_DXFER_FROM_DEV transfers
On 07/07/17 09:56, Johannes Thumshirn wrote: > SG_DXFER_FROM_DEV transfers do not necessarily have a dxferp as we set > it to NULL for the old sg_io read/write interface, but must have a length > bigger than 0. This fixes a regression introduced by commit 28676d869bbb > ("scsi: sg: check for valid direction before starting the request") > I've tested this new patch and the Nero applications can still find the optical drives on my laptop. Tested-by: Chris Clayton <chris2...@googlemail.com> > Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de> > Fixes: 28676d869bbb ("scsi: sg: check for valid direction before starting the > request") > Reported-by: Chris Clayton <chris2...@googlemail.com> > Tested-by: Chris Clayton <chris2...@googlemail.com> > Cc: Douglas Gilbert <dgilb...@interlog.com> > Reviewed-by: Hannes Reinecke <h...@suse.com> > --- > Changes to v1: > * Fix breakage of the sg_io v3 interface, verified using sg_inq > > drivers/scsi/sg.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c > index 21225d62b0c1..1e82d4128a84 100644 > --- a/drivers/scsi/sg.c > +++ b/drivers/scsi/sg.c > @@ -758,8 +758,11 @@ static bool sg_is_valid_dxfer(sg_io_hdr_t *hp) > if (hp->dxferp || hp->dxfer_len > 0) > return false; > return true; > - case SG_DXFER_TO_DEV: > case SG_DXFER_FROM_DEV: > + if (hp->dxfer_len < 0) > + return false; > + return true; > + case SG_DXFER_TO_DEV: > case SG_DXFER_TO_FROM_DEV: > if (!hp->dxferp || hp->dxfer_len == 0) > return false; >
4.7.0-rc7+: Oops during boot with USB pen drive inserted
Hi, With Linus' latest and greatest, I get an oops when I boot my laptop with a pen drive inserted in any USB port. The oops message is: Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2) The oops seems to be 100% repeatable. If a USB pen drive is not inserted, the laptop boots successfully. I've taken a photograph of the oops and you can view it at http://s714.photobucket.com/user/chris2553/media/IMG_20160722_053841.jpg.html. At the top of the picture, I notice that the partitions on my actual boot disk are being reported as being on /dev/sdb, so it seems likely that, at this point, the pen drive is being seen as /dev/sda, although that has scrolled off the screen. I don't boot via a ramdisk - my kernel has ext4 built in. The grub2 entry is: menuentry "Krisux, Linux 4.7.0-rc7+" { insmod ext2 set root=(hd0,2) linux /boot/vmlinuz-4.7.0-rc7+ ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0 } (BTW, Krisux is not a real distro - it's just the name I have given my Linux from Scratch system.) The stack that is dumped is: dump_stack panic printk mount_block_root prepare_namespace kernel_init_freeable kernel_init ret_from_fork rest_init I realise I could work around this by specifying the boot partition by, say, its UUID, but I thought you would want me to report this anyway. Of course, I'm happy to provide any other information required and to test any fix, but I will be out and about for the next 14-16 hours, so it will be later tonight or maybe even tomorrow before I can respond. Chris
Re: 4.5 Regression - mouse not working after resume from suspend
I can only assume that although a non-working bluetooth mouse is a symptom of this regression, the silence of the bluetooth folks is because the fault does not lie in the BT subsystem. Consequently, I'm transferring the problem back to LKML in the hope that someone else can solve the problem. On 15/02/16 23:40, Chris Clayton wrote: > Hi, > > Is there anything else I can do to help diagnose this regression. > > To summarise my BT mouse does not work after resuming from suspend to disk or > ram. IT works perfectly in earlier 4.4, > 4.3 and 4.2 kernels. I bisected and found the first bad commit is: > > 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit > commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd > Author: Johan Hedberg <johan.hedb...@intel.com> > Date: Wed Nov 25 16:15:44 2015 +0200 > > Bluetooth: Perform HCI update for power on synchronously > > Johan requested additional information, which I provided. Checking the > archive at marc.info, it seems the mail didn't > make it to the mailing list. Maybe it exceeded a size limit, I don't know. > Anyway I copied the mail to Johan and Marcel. > > A bit more experimentation revealed that I can reactivate the mouse if I > restart the bluetooth daemon after the machine > resumes. > > Please let me know if I can provide anything else. > > Thanks > > Chris > > On 06/02/16 15:23, Chris Clayton wrote: >> Hi Johan, >> >> The information you requested has been captured from v4.5-rc2-340-g5af9c2e >> and is included below. >> >> On 06/02/16 14:33, Johan Hedberg wrote: >>> Hi Chris, >>> >>> On Sat, Feb 06, 2016, Chris Clayton wrote: >>>> On 06/02/16 11:38, Chris Clayton wrote: >>>>> On 06/02/16 08:37, Chris Clayton wrote: >>>>>> There seems to be a regression in resuming my laptop from a suspend >>>>>> to RAM or disk. The symptom is that my bluetooth >>>>>> mouse doesn't work after the resume. The kernel is built after a >>>>>> pull of Linus' tree this morning (v4.5-rc2-340-g5af9c2e). 
>>>>>> >>>>>> Attached is the output from dmesg showing the boot, suspend (to >>>>>> RAM) and resume. You'll see that during the resume, >>>>>> error -517 is being reported for some devices. Suspend/resume has worked >>>>>> perfectly with a 4.[234].x kernels. >>>>>> >>>>>> I'll start a bisection, but thought I'd give a heads up in case >>>>>> someone can see the problem before I get done with the >>>>>> bisect. >>>>> >>>>> The bisection ended up at: >>>>> >>>>> 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit >>>>> commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd >>>>> Author: Johan Hedberg <johan.hedb...@intel.com> >>>>> Date: Wed Nov 25 16:15:44 2015 +0200 >>>>> >>>>> Bluetooth: Perform HCI update for power on synchronously >>>>> >>>>> The request to update HCI during power on is always coming either from >>>>> hdev->req_workqueue or through an ioctl, so it's safe to use >>>>> hci_req_sync for it. This way we also eliminate potential races with >>>>> incoming mgmt commands or other actions while powering on. >>>>> >>>>> Part of this refactoring is the splitting of mgmt_powered() into >>>>> mgmt_power_on() and __mgmt_power_off() functions. The main reason is >>>>> the different requirements as far as hdev locking is concerned, as >>>>> highlighted with the __ prefix of the power off API. >>>>> >>>>> Since the power on in the case of clearing the AUTO_OFF flag cannot be >>>>> done synchronously in the set_powered mgmt handler, the hci_power_on >>>>> work callback is extended to cover this (which also simplifies the >>>>> set_powered helper a lot). >>>>> >>>>> Signed-off-by: Johan Hedberg <johan.hedb...@intel.com> >>>>> Signed-off-by: Marcel Holtmann <mar...@holtmann.org> >>>>> >>>>> :04 04 a093d0be66f39f99c33a6a4725b2330ca9b41d03 >>>>> a1eff79cec3ee7208e5aa200ab5069726bbeea8e M include >>>>> :04 04 d2d122193b33d45fcb9c2bc69f2024487a7528a0 >>>>> 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M net >>>> >>>> I've ju
Re: 4.5 Regression - mouse not working after resume from suspend
On 06/02/16 11:38, Chris Clayton wrote: > > > On 06/02/16 08:37, Chris Clayton wrote: >> There seems to be a regression in resuming my laptop from a suspend to RAM >> or disk. The symptom is that my bluetooth >> mouse doesn't work after the resume. The kernel is built after a pull of >> Linus' tree this morning (v4.5-rc2-340-g5af9c2e). >> >> Attached is the output from dmesg showing the boot, suspend (to RAM) and >> resume. You'll see that during the resume, >> error -517 is being reported for some devices. Suspend/resume has worked >> perfectly with a 4.[234].x kernels. >> >> I'll start a bisection, but thought I'd give a heads up in case someone can >> see the problem before I get done with the >> bisect. >> > > The bisection ended up at: > > 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit > commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd > Author: Johan Hedberg > Date: Wed Nov 25 16:15:44 2015 +0200 > > Bluetooth: Perform HCI update for power on synchronously > > The request to update HCI during power on is always coming either from > hdev->req_workqueue or through an ioctl, so it's safe to use > hci_req_sync for it. This way we also eliminate potential races with > incoming mgmt commands or other actions while powering on. > > Part of this refactoring is the splitting of mgmt_powered() into > mgmt_power_on() and __mgmt_power_off() functions. The main reason is > the different requirements as far as hdev locking is concerned, as > highlighted with the __ prefix of the power off API. > > Since the power on in the case of clearing the AUTO_OFF flag cannot be > done synchronously in the set_powered mgmt handler, the hci_power_on > work callback is extended to cover this (which also simplifies the > set_powered helper a lot). 
> > Signed-off-by: Johan Hedberg > Signed-off-by: Marcel Holtmann > > :04 04 a093d0be66f39f99c33a6a4725b2330ca9b41d03 > a1eff79cec3ee7208e5aa200ab5069726bbeea8e M include > :04 04 d2d122193b33d45fcb9c2bc69f2024487a7528a0 > 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M net > > I've just built a kernel at bf943cbf76ecd3b9838a80d5e08777b0f4ccc665 (the commit prior to the one the bisect landed on) and my BT mouse works fine after a suspend/resume. With a kernel built at 2ff13894cfb877cb3d02d96a8402202f0a6f3efd, the mouse does not work after resume. > The bisect log is: > > git bisect start > # bad: [5af9c2e19da6514a1a50b07d97d93b74a7711873] Merge branch 'akpm' > (patches from Andrew) > git bisect bad 5af9c2e19da6514a1a50b07d97d93b74a7711873 > # good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4 > git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc > # bad: [63b6da39bb38e8f1a1ef3180d32a39d6baf9da84] perf: Fix > perf_event_exit_task() race > git bisect bad 63b6da39bb38e8f1a1ef3180d32a39d6baf9da84 > # bad: [aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7] Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next > git bisect bad aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7 > # good: [60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af] Merge tag > 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs > git bisect good 60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af > # bad: [a188222b6ed29404ac2d4232d35d1fe0e77af370] net: Rename > NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK > git bisect bad a188222b6ed29404ac2d4232d35d1fe0e77af370 > # good: [1343c65f70ee1b1f968a08b30e1836a4e37116cd] fm10k: always check > init_hw for errors > git bisect good 1343c65f70ee1b1f968a08b30e1836a4e37116cd > # good: [bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9] Merge branch > 'for-4.5-ancestor-test' of > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup > git bisect good bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9 > # good: [a4fcad656e1100bdda9b0b752b93a1a276810469] fm10k: whitespace cleanups > git bisect 
good a4fcad656e1100bdda9b0b752b93a1a276810469 > # bad: [7302b9d90117496049dd4bfa28755f7c2ed55b27] ieee802154/adf7242: Driver > for ADF7242 MAC IEEE802154 > git bisect bad 7302b9d90117496049dd4bfa28755f7c2ed55b27 > # bad: [a0c38245153abe1fd844af9b166d1a5d5dafe7b1] Bluetooth: hci_intel: Use > shorter timeout for HCI commands > git bisect bad a0c38245153abe1fd844af9b166d1a5d5dafe7b1 > # good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast > connectable code to hci_request.c > git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665 > # bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting > Configuration Field > git bisect bad 742c59516822f4a4bc23b0961d88c569a7f1bf71 > # bad: [02c04afea93fbba7925984df455bc63e7d92da97] Bluetooth: Si
Re: 4.5 Regression - mouse not working after resume from suspend
On 06/02/16 08:37, Chris Clayton wrote:
> There seems to be a regression in resuming my laptop from a suspend to RAM or
> disk. The symptom is that my bluetooth mouse doesn't work after the resume.
> The kernel is built after a pull of Linus' tree this morning
> (v4.5-rc2-340-g5af9c2e).
>
> Attached is the output from dmesg showing the boot, suspend (to RAM) and
> resume. You'll see that during the resume, error -517 is being reported for
> some devices. Suspend/resume has worked perfectly with 4.[234].x kernels.
>
> I'll start a bisection, but thought I'd give a heads up in case someone can
> see the problem before I get done with the bisect.
>

The bisection ended up at:

2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
Author: Johan Hedberg
Date: Wed Nov 25 16:15:44 2015 +0200

    Bluetooth: Perform HCI update for power on synchronously

    The request to update HCI during power on is always coming either from
    hdev->req_workqueue or through an ioctl, so it's safe to use
    hci_req_sync for it. This way we also eliminate potential races with
    incoming mgmt commands or other actions while powering on.

    Part of this refactoring is the splitting of mgmt_powered() into
    mgmt_power_on() and __mgmt_power_off() functions. The main reason is
    the different requirements as far as hdev locking is concerned, as
    highlighted with the __ prefix of the power off API.

    Since the power on in the case of clearing the AUTO_OFF flag cannot be
    done synchronously in the set_powered mgmt handler, the hci_power_on
    work callback is extended to cover this (which also simplifies the
    set_powered helper a lot).

    Signed-off-by: Johan Hedberg
    Signed-off-by: Marcel Holtmann

:04 04 a093d0be66f39f99c33a6a4725b2330ca9b41d03 a1eff79cec3ee7208e5aa200ab5069726bbeea8e M include
:04 04 d2d122193b33d45fcb9c2bc69f2024487a7528a0 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M net

The bisect log is:

git bisect start
# bad: [5af9c2e19da6514a1a50b07d97d93b74a7711873] Merge branch 'akpm' (patches from Andrew)
git bisect bad 5af9c2e19da6514a1a50b07d97d93b74a7711873
# good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc
# bad: [63b6da39bb38e8f1a1ef3180d32a39d6baf9da84] perf: Fix perf_event_exit_task() race
git bisect bad 63b6da39bb38e8f1a1ef3180d32a39d6baf9da84
# bad: [aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7
# good: [60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af] Merge tag 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs
git bisect good 60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af
# bad: [a188222b6ed29404ac2d4232d35d1fe0e77af370] net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
git bisect bad a188222b6ed29404ac2d4232d35d1fe0e77af370
# good: [1343c65f70ee1b1f968a08b30e1836a4e37116cd] fm10k: always check init_hw for errors
git bisect good 1343c65f70ee1b1f968a08b30e1836a4e37116cd
# good: [bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9] Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect good bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9
# good: [a4fcad656e1100bdda9b0b752b93a1a276810469] fm10k: whitespace cleanups
git bisect good a4fcad656e1100bdda9b0b752b93a1a276810469
# bad: [7302b9d90117496049dd4bfa28755f7c2ed55b27] ieee802154/adf7242: Driver for ADF7242 MAC IEEE802154
git bisect bad 7302b9d90117496049dd4bfa28755f7c2ed55b27
# bad: [a0c38245153abe1fd844af9b166d1a5d5dafe7b1] Bluetooth: hci_intel: Use shorter timeout for HCI commands
git bisect bad a0c38245153abe1fd844af9b166d1a5d5dafe7b1
# good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast connectable code to hci_request.c
git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
# bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting Configuration Field
git bisect bad 742c59516822f4a4bc23b0961d88c569a7f1bf71
# bad: [02c04afea93fbba7925984df455bc63e7d92da97] Bluetooth: Simplify read_adv_features code
git bisect bad 02c04afea93fbba7925984df455bc63e7d92da97
# bad: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously
git bisect bad 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
# first bad commit: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously

Just shout if you need any additional diagnostics.

> Chris
>
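For anyone who wants to pull the verdicts out of a bisect log like the one above without reading it line by line, here is a minimal sketch. It assumes the log uses the "# good:", "# bad:" and "# first bad commit:" comment lines shown in this message; the names `ENTRY`, `SAMPLE` and `summarize` are made up for illustration and are not part of git. The sample is a short excerpt of the log, not the whole thing.

```python
import re

# Matches the comment lines seen in the bisect log above, for example:
#   # bad: [2ff13894...] Bluetooth: Perform HCI update for power on synchronously
ENTRY = re.compile(r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$")

# Excerpt of the log from this thread (last three test steps plus the result).
SAMPLE = """\
git bisect start
# good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast connectable code to hci_request.c
git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
# bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting Configuration Field
git bisect bad 742c59516822f4a4bc23b0961d88c569a7f1bf71
# bad: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously
git bisect bad 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
# first bad commit: [2ff13894cfb877cb3d02d96a8402202f0a6f3efd] Bluetooth: Perform HCI update for power on synchronously
"""


def summarize(log_text):
    """Count the tested steps and record the last 'good' entry and the first bad commit."""
    steps = 0
    last_good = None
    first_bad = None
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip the "git bisect ..." command lines
        verdict, sha, subject = match.groups()
        if verdict == "first bad commit":
            first_bad = (sha, subject)
        else:
            steps += 1
            if verdict == "good":
                last_good = (sha, subject)
    return {"steps": steps, "last_good": last_good, "first_bad": first_bad}
```

Note that the last "good" entry in a log is only the most recently tested good commit; it is not guaranteed to be the parent of the first bad commit, so a direct A/B test of the two kernels is still worth doing.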
4.5 Regression - mouse not working after resume from suspend
There seems to be a regression in resuming my laptop from a suspend to RAM or disk. The symptom is that my bluetooth mouse doesn't work after the resume. The kernel is built after a pull of Linus' tree this morning (v4.5-rc2-340-g5af9c2e).

Attached is the output from dmesg showing the boot, suspend (to RAM) and resume. You'll see that during the resume, error -517 is being reported for some devices. Suspend/resume has worked perfectly with 4.[234].x kernels.

I'll start a bisection, but thought I'd give a heads up in case someone can see the problem before I get done with the bisect.

Chris

[0.00] Linux version 4.5.0-rc2+ (chris@laptop) (gcc version 5.3.1 20160202 (GCC) ) #318 SMP PREEMPT Sat Feb 6 06:38:55 GMT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.5.0-rc2+ ro root=/dev/sda2 resume=/dev/sda6 rootfstype=ext4 net.ifnames=0
[0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable [0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS [0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable [0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved [0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable [0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved [0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable [0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS [0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved [0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable [0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.7 present. 
[0.00] DMI: Notebook W65_67SZ /W65_67SZ, BIOS 1.03.05 02/26/2014 [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-C write-protect [0.00] D-E7FFF uncachable [0.00] E8000-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask 7C write-back [0.00] 1 base 04 mask 7FE000 write-back [0.00] 2 base 00E000 mask 7FE000 uncachable [0.00] 3 base 00DE00 mask 7FFE00 uncachable [0.00] 4 base 00DD00 mask 7FFF00 uncachable [0.00] 5 base 041FE0 mask 7FFFE0 uncachable [0.00] 6 disabled [0.00] 7 disabled [0.00] 8 disabled [0.00] 9 disabled [0.00] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT [0.00] e820: update [mem 0xdd00-0x] usable ==> reserved [0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [880fd820] [0.00] Base memory trampoline at [88097000] 97000 size 24576 [0.00] Using GB pages for direct mapping [0.00] BRK [0x01a05000, 0x01a05fff] PGTABLE [0.00] BRK [0x01a06000, 0x01a06fff] PGTABLE [0.00] BRK [0x01a07000, 0x01a07fff] PGTABLE [0.00] BRK [0x01a08000, 0x01a08fff] PGTABLE [0.00] BRK [0x01a09000, 0x01a09fff]
Re: 4.5 Regression - mouse not working after resume from suspend
On 06/02/16 11:38, Chris Clayton wrote:
>
>
> On 06/02/16 08:37, Chris Clayton wrote:
>> There seems to be a regression in resuming my laptop from a suspend to RAM
>> or disk. The symptom is that my bluetooth mouse doesn't work after the
>> resume. The kernel is built after a pull of Linus' tree this morning
>> (v4.5-rc2-340-g5af9c2e).
>>
>> Attached is the output from dmesg showing the boot, suspend (to RAM) and
>> resume. You'll see that during the resume, error -517 is being reported for
>> some devices. Suspend/resume has worked perfectly with a 4.[234].x kernels.
>>
>> I'll start a bisection, but thought I'd give a heads up in case someone can
>> see the problem before I get done with the bisect.
>>
>
> The bisection ended up at:
>
> 2ff13894cfb877cb3d02d96a8402202f0a6f3efd is the first bad commit
> commit 2ff13894cfb877cb3d02d96a8402202f0a6f3efd
> Author: Johan Hedberg <johan.hedb...@intel.com>
> Date: Wed Nov 25 16:15:44 2015 +0200
>
>     Bluetooth: Perform HCI update for power on synchronously
>
>     The request to update HCI during power on is always coming either from
>     hdev->req_workqueue or through an ioctl, so it's safe to use
>     hci_req_sync for it. This way we also eliminate potential races with
>     incoming mgmt commands or other actions while powering on.
>
>     Part of this refactoring is the splitting of mgmt_powered() into
>     mgmt_power_on() and __mgmt_power_off() functions. The main reason is
>     the different requirements as far as hdev locking is concerned, as
>     highlighted with the __ prefix of the power off API.
>
>     Since the power on in the case of clearing the AUTO_OFF flag cannot be
>     done synchronously in the set_powered mgmt handler, the hci_power_on
>     work callback is extended to cover this (which also simplifies the
>     set_powered helper a lot).
>
>     Signed-off-by: Johan Hedberg <johan.hedb...@intel.com>
>     Signed-off-by: Marcel Holtmann <mar...@holtmann.org>
>
> :04 04 a093d0be66f39f99c33a6a4725b2330ca9b41d03 a1eff79cec3ee7208e5aa200ab5069726bbeea8e M include
> :04 04 d2d122193b33d45fcb9c2bc69f2024487a7528a0 0036e1ec2e125f2432cfd420b5f79ca133ec34f7 M net
>
>

I've just built a kernel at bf943cbf76ecd3b9838a80d5e08777b0f4ccc665 (the commit prior to the one the bisect landed on) and my BT mouse works fine after a suspend/resume. With a kernel built at 2ff13894cfb877cb3d02d96a8402202f0a6f3efd, the mouse does not work after resume.

> The bisect log is:
>
> git bisect start
> # bad: [5af9c2e19da6514a1a50b07d97d93b74a7711873] Merge branch 'akpm' (patches from Andrew)
> git bisect bad 5af9c2e19da6514a1a50b07d97d93b74a7711873
> # good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
> git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc
> # bad: [63b6da39bb38e8f1a1ef3180d32a39d6baf9da84] perf: Fix perf_event_exit_task() race
> git bisect bad 63b6da39bb38e8f1a1ef3180d32a39d6baf9da84
> # bad: [aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> git bisect bad aee3bfa3307cd0da2126bdc0ea359dabea5ee8f7
> # good: [60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af] Merge tag 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs
> git bisect good 60b7eca1dc2ec066916b3b7ac6ad89bea13cb9af
> # bad: [a188222b6ed29404ac2d4232d35d1fe0e77af370] net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
> git bisect bad a188222b6ed29404ac2d4232d35d1fe0e77af370
> # good: [1343c65f70ee1b1f968a08b30e1836a4e37116cd] fm10k: always check init_hw for errors
> git bisect good 1343c65f70ee1b1f968a08b30e1836a4e37116cd
> # good: [bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9] Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
> git bisect good bc9b145a092aca91a7f6ef40cdb3628b6ada7ec9
> # good: [a4fcad656e1100bdda9b0b752b93a1a276810469] fm10k: whitespace cleanups
> git bisect good a4fcad656e1100bdda9b0b752b93a1a276810469
> # bad: [7302b9d90117496049dd4bfa28755f7c2ed55b27] ieee802154/adf7242: Driver for ADF7242 MAC IEEE802154
> git bisect bad 7302b9d90117496049dd4bfa28755f7c2ed55b27
> # bad: [a0c38245153abe1fd844af9b166d1a5d5dafe7b1] Bluetooth: hci_intel: Use shorter timeout for HCI commands
> git bisect bad a0c38245153abe1fd844af9b166d1a5d5dafe7b1
> # good: [bf943cbf76ecd3b9838a80d5e08777b0f4ccc665] Bluetooth: Move fast connectable code to hci_request.c
> git bisect good bf943cbf76ecd3b9838a80d5e08777b0f4ccc665
> # bad: [742c59516822f4a4bc23b0961d88c569a7f1bf71] Bluetooth: Simplify setting Configuration Field
> g
Re: EXT4: new warnings from 4.3.0-rc2
On 09/21/15 15:55, Chris Clayton wrote:
> Thanks Ortwin.
>
> On 09/21/15 14:27, Ortwin Glück wrote:
>>> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature incompatibilities
>>> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature incompatibilities
>>
>> As the kernel doesn't know which FS your root is, it tries the whole list of
>> filesystems (init/do_mounts.c mount_block_root()). Since the removal of ext3,
>> now the ext4 code is responsible for mounting ext3. Since your FS is ext4 and
>> not ext3, the probe for ext3 fails. That's what the message tells you. You
>> get these even in previous kernels if you say N to ext3 during config.
>>
> No, I do not get the messages from 4.2.0 even though it is configured the
> same as 4.3.0-rc3 as far as EXT{2,3,4} is concerned:
>
> # CONFIG_EXT2_FS is not set
> # CONFIG_EXT3_FS is not set
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_USE_FOR_EXT2=y
> # CONFIG_EXT4_FS_POSIX_ACL is not set
> # CONFIG_EXT4_FS_SECURITY is not set
> # CONFIG_EXT4_ENCRYPTION is not set
> # CONFIG_EXT4_DEBUG is not set
> [chris:~/kernel/linux]$ cd ../linux-4.2.0/
> [chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config
> # CONFIG_EXT2_FS is not set
> # CONFIG_EXT3_FS is not set
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_USE_FOR_EXT23=y
> # CONFIG_EXT4_FS_POSIX_ACL is not set
> # CONFIG_EXT4_FS_SECURITY is not set
> # CONFIG_EXT4_ENCRYPTION is not set
> # CONFIG_EXT4_DEBUG is not set
>
> That's why I said they are new messages.
>
> I've just booted 4.1.7 and I get the messages from that kernel too. I wonder
> if there's a recent fix that has made it into 4.1.7, but not into 4.2.0. I'll
> apply Greg's 4.2.1-rc1 patch and see what I get with that.
>

Applying the 4.2.1-rc1 patch results in a kernel that emits the messages, so I guess my fix-not-yet-in-4.2 theory is right. I'll just ignore the messages. Sorry for the noise.

> Chris
>
>
>> If it bugs you, you can add a hint to your kernel command line:
>> rootfstype=ext4
>>
>> Ortwin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
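To make Ortwin's explanation concrete, here is a toy model of the probing he describes. It is a rough sketch and not the kernel's code: `probe_order` and `mount_root` are hypothetical names, and the order ext3, ext2, ext4 is an assumption inferred only from the order of the console messages quoted in this thread. The point is just that, with no hint, every candidate before the real filesystem is tried and fails, whereas rootfstype=ext4 narrows the list to one entry.

```python
def probe_order(registered, rootfstype=None):
    """Return the filesystem names a simplified root-mount loop would try, in order."""
    if rootfstype:
        # A rootfstype= hint replaces the full list (comma-separated names).
        return [name.strip() for name in rootfstype.split(",") if name.strip()]
    return list(registered)


def mount_root(actual_fs, registered, rootfstype=None):
    """Try each candidate in turn; return (mounted_as, list_of_attempts)."""
    attempts = []
    for name in probe_order(registered, rootfstype):
        attempts.append(name)
        if name == actual_fs:
            return name, attempts
    return None, attempts


# Assumed probe order, taken from the order of the messages in this thread.
CANDIDATES = ["ext3", "ext2", "ext4"]
```

In this model, `mount_root("ext4", CANDIDATES)` makes two failed attempts before succeeding, matching the two "couldn't mount as ..." lines, while `mount_root("ext4", CANDIDATES, rootfstype="ext4")` makes none.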
Re: EXT4: new warnings from 4.3.0-rc2
Thanks Ortwin.

On 09/21/15 14:27, Ortwin Glück wrote:
>> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature incompatibilities
>> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature incompatibilities
>
> As the kernel doesn't know which FS your root is, it tries the whole list of
> filesystems (init/do_mounts.c mount_block_root()). Since the removal of ext3,
> now the ext4 code is responsible for mounting ext3. Since your FS is ext4 and
> not ext3, the probe for ext3 fails. That's what the message tells you. You
> get these even in previous kernels if you say N to ext3 during config.
>

No, I do not get the messages from 4.2.0 even though it is configured the same as 4.3.0-rc3 as far as EXT{2,3,4} is concerned:

# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
[chris:~/kernel/linux]$ cd ../linux-4.2.0/
[chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set

That's why I said they are new messages.

I've just booted 4.1.7 and I get the messages from that kernel too. I wonder if there's a recent fix that has made it into 4.1.7, but not into 4.2.0. I'll apply Greg's 4.2.1-rc1 patch and see what I get with that.

Chris

> If it bugs you, you can add a hint to your kernel command line:
> rootfstype=ext4
>
> Ortwin
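When comparing two grep listings like the ones in this message by eye, a one-symbol difference is easy to miss. A minimal sketch of doing the comparison mechanically follows; `parse_config`, `diff_configs` and the two sample strings are illustrative names only, and the sample text is copied from the listings in this message (the first from the current tree, the second from linux-4.2.0).

```python
def parse_config(text):
    """Map each CONFIG_ symbol to its value; '# CONFIG_X is not set' maps to 'n'."""
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(old, new):
    """Return {symbol: (old_value, new_value)} for symbols that differ; None means absent."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


CONFIG_4_2 = """\
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
"""

CONFIG_4_3 = """\
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
"""
```

Run on these two listings, the only difference reported is the name of one symbol: CONFIG_EXT4_USE_FOR_EXT23 in the 4.2.0 listing versus CONFIG_EXT4_USE_FOR_EXT2 in the newer tree. Every other line is identical.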
EXT4: new warnings from 4.3.0-rc2
Hi,

I've just built and booted 4.3.0-rc2 and I'm seeing the following new messages on the console during boot up:

[2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature incompatibilities
[2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature incompatibilities

They are immediately followed by:

[2.507948] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[3.549523] EXT4-fs (sda2): re-mounted. Opts: (null)

which are the messages I normally see (from 4.2.0 and earlier).

sda2 is my root partition and is mounted OK, so my system is operating as before, but I thought you would want a heads up about these (slightly alarming) new console messages.

The output from dmesg is attached, in case it helps.

Chris

[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 4.3.0-rc2 (chris@laptop) (gcc version 5.2.1 20150915 (GCC) ) #251 SMP PREEMPT Mon Sep 21 07:50:42 BST 2015
[0.00] Command line: root=/dev/sda2 ro resume=/dev/sda6
[0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 0x340 bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable [0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS [0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable [0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved [0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable [0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved [0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable [0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS [0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved [0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable [0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.7 present. 
[0.00] DMI: Notebook W65_67SZ /W65_67SZ, BIOS 1.03.05 02/26/2014 [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-C write-protect [0.00] D-E7FFF uncachable [0.00] E8000-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask 7C write-back [0.00] 1 base 04 mask 7FE000 write-back [0.00] 2 base 00E000 mask 7FE000 uncachable [0.00] 3 base 00DE00 mask 7FFE00 uncachable [0.00] 4 base 00DD00 mask 7FFF00 uncachable [0.00] 5 base 041FE0 mask 7FFFE0 uncachable [0.00] 6 disabled [0.00] 7 disabled [0.00] 8 disabled [0.00] 9 disabled [0.00] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT [0.00] e820: update [mem 0xdd00-0x] usable ==> reserved [0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [880fd820] [0.00] Base memory trampoline at [88097000] 97000 size 24576 [0.00] Using GB pages for direct mapping [0.00] init_memory_mapping: [mem 0x-0x000f] [0.00] [mem
Re: EXT4: new warnings from 4.3.0-rc2
On 09/21/15 15:55, Chris Clayton wrote: > Thanks Ortwin. > > On 09/21/15 14:27, Ortwin Glück wrote: >>> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature >>> incompatibilities >>> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature >>> incompatibilities >> >> As the kernel doesn't know which FS your root is, it tries the whole list of >> filesystems (init/do_mounts.c >> mount_block_root()). Since the removal of ext3, now the ext4 code is >> responsbile for mounting ext3. Since your FS is >> ext4 and not ext3, the probe for ext3 fails. That's what the message tells >> you. You get these even in previous kernels >> if you say N to ext3 during config. >> > No, I do not get the messages from 4.2.0 even though it is configured the > same as 4.3.0-rc3 as far as EXT{2,3,4} is > concerned: > > # CONFIG_EXT2_FS is not set > # CONFIG_EXT3_FS is not set > CONFIG_EXT4_FS=y > CONFIG_EXT4_USE_FOR_EXT2=y > # CONFIG_EXT4_FS_POSIX_ACL is not set > # CONFIG_EXT4_FS_SECURITY is not set > # CONFIG_EXT4_ENCRYPTION is not set > # CONFIG_EXT4_DEBUG is not set > [chris:~/kernel/linux]$ cd ../linux-4.2.0/ > [chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config > # CONFIG_EXT2_FS is not set > # CONFIG_EXT3_FS is not set > CONFIG_EXT4_FS=y > CONFIG_EXT4_USE_FOR_EXT23=y > # CONFIG_EXT4_FS_POSIX_ACL is not set > # CONFIG_EXT4_FS_SECURITY is not set > # CONFIG_EXT4_ENCRYPTION is not set > # CONFIG_EXT4_DEBUG is not set > > That's why I said they are new messages. > > I've just booted 4.1.7 and I get the messages from that kernel too. I wonder > if there's a recent fix that has made it > into 4.1.7, but not into 4.2.0. I'll apply Greg's 4.2.1-rc1 patch and see > what I get with that. > Applying the 4.2.1-rc1 patch results in a kernel that emits the messages, so I guess my fix-not-yet-in-4.2 theory is right. I'll just ignore the messages. Sorry for the noise. 
> Chris > > >> If it bugs you, you can add a hint to your kernel command line: >> rootfstype=ext4 >> >> Ortwin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: EXT4: new warnings from 4.3.0-rc2
Thanks Ortwin. On 09/21/15 14:27, Ortwin Glück wrote: >> [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature >> incompatibilities >> [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature >> incompatibilities > > As the kernel doesn't know which FS your root is, it tries the whole list of > filesystems (init/do_mounts.c > mount_block_root()). Since the removal of ext3, now the ext4 code is > responsbile for mounting ext3. Since your FS is > ext4 and not ext3, the probe for ext3 fails. That's what the message tells > you. You get these even in previous kernels > if you say N to ext3 during config. > No, I do not get the messages from 4.2.0 even though it is configured the same as 4.3.0-rc3 as far as EXT{2,3,4} is concerned: # CONFIG_EXT2_FS is not set # CONFIG_EXT3_FS is not set CONFIG_EXT4_FS=y CONFIG_EXT4_USE_FOR_EXT2=y # CONFIG_EXT4_FS_POSIX_ACL is not set # CONFIG_EXT4_FS_SECURITY is not set # CONFIG_EXT4_ENCRYPTION is not set # CONFIG_EXT4_DEBUG is not set [chris:~/kernel/linux]$ cd ../linux-4.2.0/ [chris:~/kernel/linux-4.2.0]$ grep EXT[234] .config # CONFIG_EXT2_FS is not set # CONFIG_EXT3_FS is not set CONFIG_EXT4_FS=y CONFIG_EXT4_USE_FOR_EXT23=y # CONFIG_EXT4_FS_POSIX_ACL is not set # CONFIG_EXT4_FS_SECURITY is not set # CONFIG_EXT4_ENCRYPTION is not set # CONFIG_EXT4_DEBUG is not set That's why I said they are new messages. I've just booted 4.1.7 and I get the messages from that kernel too. I wonder if there's a recent fix that has made it into 4.1.7, but not into 4.2.0. I'll apply Greg's 4.2.1-rc1 patch and see what I get with that. Chris > If it bugs you, you can add a hint to your kernel command line: > rootfstype=ext4 > > Ortwin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
EXT4: new warnings from 4.3.0-rc2
Hi, I've just built and booted 4.3.0-rc2 and I'm seeing the following new messages on the console during boot up: [2.481399] EXT4-fs (sda2): couldn't mount as ext3 due to feature incompatibilities [2.482426] EXT4-fs (sda2): couldn't mount as ext2 due to feature incompatibilities They are immediately followed by: [2.507948] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null) [3.549523] EXT4-fs (sda2): re-mounted. Opts: (null) and they are the messages I normally see (from 4.2.0 and earlier). sda2 is my root partition and is mounted OK, so my system is operating as before, but I thought you would want a heads up about these (slightly alarming) new console messages. The output from dmesg is attached, in case it helps. Chris [0.00] Initializing cgroup subsys cpu [0.00] Linux version 4.3.0-rc2 (chris@laptop) (gcc version 5.2.1 20150915 (GCC) ) #251 SMP PREEMPT Mon Sep 21 07:50:42 BST 2015 [0.00] Command line: root=/dev/sda2 ro resume=/dev/sda6 [0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100 [0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers' [0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers' [0.00] x86/fpu: Enabled xstate features 0x7, context size is 0x340 bytes, using 'standard' format. [0.00] x86/fpu: Using 'eager' FPU context switches. 
[0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xd7216fff] usable [0.00] BIOS-e820: [mem 0xd7217000-0xd721dfff] ACPI NVS [0.00] BIOS-e820: [mem 0xd721e000-0xd7a0cfff] usable [0.00] BIOS-e820: [mem 0xd7a0d000-0xd7ca1fff] reserved [0.00] BIOS-e820: [mem 0xd7ca2000-0xdb4d] usable [0.00] BIOS-e820: [mem 0xdb4e-0xdb82dfff] reserved [0.00] BIOS-e820: [mem 0xdb82e000-0xdb88afff] usable [0.00] BIOS-e820: [mem 0xdb88b000-0xdb9bcfff] ACPI NVS [0.00] BIOS-e820: [mem 0xdb9bd000-0xdbffefff] reserved [0.00] BIOS-e820: [mem 0xdbfff000-0xdbff] usable [0.00] BIOS-e820: [mem 0xdd00-0xdf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00041fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.7 present. 
[0.00] DMI: Notebook W65_67SZ /W65_67SZ, BIOS 1.03.05 02/26/2014
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x41fe00 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00] 0-9 write-back
[0.00] A-B uncachable
[0.00] C-C write-protect
[0.00] D-E7FFF uncachable
[0.00] E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00] 0 base 00 mask 7C write-back
[0.00] 1 base 04 mask 7FE000 write-back
[0.00] 2 base 00E000 mask 7FE000 uncachable
[0.00] 3 base 00DE00 mask 7FFE00 uncachable
[0.00] 4 base 00DD00 mask 7FFF00 uncachable
[0.00] 5 base 041FE0 mask 7FFFE0 uncachable
[0.00] 6 disabled
[0.00] 7 disabled
[0.00] 8 disabled
[0.00] 9 disabled
[0.00] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
[0.00] e820: update [mem 0xdd00-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xdc000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fd820-0x000fd82f] mapped at [880fd820]
[0.00] Base memory trampoline at [88097000] 97000 size 24576
[0.00] Using GB pages for direct mapping
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00] [mem
Re: [PATCH] iommu: prompt for IOMMU_IO_PGTABLE_LPAE on ARM archs only
On 02/16/15 16:32, Will Deacon wrote:
> Hi Chris,
>
> On Sun, Feb 15, 2015 at 11:17:19AM +0000, Chris Clayton wrote:
>> When running "make oldconfig" for an x86_64 kernel, I was prompted for a
>> setting for IOMMU_IO_PGTABLE_LPAE. From the prompt and the help text it
>> appears that this config item is relevant to ARMv7/v8 only. This patch
>> prevents the prompt on non-ARM architectures. Compile tested building a
>> cross-compiled x86_64 kernel in an x86 user space. The resultant kernel
>> boots fine and I am running it now.
>>
>> Fixes: e1d3c0fd701df831169b116cd5c5d6203ac07f70
>> Cc: will.dea...@arm.com
>> Signed-off-by: Chris Clayton <chris2...@googlemail.com>
>>
>> --- linux/drivers/iommu/Kconfig.orig 2015-02-15 09:44:01.235927248 +
>> +++ linux/drivers/iommu/Kconfig 2015-02-15 09:44:41.131926434 +
>> @@ -22,6 +22,7 @@ config IOMMU_IO_PGTABLE
>>
>>  config IOMMU_IO_PGTABLE_LPAE
>>  	bool "ARMv7/v8 Long Descriptor Format"
>> +	depends on ARM || ARM64
>>  	select IOMMU_IO_PGTABLE
>>  	help
>>  	  Enable support for the ARM long descriptor pagetable format.
>
> What's the problem with this? The page-table code is intentionally
> decoupled from the CPU architecture and having this boot-tested on x86
> found some real bugs that I'm currently fixing. Sure, you probably don't
> need this on your box, but it's not default y and you don't have to
> select it.
>

There's no real problem except that, as I said, the prompt and the help text suggest that the config is relevant to the ARM architecture only. When it popped up on x86_64, it was a surprise. As you say, I can simply answer "N", but the prompt and the help need correcting, because for an ordinary Joe User like me, it's misleading.

> Will
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] iommu: prompt for IOMMU_IO_PGTABLE_LPAE on ARM archs only
When running "make oldconfig" for an x86_64 kernel, I was prompted for a setting for IOMMU_IO_PGTABLE_LPAE. From the prompt and the help text it appears that this config item is relevant to ARMv7/v8 only. This patch prevents the prompt on non-ARM architectures. Compile tested building a cross-compiled x86_64 kernel in an x86 user space. The resultant kernel boots fine and I am running it now.

Fixes: e1d3c0fd701df831169b116cd5c5d6203ac07f70
Cc: will.dea...@arm.com
Signed-off-by: Chris Clayton <chris2...@googlemail.com>

--- linux/drivers/iommu/Kconfig.orig 2015-02-15 09:44:01.235927248 +
+++ linux/drivers/iommu/Kconfig 2015-02-15 09:44:41.131926434 +
@@ -22,6 +22,7 @@ config IOMMU_IO_PGTABLE

 config IOMMU_IO_PGTABLE_LPAE
 	bool "ARMv7/v8 Long Descriptor Format"
+	depends on ARM || ARM64
 	select IOMMU_IO_PGTABLE
 	help
 	  Enable support for the ARM long descriptor pagetable format.
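The patch above hides the prompt on non-ARM architectures, while Will Deacon's reply in this thread notes that the page-table code is deliberately kept buildable elsewhere because testing on x86 found real bugs. One possible middle ground is sketched below. It is an illustration only and was not proposed in the thread: it assumes the kernel's generic COMPILE_TEST symbol is an acceptable gate here, so that ordinary x86 configurations are no longer prompted but configurations that opt in to build coverage can still enable the option.

```kconfig
config IOMMU_IO_PGTABLE_LPAE
	bool "ARMv7/v8 Long Descriptor Format"
	# Assumption for illustration: COMPILE_TEST keeps the option
	# reachable on non-ARM builds that opt in to extra coverage.
	depends on ARM || ARM64 || COMPILE_TEST
	select IOMMU_IO_PGTABLE
	help
	  Enable support for the ARM long descriptor pagetable format.
```

With such a dependency, "make oldconfig" on x86_64 would only ask about IOMMU_IO_PGTABLE_LPAE when COMPILE_TEST is already set. Whether that preserves enough of the x86 boot testing Will describes is a judgment call for the maintainers.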
Re: BUG in 3.19.0-rc3+
Thanks Konstantin.

[snip]
>>>
>>> Looks like degree (%edx) is 1 on anon-vma destruction.
>>> Probably I've overlooked some weird corner case in vma splitting/merging.
>>>
>>> Could you try this patch? It disables vma merging and eliminates half
>>> of the complicated paths.
>>> As I see it, merging is optional; everything should work fine without it.
>>>
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -1048,7 +1048,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>>  	 * We later require that vma->vm_flags == vm_flags,
>>>  	 * so this tests vma->vm_flags & VM_SPECIAL, too.
>>>  	 */
>>> -	if (vm_flags & VM_SPECIAL)
>>> +	if (1)
>>>  		return NULL;
>>>
>>>  	if (prev)
>>>
>>> Code from your oops.
>>>
>>> Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 8b 45 10 48 8d 55 10 48
>>> 83 e8 10 49 39 d6 74 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e <0f>
>>> 0b 0f 1f 40 00 e8 6b fc ff ff eb 9a 66 0f 1f 84 00 00 00 00
>>> All code
>>>
>>>    0: 00 ad de 48 89 43    add    %ch,0x438948de(%rbp)
>>>    6: 18 e8                sbb    %ch,%al
>>>    8: c5 f9 00             (bad)
>>>    b: 00 48 8b             add    %cl,-0x75(%rax)
>>>    e: 45 10 48 8d          adc    %r9b,-0x73(%r8)
>>>   12: 55                   push   %rbp
>>>   13: 10 48 83             adc    %cl,-0x7d(%rax)
>>>   16: e8 10 49 39 d6       callq  0xd639492b
>>>   1b: 74 54                je     0x71
>>>   1d: 48 8b 7d 08          mov    0x8(%rbp),%rdi
>>>   21: 48 89 eb             mov    %rbp,%rbx
>>>   24: 8b 57 34             mov    0x34(%rdi),%edx
>>>   27: 85 d2                test   %edx,%edx
>>>   29: 74 9e                je     0xffc9
>>>   2b:* 0f 0b               ud2    <-- trapping instruction
>>>   2d: 0f 1f 40 00          nopl   0x0(%rax)
>>>   31: e8 6b fc ff ff       callq  0xfca1
>>>   36: eb 9a                jmp    0xffd2
>>>   38: 66                   data16
>>>   39: 0f                   .byte 0xf
>>>   3a: 1f                   (bad)
>>>   3b: 84 00                test   %al,(%rax)
>>>   3d: 00 00                add    %al,(%rax)
>>> ...
>>>
>>> Code starting with the faulting instruction
>>> ===
>>>    0: 0f 0b                ud2
>>>    2: 0f 1f 40 00          nopl   0x0(%rax)
>>>    6: e8 6b fc ff ff       callq  0xfc76
>>>    b: eb 9a                jmp    0xffa7
>>>    d: 66                   data16
>>>    e: 0f                   .byte 0xf
>>>    f: 1f                   (bad)
>>>   10: 84 00                test   %al,(%rax)
>>>   12: 00 00                add    %al,(%rax)
>>>
>>>
>>> +Added Oded Gabbay into cc, he's reported this problem too.
>>>
>> Thanks for the fast reply.
>>
>> I applied the patch and tested it. I wasn't able to reproduce *my* problem,
>> so you are definitely in the right direction :)
>
> Ok. I've found something. Try patch from attachment.
>

Your patch has fixed the BUG for me. Thank you.

Tested-by: Chris Clayton

>>
>> Oded
>>
>>>>
>>>> In case it helps, I've attached the xz-compressed related config file.
>>>>
>>>> Chris
>>>>
>>>>>
>>>>> I've attached the full kernel log file for that boot.
>>>>>
>>>>> Chris
>>>>>
Re: BUG in 3.19.0-rc3+
On 01/11/15 09:52, Oded Gabbay wrote:
>
>
> On 01/11/2015 11:37 AM, Konstantin Khlebnikov wrote:
>> On Sun, Jan 11, 2015 at 11:16 AM, Chris Clayton wrote:
>>> Hi,
>>>
>>> I've done the bisect and the outcome is below, but, because I almost always
>>> forget to mention it, I'll say here that I
>>> am running a 32 bit user space on a 64 bit kernel.
>>>
>>> On 01/10/15 20:17, Chris Clayton wrote:
>>>> Hi,
>>>>
>>>> I'm getting a BUG report from a kernel built from a pull (earlier
>>>> today) of the current development kernel
>>>> (running git describe gives v3.19-rc3-169-geb74926). So that I have
>>>> useable wireless networking, I have also applied the
>>>> latest seven iwlwifi patches from the wireless-drivers git tree. Prior to
>>>> today's pull, I was not seeing anything
>>>> unusual in dmesg.
>>>>
>>>> The BUG reported is as follows:
>>>>
>>>> Jan 10 19:41:32 laptop kernel: ------------[ cut here ]------------
>>>> Jan 10 19:41:32 laptop kernel: kernel BUG at mm/rmap.c:399!
>>>> Jan 10 19:41:32 laptop kernel: invalid opcode: [#1] PREEMPT SMP
>>>> Jan 10 19:41:32 laptop kernel: Modules linked in: rfcomm snd_hda_codec_via iwlmvm coretemp snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel mac80211 hwmon snd_hda_controller x86_pkg_temp_thermal acpi_cpufreq iwlwifi cfg80211 snd_hda_codec snd_hwdep
>>>> Jan 10 19:41:32 laptop kernel: CPU: 1 PID: 353 Comm: fc-cache Not tainted 3.19.0-rc3+ #42
>>>> Jan 10 19:41:32 laptop kernel: Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
>>>> Jan 10 19:41:32 laptop kernel: task: 8800da98c5c0 ti: 880408dd4000 task.ti: 880408dd4000
>>>> Jan 10 19:41:32 laptop kernel: RIP: 0010:[810ef7ea] [810ef7ea] unlink_anon_vmas+0x17a/0x200
>>>> Jan 10 19:41:33 laptop kernel: RSP: 0018:880408dd7d88 EFLAGS: 00010286
>>>> Jan 10 19:41:33 laptop kernel: RAX: 88040b79e150 RBX: 88040b79e140 RCX:
>>>> Jan 10 19:41:33 laptop kernel: RDX: 0001 RSI: 880409f04360 RDI: 880409f04320
>>>> Jan 10 19:41:33 laptop kernel: RBP: 88040cb13278 R08: R09: 88040d801c00
>>>> Jan 10 19:41:33 laptop kernel: R10: 88041fa546e0 R11: 88040b79e160 R12: 880409f04320
>>>> Jan 10 19:41:33 laptop kernel: R13: 88040cb13278 R14: 88040cb13288 R15: 88040cb13210
>>>> Jan 10 19:41:33 laptop kernel: FS: () GS:88041fa4() knlGS:
>>>> Jan 10 19:41:33 laptop kernel: CS: 0010 DS: 002b ES: 002b CR0: 80050033
>>>> Jan 10 19:41:33 laptop kernel: CR2: f722c8d4 CR3: 0004082a8000 CR4: 001407e0
>>>> Jan 10 19:41:33 laptop kernel: Stack:
>>>> Jan 10 19:41:33 laptop kernel: 88040d6cfbd8 88040d6cfba0 88040cecd160 88040cb13210
>>>> Jan 10 19:41:33 laptop kernel: 88040cbbb630 f7151000 880408dd7e28
>>>> Jan 10 19:41:33 laptop kernel: 810e3633
>>>> Jan 10 19:41:33 laptop kernel: Call Trace:
>>>> Jan 10 19:41:33 laptop kernel: [810e3633] ? free_pgtables+0x83/0xf0
>>>> Jan 10 19:41:34 laptop kernel: [810ec3c3] ? exit_mmap+0xc3/0x150
>>>> Jan 10 19:41:34 laptop kernel: [8103980d] ? __do_page_fault+0x17d/0x4b0
>>>> Jan 10 19:41:34 laptop kernel: [81042a21] ? mmput+0x21/0xc0
>>>> Jan 10 19:41:34 laptop kernel: [8104673d] ? do_exit+0x26d/0xa50
>>>> Jan 10 19:41:34 laptop kernel: [8111fe89] ? mntput_no_expire+0x9/0x140
>>>> Jan 10 19:41:34 laptop kernel: [8105ca1c] ? task_work_run+0xbc/0xf0
>>>> Jan 10 19:41:34 laptop kernel: [81047d44] ? do_group_exit+0x34/0xb0
>>>> Jan 10 19:41:34 laptop kernel: [81047dcf] ? SyS_exit_group+0xf/0x10
>>>> Jan 10 19:41:34 laptop kernel: [815e0f9f] ? sysenter_dispatch+0x7/0x1e
>>>> Jan 10 19:41:34 laptop kernel: Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 8b 45 10 48 8d 55 10 48 83 e8 10 49 39
Re: BUG in 3.19.0-rc3+
Hi,

I've done the bisect and the outcome is below, but, because I almost always forget to mention it, I'll say here that I am running a 32 bit user space on a 64 bit kernel.

On 01/10/15 20:17, Chris Clayton wrote:
> Hi,
>
> I'm getting a BUG report from a kernel built from a pull (earlier today) of the current development kernel
> (running git describe gives v3.19-rc3-169-geb74926). So that I have useable wireless networking, I have also applied the
> latest seven iwlwifi patches from the wireless-drivers git tree. Prior to today's pull, I was not seeing anything
> unusual in dmesg.
>
> The BUG reported is as follows:
>
> Jan 10 19:41:32 laptop kernel: ------------[ cut here ]------------
> Jan 10 19:41:32 laptop kernel: kernel BUG at mm/rmap.c:399!
> Jan 10 19:41:32 laptop kernel: invalid opcode: [#1] PREEMPT SMP
> Jan 10 19:41:32 laptop kernel: Modules linked in: rfcomm snd_hda_codec_via iwlmvm coretemp snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel mac80211 hwmon snd_hda_controller x86_pkg_temp_thermal acpi_cpufreq iwlwifi cfg80211 snd_hda_codec snd_hwdep
> Jan 10 19:41:32 laptop kernel: CPU: 1 PID: 353 Comm: fc-cache Not tainted 3.19.0-rc3+ #42
> Jan 10 19:41:32 laptop kernel: Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
> Jan 10 19:41:32 laptop kernel: task: 8800da98c5c0 ti: 880408dd4000 task.ti: 880408dd4000
> Jan 10 19:41:32 laptop kernel: RIP: 0010:[810ef7ea] [810ef7ea] unlink_anon_vmas+0x17a/0x200
> Jan 10 19:41:33 laptop kernel: RSP: 0018:880408dd7d88 EFLAGS: 00010286
> Jan 10 19:41:33 laptop kernel: RAX: 88040b79e150 RBX: 88040b79e140 RCX:
> Jan 10 19:41:33 laptop kernel: RDX: 0001 RSI: 880409f04360 RDI: 880409f04320
> Jan 10 19:41:33 laptop kernel: RBP: 88040cb13278 R08: R09: 88040d801c00
> Jan 10 19:41:33 laptop kernel: R10: 88041fa546e0 R11: 88040b79e160 R12: 880409f04320
> Jan 10 19:41:33 laptop kernel: R13: 88040cb13278 R14: 88040cb13288 R15: 88040cb13210
> Jan 10 19:41:33 laptop kernel: FS: () GS:88041fa4() knlGS:
> Jan 10 19:41:33 laptop kernel: CS: 0010 DS: 002b ES: 002b CR0: 80050033
> Jan 10 19:41:33 laptop kernel: CR2: f722c8d4 CR3: 0004082a8000 CR4: 001407e0
> Jan 10 19:41:33 laptop kernel: Stack:
> Jan 10 19:41:33 laptop kernel: 88040d6cfbd8 88040d6cfba0 88040cecd160 88040cb13210
> Jan 10 19:41:33 laptop kernel: 88040cbbb630 f7151000 880408dd7e28
> Jan 10 19:41:33 laptop kernel: 810e3633
> Jan 10 19:41:33 laptop kernel: Call Trace:
> Jan 10 19:41:33 laptop kernel: [810e3633] ? free_pgtables+0x83/0xf0
> Jan 10 19:41:34 laptop kernel: [810ec3c3] ? exit_mmap+0xc3/0x150
> Jan 10 19:41:34 laptop kernel: [8103980d] ? __do_page_fault+0x17d/0x4b0
> Jan 10 19:41:34 laptop kernel: [81042a21] ? mmput+0x21/0xc0
> Jan 10 19:41:34 laptop kernel: [8104673d] ? do_exit+0x26d/0xa50
> Jan 10 19:41:34 laptop kernel: [8111fe89] ? mntput_no_expire+0x9/0x140
> Jan 10 19:41:34 laptop kernel: [8105ca1c] ? task_work_run+0xbc/0xf0
> Jan 10 19:41:34 laptop kernel: [81047d44] ? do_group_exit+0x34/0xb0
> Jan 10 19:41:34 laptop kernel: [81047dcf] ? SyS_exit_group+0xf/0x10
> Jan 10 19:41:34 laptop kernel: [815e0f9f] ? sysenter_dispatch+0x7/0x1e
> Jan 10 19:41:34 laptop kernel: Code: 00 ad de 48 89 43 18 e8 c5 f9 00 00 48 8b 45 10 48 8d 55 10 48 83 e8 10 49 39 d6 74 54 48 8b 7d 08 48 89 eb 8b 57 34 85 d2 74 9e <0f> 0b 0f 1f 40 00 e8 6b fc ff ff eb 9a 66 0f 1f 84 00 00 00 00
> Jan 10 19:41:34 laptop kernel: RIP [810ef7ea] unlink_anon_vmas+0x17a/0x200
> Jan 10 19:41:34 laptop kernel: RSP 880408dd7d88
> Jan 10 19:41:34 laptop kernel: ---[ end trace 4aa713b2a9aa664b ]---
> Jan 10 19:41:34 laptop kernel: Fixing recursive fault but reboot is needed!
> Jan 10 19:41:34 laptop kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)

[snip]
>
> I won't get time tonight, but I can bisect it tomorrow, so this is just a heads up in case the problem (and fix) jumps
> out at anyone. Before I bisect I'll build and run a kernel without the iwlwifi patches.

The bisect ended up at:

7a3ef208e662f4b63d43a23f61a64a129c525bbc is the first bad commit
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc
Author: Konstantin Khlebnikov <koc...@gmail.com>
Date: Thu Jan 8 14:32:15 2015 -0800

    mm: prevent endless growth of anon_vma hierarchy

    Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level.

    This patch adds heuristic which decides