RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-14 Thread Joakim Zhang

> -Original Message-
> From: Joakim Zhang 
> Sent: 2021年4月14日 16:07
> To: Thierry Reding 
> Cc: David S. Miller ; Jakub Kicinski ;
> Jon Hunter ; Giuseppe Cavallaro
> ; Alexandre Torgue ;
> Jose Abreu ; net...@vger.kernel.org; Linux Kernel
> Mailing List ; linux-tegra
> 
> Subject: RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> > -Original Message-
> > From: Thierry Reding 
> > Sent: 2021年4月14日 15:41
> > To: Joakim Zhang 
> > Cc: David S. Miller ; Jakub Kicinski
> > ; Jon Hunter ; Giuseppe
> > Cavallaro ; Alexandre Torgue
> > ; Jose Abreu ;
> > net...@vger.kernel.org; Linux Kernel Mailing List
> > ; linux-tegra
> > 
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > when mac resume back
> >
> > On Wed, Apr 14, 2021 at 02:18:58AM +, Joakim Zhang wrote:
> > >
> > > > -Original Message-
> > > > From: Thierry Reding 
> > > > Sent: 2021年4月14日 0:07
> > > > To: David S. Miller ; Jakub Kicinski
> > > > 
> > > > Cc: Joakim Zhang ; Jon Hunter
> > > > ; Giuseppe Cavallaro
> > > > ; Alexandre Torgue
> > > > ; Jose Abreu ;
> > > > net...@vger.kernel.org; Linux Kernel Mailing List
> > > > ; linux-tegra
> > > > 
> > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > > when mac resume back
> > > >
> > > > On Tue, Apr 13, 2021 at 12:13:01PM +, Joakim Zhang wrote:
> > > > >
> > > > > Hi Jon,
> > > > >
> > > > > > -Original Message-
> > > > > > From: Jon Hunter 
> > > > > > Sent: 2021年4月13日 16:41
> > > > > > To: Joakim Zhang ; Giuseppe Cavallaro
> > > > > > ; Alexandre Torgue
> > > > > > ; Jose Abreu 
> > > > > > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > > > > > ; linux-tegra
> > > > > > ; Jakub Kicinski
> > > > > > 
> > > > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx
> > > > > > buffers when mac resume back
> > > > > >
> > > > > >
> > > > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > > > >
> > > > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > >>> In answer to your question, resuming from suspend does
> > > > > > >>> work on this board without your change. We have been
> > > > > > >>> testing suspend/resume now on this board since Linux v5.8
> > > > > > >>> and so we have the ability to bisect such regressions. So
> > > > > > >>> it is clear to me that this is the change that caused
> > > > > > this, but I am not sure why.
> > > > > > >>
> > > > > > >> Yes, I know this issue is regression caused by my patch. I
> > > > > > >> just want to
> > > > > > analyze the potential reasons. Due to the code change only
> > > > > > related to the page recycle and reallocate.
> > > > > > >> So I guess if this page operate need IOMMU works when IOMMU
> > > > > > >> is
> > > > enabled.
> > > > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > > > common desire is to find the root cause, right?
> > > > > > >
> > > > > > >
> > > > > > > Yes of course that is the desire here indeed. I had assumed
> > > > > > > that the suspend/resume order was good because we have never
> > > > > > > seen any problems, but nonetheless it is always good to check.
> > > > > > > Using ftrace I enabled tracing of the appropriate
> > > > > > > suspend/resume functions and this is what I see ...
> > > > > > >
> > > > > > > # tracer: function
> > > > > > > #
> > > > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > > > #
> > > > > > > #_-=> irqs-off
> > > > > > > # 

RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-14 Thread Joakim Zhang

> -Original Message-
> From: Thierry Reding 
> Sent: 2021年4月14日 15:41
> To: Joakim Zhang 
> Cc: David S. Miller ; Jakub Kicinski ;
> Jon Hunter ; Giuseppe Cavallaro
> ; Alexandre Torgue ;
> Jose Abreu ; net...@vger.kernel.org; Linux Kernel
> Mailing List ; linux-tegra
> 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> On Wed, Apr 14, 2021 at 02:18:58AM +, Joakim Zhang wrote:
> >
> > > -Original Message-
> > > From: Thierry Reding 
> > > Sent: 2021年4月14日 0:07
> > > To: David S. Miller ; Jakub Kicinski
> > > 
> > > Cc: Joakim Zhang ; Jon Hunter
> > > ; Giuseppe Cavallaro ;
> > > Alexandre Torgue ; Jose Abreu
> > > ; net...@vger.kernel.org; Linux Kernel Mailing
> > > List ; linux-tegra
> > > 
> > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > when mac resume back
> > >
> > > On Tue, Apr 13, 2021 at 12:13:01PM +, Joakim Zhang wrote:
> > > >
> > > > Hi Jon,
> > > >
> > > > > -Original Message-
> > > > > From: Jon Hunter 
> > > > > Sent: 2021年4月13日 16:41
> > > > > To: Joakim Zhang ; Giuseppe Cavallaro
> > > > > ; Alexandre Torgue
> > > > > ; Jose Abreu 
> > > > > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > > > > ; linux-tegra
> > > > > ; Jakub Kicinski 
> > > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx
> > > > > buffers when mac resume back
> > > > >
> > > > >
> > > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > > >
> > > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > >>> In answer to your question, resuming from suspend does work
> > > > > >>> on this board without your change. We have been testing
> > > > > >>> suspend/resume now on this board since Linux v5.8 and so we
> > > > > >>> have the ability to bisect such regressions. So it is clear
> > > > > >>> to me that this is the change that caused
> > > > > this, but I am not sure why.
> > > > > >>
> > > > > >> Yes, I know this issue is regression caused by my patch. I
> > > > > >> just want to
> > > > > analyze the potential reasons. Due to the code change only
> > > > > related to the page recycle and reallocate.
> > > > > >> So I guess if this page operate need IOMMU works when IOMMU
> > > > > >> is
> > > enabled.
> > > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > > common desire is to find the root cause, right?
> > > > > >
> > > > > >
> > > > > > Yes of course that is the desire here indeed. I had assumed
> > > > > > that the suspend/resume order was good because we have never
> > > > > > seen any problems, but nonetheless it is always good to check.
> > > > > > Using ftrace I enabled tracing of the appropriate
> > > > > > suspend/resume functions and this is what I see ...
> > > > > >
> > > > > > # tracer: function
> > > > > > #
> > > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > > #
> > > > > > #_-=> irqs-off
> > > > > > #   / _=> need-resched
> > > > > > #  | / _---=> hardirq/softirq
> > > > > > #  || / _--=> preempt-depth
> > > > > > #  ||| / delay
> > > > > > #   TASK-PID CPU#     TIMESTAMP
> FUNCTION
> > > > > > #  | | |     | |
> > > > > >  rtcwake-748 [000] ...1   536.700777:
> > > > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > > > >  rtcwake-748 [000] ...1   536.735532:
> > > > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > > > >  rtcwake-748 [000] ...1   536.757290:
> > > > > arm_smmu_pm_

Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-14 Thread Thierry Reding
On Wed, Apr 14, 2021 at 02:18:58AM +, Joakim Zhang wrote:
> 
> > -Original Message-
> > From: Thierry Reding 
> > Sent: 2021年4月14日 0:07
> > To: David S. Miller ; Jakub Kicinski 
> > Cc: Joakim Zhang ; Jon Hunter
> > ; Giuseppe Cavallaro ;
> > Alexandre Torgue ; Jose Abreu
> > ; net...@vger.kernel.org; Linux Kernel Mailing List
> > ; linux-tegra 
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> > resume back
> > 
> > On Tue, Apr 13, 2021 at 12:13:01PM +, Joakim Zhang wrote:
> > >
> > > Hi Jon,
> > >
> > > > -Original Message-
> > > > From: Jon Hunter 
> > > > Sent: 2021年4月13日 16:41
> > > > To: Joakim Zhang ; Giuseppe Cavallaro
> > > > ; Alexandre Torgue
> > > > ; Jose Abreu 
> > > > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > > > ; linux-tegra
> > > > ; Jakub Kicinski 
> > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > > when mac resume back
> > > >
> > > >
> > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > >
> > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > >
> > > > > ...
> > > > >
> > > > >>> In answer to your question, resuming from suspend does work on
> > > > >>> this board without your change. We have been testing
> > > > >>> suspend/resume now on this board since Linux v5.8 and so we have
> > > > >>> the ability to bisect such regressions. So it is clear to me
> > > > >>> that this is the change that caused
> > > > this, but I am not sure why.
> > > > >>
> > > > >> Yes, I know this issue is regression caused by my patch. I just
> > > > >> want to
> > > > analyze the potential reasons. Due to the code change only related
> > > > to the page recycle and reallocate.
> > > > >> So I guess if this page operate need IOMMU works when IOMMU is
> > enabled.
> > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > common desire is to find the root cause, right?
> > > > >
> > > > >
> > > > > Yes of course that is the desire here indeed. I had assumed that
> > > > > the suspend/resume order was good because we have never seen any
> > > > > problems, but nonetheless it is always good to check. Using ftrace
> > > > > I enabled tracing of the appropriate suspend/resume functions and
> > > > > this is what I see ...
> > > > >
> > > > > # tracer: function
> > > > > #
> > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > #
> > > > > #_-=> irqs-off
> > > > > #   / _=> need-resched
> > > > > #  | / _---=> hardirq/softirq
> > > > > #  || / _--=> preempt-depth
> > > > > #  ||| / delay
> > > > > #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> > > > > #  | | |     | |
> > > > >  rtcwake-748 [000] ...1   536.700777:
> > > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > > >  rtcwake-748 [000] ...1   536.735532:
> > > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > > >  rtcwake-748 [000] ...1   536.757290:
> > > > arm_smmu_pm_resume <-platform_pm_resume
> > > > >  rtcwake-748 [003] ...1   536.856771:
> > > > stmmac_pltfr_resume <-platform_pm_resume
> > > > >
> > > > >
> > > > > So I don't see any ordering issues that could be causing this.
> > > >
> > > >
> > > > Another thing I have found is that for our platform, if the driver
> > > > for the ethernet PHY (in this case broadcom PHY) is enabled, then it
> > > > fails to resume but if I disable the PHY in the kernel
> > > > configuration, then resume works. I have found that if I move the
> > > > reinit of the RX buffers to before the startup of the phy, then it can 
> > > > resume
> > OK with the PHY enabled.
> > > >
> > 

RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Joakim Zhang

> -Original Message-
> From: Thierry Reding 
> Sent: 2021年4月14日 0:07
> To: David S. Miller ; Jakub Kicinski 
> Cc: Joakim Zhang ; Jon Hunter
> ; Giuseppe Cavallaro ;
> Alexandre Torgue ; Jose Abreu
> ; net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> On Tue, Apr 13, 2021 at 12:13:01PM +, Joakim Zhang wrote:
> >
> > Hi Jon,
> >
> > > -Original Message-
> > > From: Jon Hunter 
> > > Sent: 2021年4月13日 16:41
> > > To: Joakim Zhang ; Giuseppe Cavallaro
> > > ; Alexandre Torgue
> > > ; Jose Abreu 
> > > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > > ; linux-tegra
> > > ; Jakub Kicinski 
> > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > when mac resume back
> > >
> > >
> > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > >
> > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > >
> > > > ...
> > > >
> > > >>> In answer to your question, resuming from suspend does work on
> > > >>> this board without your change. We have been testing
> > > >>> suspend/resume now on this board since Linux v5.8 and so we have
> > > >>> the ability to bisect such regressions. So it is clear to me
> > > >>> that this is the change that caused
> > > this, but I am not sure why.
> > > >>
> > > >> Yes, I know this issue is regression caused by my patch. I just
> > > >> want to
> > > analyze the potential reasons. Due to the code change only related
> > > to the page recycle and reallocate.
> > > >> So I guess if this page operate need IOMMU works when IOMMU is
> enabled.
> > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > common desire is to find the root cause, right?
> > > >
> > > >
> > > > Yes of course that is the desire here indeed. I had assumed that
> > > > the suspend/resume order was good because we have never seen any
> > > > problems, but nonetheless it is always good to check. Using ftrace
> > > > I enabled tracing of the appropriate suspend/resume functions and
> > > > this is what I see ...
> > > >
> > > > # tracer: function
> > > > #
> > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > #
> > > > #_-=> irqs-off
> > > > #   / _=> need-resched
> > > > #  | / _---=> hardirq/softirq
> > > > #  || / _--=> preempt-depth
> > > > #  ||| / delay
> > > > #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> > > > #  | | |     | |
> > > >  rtcwake-748 [000] ...1   536.700777:
> > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > >  rtcwake-748 [000] ...1   536.735532:
> > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > >  rtcwake-748 [000] ...1   536.757290:
> > > arm_smmu_pm_resume <-platform_pm_resume
> > > >  rtcwake-748 [003] ...1   536.856771:
> > > stmmac_pltfr_resume <-platform_pm_resume
> > > >
> > > >
> > > > So I don't see any ordering issues that could be causing this.
> > >
> > >
> > > Another thing I have found is that for our platform, if the driver
> > > for the ethernet PHY (in this case broadcom PHY) is enabled, then it
> > > fails to resume but if I disable the PHY in the kernel
> > > configuration, then resume works. I have found that if I move the
> > > reinit of the RX buffers to before the startup of the phy, then it can 
> > > resume
> OK with the PHY enabled.
> > >
> > > Does the following work for you? Does your platform use a specific
> > > ethernet PHY driver?
> >
> > I am also looking into this issue these days, we use the Realtek RTL8211FDI
> PHY, driver is drivers/net/phy/realtek.c.
> >
> > For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock
> from PHY, so we need phylink_start before MAC.
> >
> > I will test below code change tomorrow to see if it can work at my side, 
>

Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Jakub Kicinski
On Tue, 13 Apr 2021 18:06:39 +0200 Thierry Reding wrote:
> given where we are in the release cycle, I think it'd be best to revert
> commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
> back") for now.
> 
> To summarize the discussion: the patch was meant as a workaround to fix
> an occasional suspend/resume failure on one board that was not fully
> root caused, and ends up causing fully reproducible suspend/resume
> failures on at least one other board.
> 
> Joakim is looking at an alternative solution and Jon and I can provide
> testing from the Tegra side for any fixes.
> 
> Do you want me to send a revert patch or can you revert directly on top
> of your tree?

Please send a revert with the justification in the commit log, and 
perhaps also:

Link: https://lore.kernel.org/r/708edb92-a5df-ecc4-3126-5ab36707e...@nvidia.com/

for those who want to dig deeper in the history. Make sure you 
CC relevant folks so they can review and express their opinions.

Thanks!


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Thierry Reding
On Tue, Apr 13, 2021 at 12:13:01PM +, Joakim Zhang wrote:
> 
> Hi Jon,
> 
> > -Original Message-
> > From: Jon Hunter 
> > Sent: 2021年4月13日 16:41
> > To: Joakim Zhang ; Giuseppe Cavallaro
> > ; Alexandre Torgue ;
> > Jose Abreu 
> > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > ; linux-tegra ;
> > Jakub Kicinski 
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> > resume back
> > 
> > 
> > On 01/04/2021 17:28, Jon Hunter wrote:
> > >
> > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > >
> > > ...
> > >
> > >>> In answer to your question, resuming from suspend does work on this
> > >>> board without your change. We have been testing suspend/resume now
> > >>> on this board since Linux v5.8 and so we have the ability to bisect
> > >>> such regressions. So it is clear to me that this is the change that 
> > >>> caused
> > this, but I am not sure why.
> > >>
> > >> Yes, I know this issue is regression caused by my patch. I just want to
> > analyze the potential reasons. Due to the code change only related to the 
> > page
> > recycle and reallocate.
> > >> So I guess if this page operate need IOMMU works when IOMMU is enabled.
> > Could you help check if IOMMU driver resume before STMMAC? Our common
> > desire is to find the root cause, right?
> > >
> > >
> > > Yes of course that is the desire here indeed. I had assumed that the
> > > suspend/resume order was good because we have never seen any problems,
> > > but nonetheless it is always good to check. Using ftrace I enabled
> > > tracing of the appropriate suspend/resume functions and this is what I
> > > see ...
> > >
> > > # tracer: function
> > > #
> > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > #
> > > #_-=> irqs-off
> > > #   / _=> need-resched
> > > #  | / _---=> hardirq/softirq
> > > #  || / _--=> preempt-depth
> > > #  ||| / delay
> > > #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> > > #  | | |     | |
> > >  rtcwake-748 [000] ...1   536.700777:
> > stmmac_pltfr_suspend <-platform_pm_suspend
> > >  rtcwake-748 [000] ...1   536.735532:
> > arm_smmu_pm_suspend <-platform_pm_suspend
> > >  rtcwake-748 [000] ...1   536.757290:
> > arm_smmu_pm_resume <-platform_pm_resume
> > >  rtcwake-748 [003] ...1   536.856771:
> > stmmac_pltfr_resume <-platform_pm_resume
> > >
> > >
> > > So I don't see any ordering issues that could be causing this.
> > 
> > 
> > Another thing I have found is that for our platform, if the driver for the 
> > ethernet
> > PHY (in this case broadcom PHY) is enabled, then it fails to resume but if I
> > disable the PHY in the kernel configuration, then resume works. I have found
> > that if I move the reinit of the RX buffers to before the startup of the 
> > phy, then
> > it can resume OK with the PHY enabled.
> > 
> > Does the following work for you? Does your platform use a specific ethernet
> > PHY driver?
> 
> I am also looking into this issue these days, we use the Realtek RTL8211FDI 
> PHY, driver is drivers/net/phy/realtek.c.
> 
> For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock 
> from PHY, so we need phylink_start before MAC.
> 
> I will test below code change tomorrow to see if it can work at my side, 
> since it is only re-init memory, need not RXC clock.
> 
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 208cae344ffa..071d15d86dbe 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> > return ret;
> > }
> > +   rtnl_lock();
> > +   mutex_lock(>lock);
> > +   stmmac_reinit_rx_buffers(priv);
> > +   mutex_unlock(>lock);
> > +
> > if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > -   rtnl_lock();
> >  

RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Joakim Zhang

Hi Jon,

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年4月13日 16:41
> To: Joakim Zhang ; Giuseppe Cavallaro
> ; Alexandre Torgue ;
> Jose Abreu 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 01/04/2021 17:28, Jon Hunter wrote:
> >
> > On 31/03/2021 12:41, Joakim Zhang wrote:
> >
> > ...
> >
> >>> In answer to your question, resuming from suspend does work on this
> >>> board without your change. We have been testing suspend/resume now
> >>> on this board since Linux v5.8 and so we have the ability to bisect
> >>> such regressions. So it is clear to me that this is the change that caused
> this, but I am not sure why.
> >>
> >> Yes, I know this issue is regression caused by my patch. I just want to
> analyze the potential reasons. Due to the code change only related to the page
> recycle and reallocate.
> >> So I guess if this page operate need IOMMU works when IOMMU is enabled.
> Could you help check if IOMMU driver resume before STMMAC? Our common
> desire is to find the root cause, right?
> >
> >
> > Yes of course that is the desire here indeed. I had assumed that the
> > suspend/resume order was good because we have never seen any problems,
> > but nonetheless it is always good to check. Using ftrace I enabled
> > tracing of the appropriate suspend/resume functions and this is what I
> > see ...
> >
> > # tracer: function
> > #
> > # entries-in-buffer/entries-written: 4/4   #P:6
> > #
> > #_-=> irqs-off
> > #   / _=> need-resched
> > #  | / _---=> hardirq/softirq
> > #  || / _--=> preempt-depth
> > #  ||| / delay
> > #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> > #  | | |     | |
> >  rtcwake-748 [000] ...1   536.700777:
> stmmac_pltfr_suspend <-platform_pm_suspend
> >  rtcwake-748 [000] ...1   536.735532:
> arm_smmu_pm_suspend <-platform_pm_suspend
> >  rtcwake-748 [000] ...1   536.757290:
> arm_smmu_pm_resume <-platform_pm_resume
> >  rtcwake-748 [003] ...1   536.856771:
> stmmac_pltfr_resume <-platform_pm_resume
> >
> >
> > So I don't see any ordering issues that could be causing this.
> 
> 
> Another thing I have found is that for our platform, if the driver for the 
> ethernet
> PHY (in this case broadcom PHY) is enabled, then it fails to resume but if I
> disable the PHY in the kernel configuration, then resume works. I have found
> that if I move the reinit of the RX buffers to before the startup of the phy, 
> then
> it can resume OK with the PHY enabled.
> 
> Does the following work for you? Does your platform use a specific ethernet
> PHY driver?

I am also looking into this issue these days, we use the Realtek RTL8211FDI 
PHY, driver is drivers/net/phy/realtek.c.

For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock from 
PHY, so we need phylink_start before MAC.

I will test below code change tomorrow to see if it can work at my side, since 
it is only re-init memory, need not RXC clock.


> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 208cae344ffa..071d15d86dbe 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> return ret;
> }
> +   rtnl_lock();
> +   mutex_lock(>lock);
> +   stmmac_reinit_rx_buffers(priv);
> +   mutex_unlock(>lock);
> +
> if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> -   rtnl_lock();
> phylink_start(priv->phylink);
> /* We may have called phylink_speed_down before */
> phylink_speed_up(priv->phylink);
> -   rtnl_unlock();
> }
> -   rtnl_lock();
> mutex_lock(>lock);
> stmmac_reset_queues_param(priv);
> -   stmmac_reinit_rx_buffers(priv);
> stmmac_free_tx_skbufs(priv);
> stmmac_clear_descriptors(priv);
> 
> 
> It is still not clear to us why the existing call to
> stmmac_clear_descriptors() is not sufficient to fix your problem.

During sus

Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Jon Hunter


On 01/04/2021 17:28, Jon Hunter wrote:
> 
> On 31/03/2021 12:41, Joakim Zhang wrote:
> 
> ...
> 
>>> In answer to your question, resuming from suspend does work on this board
>>> without your change. We have been testing suspend/resume now on this board
>>> since Linux v5.8 and so we have the ability to bisect such regressions. So 
>>> it is
>>> clear to me that this is the change that caused this, but I am not sure why.
>>
>> Yes, I know this issue is regression caused by my patch. I just want to 
>> analyze the potential reasons. Due to the code change only related to the 
>> page recycle and reallocate.
>> So I guess if this page operate need IOMMU works when IOMMU is enabled. 
>> Could you help check if IOMMU driver resume before STMMAC? Our common desire 
>> is to find the root cause, right?
> 
> 
> Yes of course that is the desire here indeed. I had assumed that the
> suspend/resume order was good because we have never seen any problems,
> but nonetheless it is always good to check. Using ftrace I enabled
> tracing of the appropriate suspend/resume functions and this is what
> I see ...
> 
> # tracer: function
> #
> # entries-in-buffer/entries-written: 4/4   #P:6
> #
> #_-=> irqs-off
> #   / _=> need-resched
> #  | / _---=> hardirq/softirq
> #  || / _--=> preempt-depth
> #  ||| / delay
> #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> #  | | |     | |
>  rtcwake-748 [000] ...1   536.700777: stmmac_pltfr_suspend 
> <-platform_pm_suspend
>  rtcwake-748 [000] ...1   536.735532: arm_smmu_pm_suspend 
> <-platform_pm_suspend
>  rtcwake-748 [000] ...1   536.757290: arm_smmu_pm_resume 
> <-platform_pm_resume
>  rtcwake-748 [003] ...1   536.856771: stmmac_pltfr_resume 
> <-platform_pm_resume
> 
> 
> So I don't see any ordering issues that could be causing this. 


Another thing I have found is that for our platform, if the driver for
the ethernet PHY (in this case broadcom PHY) is enabled, then it fails
to resume but if I disable the PHY in the kernel configuration, then
resume works. I have found that if I move the reinit of the RX buffers
to before the startup of the phy, then it can resume OK with the PHY
enabled.

Does the following work for you? Does your platform use a specific
ethernet PHY driver?

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 208cae344ffa..071d15d86dbe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
return ret;
}
+   rtnl_lock();
+   mutex_lock(>lock);
+   stmmac_reinit_rx_buffers(priv);
+   mutex_unlock(>lock);
+
if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-   rtnl_lock();
phylink_start(priv->phylink);
/* We may have called phylink_speed_down before */
phylink_speed_up(priv->phylink);
-   rtnl_unlock();
}
-   rtnl_lock();
mutex_lock(>lock);
stmmac_reset_queues_param(priv);
-   stmmac_reinit_rx_buffers(priv);
stmmac_free_tx_skbufs(priv);
stmmac_clear_descriptors(priv);


It is still not clear to us why the existing call to
stmmac_clear_descriptors() is not sufficient to fix your problem.

How often does the issue you see occur?

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-01 Thread Jon Hunter


On 31/03/2021 12:41, Joakim Zhang wrote:

...

>> In answer to your question, resuming from suspend does work on this board
>> without your change. We have been testing suspend/resume now on this board
>> since Linux v5.8 and so we have the ability to bisect such regressions. So 
>> it is
>> clear to me that this is the change that caused this, but I am not sure why.
> 
> Yes, I know this issue is regression caused by my patch. I just want to 
> analyze the potential reasons. Due to the code change only related to the 
> page recycle and reallocate.
> So I guess if this page operate need IOMMU works when IOMMU is enabled. Could 
> you help check if IOMMU driver resume before STMMAC? Our common desire is to 
> find the root cause, right?


Yes of course that is the desire here indeed. I had assumed that the
suspend/resume order was good because we have never seen any problems,
but nonetheless it is always good to check. Using ftrace I enabled
tracing of the appropriate suspend/resume functions and this is what
I see ...

# tracer: function
#
# entries-in-buffer/entries-written: 4/4   #P:6
#
#_-=> irqs-off
#   / _=> need-resched
#  | / _---=> hardirq/softirq
#  || / _--=> preempt-depth
#  ||| / delay
#   TASK-PID CPU#     TIMESTAMP  FUNCTION
#  | | |     | |
 rtcwake-748 [000] ...1   536.700777: stmmac_pltfr_suspend 
<-platform_pm_suspend
 rtcwake-748 [000] ...1   536.735532: arm_smmu_pm_suspend 
<-platform_pm_suspend
 rtcwake-748 [000] ...1   536.757290: arm_smmu_pm_resume 
<-platform_pm_resume
 rtcwake-748 [003] ...1   536.856771: stmmac_pltfr_resume 
<-platform_pm_resume


So I don't see any ordering issues that could be causing this. 

Jon

-- 
nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Joakim Zhang

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年3月31日 19:29
> To: Joakim Zhang ; Giuseppe Cavallaro
> ; Alexandre Torgue ;
> Jose Abreu 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 31/03/2021 12:10, Joakim Zhang wrote:
> 
> ...
> 
> >>>>>>>> You mean one of your boards? Does other boards with STMMAC can
> >>>>>>>> work
> >>>>>>> fine?
> >>>>>>>
> >>>>>>> We have two devices with the STMMAC and one works OK and the
> >>>>>>> other
> >>>>> fails.
> >>>>>>> They are different generation of device and so there could be
> >>>>>>> some architectural differences which is causing this to only be
> >>>>>>> seen on one
> >>> device.
> >>>>>> It's really strange, but I also don't know what architectural
> >>>>>> differences could
> >>>>> affect this. Sorry.
> >>>
> >>>
> >>> I realised that for the board which fails after this change is made,
> >>> it has the IOMMU enabled. The other board does not at the moment
> >>> (although work is in progress to enable). If I add
> >>> 'iommu.passthrough=1' to cmdline for the failing board, then it
> >>> works again. So in my case, the problem is linked to the IOMMU being
> enabled.
> >>>
> >>> Does you platform enable the IOMMU?
> >>
> >> Hi Jon,
> >>
> >> There is no IOMMU hardware available on our boards. But why IOMMU
> >> would affect it during suspend/resume, and no problem in normal mode?
> >
> > One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support
> suspend/resume, is it possible iommu resume back after stmmac?
> 
> 
> This board is the tegra186-p2771- (Jetson TX2) and uses the
> arm,mmu-500 and not the above driver.

OK.

> In answer to your question, resuming from suspend does work on this board
> without your change. We have been testing suspend/resume now on this board
> since Linux v5.8 and so we have the ability to bisect such regressions. So it 
> is
> clear to me that this is the change that caused this, but I am not sure why.

Yes, I know this issue is regression caused by my patch. I just want to analyze 
the potential reasons. Due to the code change only related to the page recycle 
and reallocate.
So I guess if this page operate need IOMMU works when IOMMU is enabled. Could 
you help check if IOMMU driver resume before STMMAC? Our common desire is to 
find the root cause, right?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 
> --
> nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Jon Hunter


On 31/03/2021 12:10, Joakim Zhang wrote:

...

 You mean one of your boards? Does other boards with STMMAC can
 work
>>> fine?
>>>
>>> We have two devices with the STMMAC and one works OK and the
>>> other
> fails.
>>> They are different generation of device and so there could be
>>> some architectural differences which is causing this to only be
>>> seen on one
>>> device.
>> It's really strange, but I also don't know what architectural
>> differences could
> affect this. Sorry.
>>>
>>>
>>> I realised that for the board which fails after this change is made,
>>> it has the IOMMU enabled. The other board does not at the moment
>>> (although work is in progress to enable). If I add
>>> 'iommu.passthrough=1' to cmdline for the failing board, then it works
>>> again. So in my case, the problem is linked to the IOMMU being enabled.
>>>
>>> Does you platform enable the IOMMU?
>>
>> Hi Jon,
>>
>> There is no IOMMU hardware available on our boards. But why IOMMU would
>> affect it during suspend/resume, and no problem in normal mode?
> 
> One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support 
> suspend/resume, is it possible iommu resume back after stmmac?


This board is the tegra186-p2771- (Jetson TX2) and uses the
arm,mmu-500 and not the above driver.

In answer to your question, resuming from suspend does work on this
board without your change. We have been testing suspend/resume now on
this board since Linux v5.8 and so we have the ability to bisect such
regressions. So it is clear to me that this is the change that caused
this, but I am not sure why.

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Jon Hunter


On 31/03/2021 08:43, Joakim Zhang wrote:

...

>>> You mean one of your boards? Does other boards with STMMAC can
>>> work
>> fine?
>>
>> We have two devices with the STMMAC and one works OK and the other
 fails.
>> They are different generation of device and so there could be some
>> architectural differences which is causing this to only be seen on one
>> device.
> It's really strange, but I also don't know what architectural
> differences could
 affect this. Sorry.
>>
>>
>> I realised that for the board which fails after this change is made, it has 
>> the
>> IOMMU enabled. The other board does not at the moment (although work is in
>> progress to enable). If I add 'iommu.passthrough=1' to cmdline for the 
>> failing
>> board, then it works again. So in my case, the problem is linked to the IOMMU
>> being enabled.
>>
>> Does you platform enable the IOMMU?
> 
> Hi Jon,
> 
> There is no IOMMU hardware available on our boards. But why IOMMU would 
> affect it during suspend/resume, and no problem in normal mode?


I am not sure either and I don't see anything obvious.

Guiseppe, Alexandre, Jose, do you see anything that is wrong with
Joakim's change 9c63faaa931e? This is completely breaking resume from
suspend on one of our boards and I would like to get your inputs?

Thanks
Jon

-- 
nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Joakim Zhang

> -Original Message-
> From: Joakim Zhang 
> Sent: 2021年3月31日 15:44
> To: Jon Hunter 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> > -Original Message-
> > From: Jon Hunter 
> > Sent: 2021年3月30日 20:51
> > To: Joakim Zhang 
> > Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> > ; linux-tegra
> > ; Jakub Kicinski 
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > when mac resume back
> >
> >
> >
> > On 25/03/2021 08:12, Joakim Zhang wrote:
> >
> > ...
> >
> > >>>>> You mean one of your boards? Does other boards with STMMAC can
> > >>>>> work
> > >>>> fine?
> > >>>>
> > >>>> We have two devices with the STMMAC and one works OK and the
> > >>>> other
> > >> fails.
> > >>>> They are different generation of device and so there could be
> > >>>> some architectural differences which is causing this to only be
> > >>>> seen on one
> > device.
> > >>> It's really strange, but I also don't know what architectural
> > >>> differences could
> > >> affect this. Sorry.
> >
> >
> > I realised that for the board which fails after this change is made,
> > it has the IOMMU enabled. The other board does not at the moment
> > (although work is in progress to enable). If I add
> > 'iommu.passthrough=1' to cmdline for the failing board, then it works
> > again. So in my case, the problem is linked to the IOMMU being enabled.
> >
> > Does you platform enable the IOMMU?
> 
> Hi Jon,
> 
> There is no IOMMU hardware available on our boards. But why IOMMU would
> affect it during suspend/resume, and no problem in normal mode?

One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support 
suspend/resume, is it possible iommu resume back after stmmac?

Best Regards,
Joakim Zhang
> Best Regards,
> Joakim Zhang
> > Thanks
> > Jon
> >
> > --
> > nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Joakim Zhang

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年3月30日 20:51
> To: Joakim Zhang 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> 
> On 25/03/2021 08:12, Joakim Zhang wrote:
> 
> ...
> 
> >>>>> You mean one of your boards? Does other boards with STMMAC can
> >>>>> work
> >>>> fine?
> >>>>
> >>>> We have two devices with the STMMAC and one works OK and the other
> >> fails.
> >>>> They are different generation of device and so there could be some
> >>>> architectural differences which is causing this to only be seen on one
> device.
> >>> It's really strange, but I also don't know what architectural
> >>> differences could
> >> affect this. Sorry.
> 
> 
> I realised that for the board which fails after this change is made, it has 
> the
> IOMMU enabled. The other board does not at the moment (although work is in
> progress to enable). If I add 'iommu.passthrough=1' to cmdline for the failing
> board, then it works again. So in my case, the problem is linked to the IOMMU
> being enabled.
> 
> Does you platform enable the IOMMU?

Hi Jon,

There is no IOMMU hardware available on our boards. But why IOMMU would affect 
it during suspend/resume, and no problem in normal mode?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 
> --
> nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-30 Thread Jon Hunter



On 25/03/2021 08:12, Joakim Zhang wrote:

...

> You mean one of your boards? Does other boards with STMMAC can work
 fine?

 We have two devices with the STMMAC and one works OK and the other
>> fails.
 They are different generation of device and so there could be some
 architectural differences which is causing this to only be seen on one 
 device.
>>> It's really strange, but I also don't know what architectural differences 
>>> could
>> affect this. Sorry.


I realised that for the board which fails after this change is made, it
has the IOMMU enabled. The other board does not at the moment (although
work is in progress to enable). If I add 'iommu.passthrough=1' to
cmdline for the failing board, then it works again. So in my case, the
problem is linked to the IOMMU being enabled.

Does you platform enable the IOMMU?

Thanks
Jon

-- 
nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-25 Thread Joakim Zhang

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年3月25日 16:01
> To: Joakim Zhang 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 25/03/2021 07:53, Joakim Zhang wrote:
> >
> >> -Original Message-
> >> From: Jon Hunter 
> >> Sent: 2021年3月24日 20:39
> >> To: Joakim Zhang 
> >> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> >> ; linux-tegra
> >> ; Jakub Kicinski 
> >> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> >> when mac resume back
> >>
> >>
> >>
> >> On 24/03/2021 12:20, Joakim Zhang wrote:
> >>
> >> ...
> >>
> >>> Sorry for this breakage at your side.
> >>>
> >>> You mean one of your boards? Does other boards with STMMAC can work
> >> fine?
> >>
> >> We have two devices with the STMMAC and one works OK and the other
> fails.
> >> They are different generation of device and so there could be some
> >> architectural differences which is causing this to only be seen on one 
> >> device.
> > It's really strange, but I also don't know what architectural differences 
> > could
> affect this. Sorry.
> 
> 
> Maybe caching somewhere? In other words, could there be any cache flushing
> that we are missing here?
Have no idea, have not account into such case.

> >>> We do daily test with NFS to mount rootfs, on issue found. And I add
> >>> this
> >> patch at the resume patch, and on error check, this should not break
> suspend.
> >>> I even did the overnight stress test, there is no issue found.
> >>>
> >>> Could you please do more test to see where the issue happen?
> >>
> >> The issue occurs 100% of the time on the failing board and always on
> >> the first resume from suspend. Is there any more debug I can enable
> >> to track down what the problem is?
> >>
> >
> > As commit messages described, the patch aims to re-init rx buffers
> > address, since the address is not fixed, so I only can recycle and then
> re-allocate all of them. The page pool is allocated once when open the net
> device.
> >
> > Could you please debug if it fails at some functions, such as
> page_pool_dev_alloc_pages() ?
> 
> 
> Yes that was the first thing I tried, but no obvious failures from allocating 
> the
> pools.
> 
> Are you certain that the problem you are seeing, that is being fixed by this
> change, is generic to all devices? The commit message states that 'descriptor
> write back by DMA could exhibit unusual behavior', is this a known issue in 
> the
> STMMAC controller? If so does this impact all versions and what is the actual
> problem?

Yes, I confirm this patch fix issue at my side. It should not be a generic, it 
can reproduce at one of our boards.
To be honest, I have not found the root cause, this should be a workaround, I 
upstream it since I think it will not affect others which don't suffer from 
this.

Best Regards,
Joakim Zhang
> Jon
> 
> --
> nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-25 Thread Jon Hunter


On 25/03/2021 07:53, Joakim Zhang wrote:
> 
>> -Original Message-
>> From: Jon Hunter 
>> Sent: 2021年3月24日 20:39
>> To: Joakim Zhang 
>> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
>> ; linux-tegra ;
>> Jakub Kicinski 
>> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
>> resume back
>>
>>
>>
>> On 24/03/2021 12:20, Joakim Zhang wrote:
>>
>> ...
>>
>>> Sorry for this breakage at your side.
>>>
>>> You mean one of your boards? Does other boards with STMMAC can work
>> fine?
>>
>> We have two devices with the STMMAC and one works OK and the other fails.
>> They are different generation of device and so there could be some
>> architectural differences which is causing this to only be seen on one 
>> device.
> It's really strange, but I also don't know what architectural differences 
> could affect this. Sorry.


Maybe caching somewhere? In other words, could there be any cache
flushing that we are missing here?

>>> We do daily test with NFS to mount rootfs, on issue found. And I add this
>> patch at the resume patch, and on error check, this should not break suspend.
>>> I even did the overnight stress test, there is no issue found.
>>>
>>> Could you please do more test to see where the issue happen?
>>
>> The issue occurs 100% of the time on the failing board and always on the 
>> first
>> resume from suspend. Is there any more debug I can enable to track down
>> what the problem is?
>>
> 
> As commit messages described, the patch aims to re-init rx buffers address, 
> since the address is not fixed, so I only can 
> recycle and then re-allocate all of them. The page pool is allocated once 
> when open the net device.
> 
> Could you please debug if it fails at some functions, such as 
> page_pool_dev_alloc_pages() ?


Yes that was the first thing I tried, but no obvious failures from
allocating the pools.

Are you certain that the problem you are seeing, that is being fixed by
this change, is generic to all devices? The commit message states that
'descriptor write back by DMA could exhibit unusual behavior', is this a
known issue in the STMMAC controller? If so does this impact all
versions and what is the actual problem?

Jon

-- 
nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-25 Thread Joakim Zhang

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年3月24日 20:39
> To: Joakim Zhang 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> 
> On 24/03/2021 12:20, Joakim Zhang wrote:
> 
> ...
> 
> > Sorry for this breakage at your side.
> >
> > You mean one of your boards? Does other boards with STMMAC can work
> fine?
> 
> We have two devices with the STMMAC and one works OK and the other fails.
> They are different generation of device and so there could be some
> architectural differences which is causing this to only be seen on one device.
It's really strange, but I also don't know what architectural differences could 
affect this. Sorry.

> > We do daily test with NFS to mount rootfs, on issue found. And I add this
> patch at the resume patch, and on error check, this should not break suspend.
> > I even did the overnight stress test, there is no issue found.
> >
> > Could you please do more test to see where the issue happen?
> 
> The issue occurs 100% of the time on the failing board and always on the first
> resume from suspend. Is there any more debug I can enable to track down
> what the problem is?
> 

As commit messages described, the patch aims to re-init rx buffers address, 
since the address is not fixed, so I only can 
recycle and then re-allocate all of them. The page pool is allocated once when 
open the net device.

Could you please debug if it fails at some functions, such as 
page_pool_dev_alloc_pages() ?

Best Regards,
Joakim Zhang
> Jon
> 
> --
> nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-24 Thread Jon Hunter



On 24/03/2021 12:20, Joakim Zhang wrote:

...

> Sorry for this breakage at your side.
> 
> You mean one of your boards? Does other boards with STMMAC can work fine?

We have two devices with the STMMAC and one works OK and the other
fails. They are different generation of device and so there could be
some architectural differences which is causing this to only be seen on
one device.

> We do daily test with NFS to mount rootfs, on issue found. And I add this 
> patch at the resume patch, and on error check, this should not break suspend.
> I even did the overnight stress test, there is no issue found.
> 
> Could you please do more test to see where the issue happen?

The issue occurs 100% of the time on the failing board and always on the
first resume from suspend. Is there any more debug I can enable to track
down what the problem is?

Jon

-- 
nvpublic


RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-24 Thread Joakim Zhang

> -Original Message-
> From: Jon Hunter 
> Sent: 2021年3月24日 18:51
> To: Joakim Zhang 
> Cc: net...@vger.kernel.org; Linux Kernel Mailing List
> ; linux-tegra ;
> Jakub Kicinski 
> Subject: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> Hi Joakim,
> 
> Starting with v5.12-rc3 I noticed that one of our boards, Tegra186 Jetson TX2,
> was not long resuming from suspend. Bisect points to commit 9c63faaa931e
> ("net: stmmac: re-init rx buffers when mac resume back") and reverting this on
> top of mainline fixes the problem.
> 
> Interestingly, the board appears to partially resume from suspend and I see
> ethernet appear to resume ...
> 
>  dwc-eth-dwmac 249.ethernet eth0: configuring for phy/rgmii link  mode
>  dwmac4: Master AXI performs any burst length  dwc-eth-dwmac
> 249.ethernet eth0: No Safety Features support found  dwc-eth-dwmac
> 249.ethernet eth0: Link is Up - 1Gbps/Full - flow  control rx/tx
> 
> I don't see any crash, but then it hangs at some point. Please note that this
> board is using NFS for mounting the rootfs.
> 
> Let me know if there is any more info I can provide or tests I can run.

Hi Jon,

Sorry for this breakage at your side.

You mean one of your boards? Does other boards with STMMAC can work fine?

We do daily test with NFS to mount rootfs, on issue found. And I add this patch 
at the resume patch, and on error check, this should not break suspend.
I even did the overnight stress test, there is no issue found.

Could you please do more test to see where the issue happen?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 



Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-24 Thread Jon Hunter
Hi Joakim,

Starting with v5.12-rc3 I noticed that one of our boards, Tegra186
Jetson TX2, was not long resuming from suspend. Bisect points to commit
9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back")
and reverting this on top of mainline fixes the problem.

Interestingly, the board appears to partially resume from suspend and I
see ethernet appear to resume ...

 dwc-eth-dwmac 249.ethernet eth0: configuring for phy/rgmii link
 mode
 dwmac4: Master AXI performs any burst length
 dwc-eth-dwmac 249.ethernet eth0: No Safety Features support found
 dwc-eth-dwmac 249.ethernet eth0: Link is Up - 1Gbps/Full - flow
 control rx/tx

I don't see any crash, but then it hangs at some point. Please note that
this board is using NFS for mounting the rootfs.

Let me know if there is any more info I can provide or tests I can run.

Thanks
Jon