Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-17 Thread Ian Kumlien
On Thu, Dec 17, 2020 at 12:21 AM Bjorn Helgaas  wrote:
>
> On Wed, Dec 16, 2020 at 12:20:53PM +0100, Ian Kumlien wrote:
> > On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas  wrote:
> > > On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > > > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas  
> > > > wrote:
> > > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  
> > > > > > wrote:
> > > > >
> > > > > > > If you're interested, you could probably unload the Realtek 
> > > > > > > drivers,
> > > > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) 
> > > > > > > bit
> > > > > > > in 02:04.0, e.g.,
> > > > > > >
> > > > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > > > >
> > > > > > > That should take 04:00.x out of the picture.
> > > > > >
> > > > > > Didn't actually change the behaviour, I'm suspecting an errata for 
> > > > > > AMD pcie...
> > > > > >
> > > > > > So did this, with unpatched kernel:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > > > >
> > > > > > and:
> > > > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > > > >
> > > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > > > pleased that it seems to be working as intended.
> > > >
> > > > It was nice to find it for easy disabling :)
> > > >
> > > > > > and:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> > > > > >
> > > > > > But this only confirms that the fix I experience is a side effect.
> > > > > >
> > > > > > The original code is still wrong :)
> > > > >
> > > > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > > > add this and a dmesg log to the bugzilla?  It seems like other people
> > > > > should be seeing the same problem, so I'm hoping to grub around on the
> > > > > web to see if there are similar reports involving these devices.
> > > >
> 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-16 Thread Bjorn Helgaas
On Wed, Dec 16, 2020 at 12:20:53PM +0100, Ian Kumlien wrote:
> On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas  wrote:
> > On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas  wrote:
> > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  
> > > > > wrote:
> > > >
> > > > > > If you're interested, you could probably unload the Realtek drivers,
> > > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > > > in 02:04.0, e.g.,
> > > > > >
> > > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > > >
> > > > > > That should take 04:00.x out of the picture.
> > > > >
> > > > > Didn't actually change the behaviour, I'm suspecting an errata for 
> > > > > AMD pcie...
> > > > >
> > > > > So did this, with unpatched kernel:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > > >
> > > > > and:
> > > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > > >
> > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > > pleased that it seems to be working as intended.
> > >
> > > It was nice to find it for easy disabling :)
> > >
> > > > > and:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> > > > >
> > > > > But this only confirms that the fix I experience is a side effect.
> > > > >
> > > > > The original code is still wrong :)
> > > >
> > > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > > add this and a dmesg log to the bugzilla?  It seems like other people
> > > > should be seeing the same problem, so I'm hoping to grub around on the
> > > > web to see if there are similar reports involving these devices.
> > >
> > > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
> >
> > Possible similar issues:
> >
> >   https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
> >   https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-16 Thread Ian Kumlien
On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas  wrote:
>
> On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas  wrote:
> > >
> > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  
> > > > wrote:
> > >
> > > > > If you're interested, you could probably unload the Realtek drivers,
> > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > > in 02:04.0, e.g.,
> > > > >
> > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > >
> > > > > That should take 04:00.x out of the picture.
> > > >
> > > > Didn't actually change the behaviour, I'm suspecting an errata for AMD 
> > > > pcie...
> > > >
> > > > So did this, with unpatched kernel:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > >
> > > > and:
> > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > >
> > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > pleased that it seems to be working as intended.
> >
> > It was nice to find it for easy disabling :)
> >
> > > > and:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> > > >
> > > > But this only confirms that the fix I experience is a side effect.
> > > >
> > > > The original code is still wrong :)
> > >
> > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > add this and a dmesg log to the bugzilla?  It seems like other people
> > > should be seeing the same problem, so I'm hoping to grub around on the
> > > web to see if there are similar reports involving these devices.
> >
> > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
>
> Possible similar issues:
>
>   https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
>   https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)

Could be. I suspect we need both a workaround (is there a quirk for
"reporting wrong latency"?) and the patches.

> > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > >
> > > Here's one that is superficially similar:
> > > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > > here advertises <64us L1 exit latency instead of the <32us latency
> 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-15 Thread Bjorn Helgaas
On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas  wrote:
> >
> > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  wrote:
> >
> > > > If you're interested, you could probably unload the Realtek drivers,
> > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > in 02:04.0, e.g.,
> > > >
> > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > >
> > > > That should take 04:00.x out of the picture.
> > >
> > > Didn't actually change the behaviour, I'm suspecting an errata for AMD 
> > > pcie...
> > >
> > > So did this, with unpatched kernel:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > >
> > > and:
> > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> >
> > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > pleased that it seems to be working as intended.
> 
> It was nice to find it for easy disabling :)
> 
> > > and:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> > >
> > > But this only confirms that the fix I experience is a side effect.
> > >
> > > The original code is still wrong :)
> >
> > What exactly is this machine?  Brand, model, config?  Maybe you could
> > add this and a dmesg log to the bugzilla?  It seems like other people
> > should be seeing the same problem, so I'm hoping to grub around on the
> > web to see if there are similar reports involving these devices.
> 
> ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X

Possible similar issues:

  https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
  https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)

> > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> >
> > Here's one that is superficially similar:
> > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > here advertises <64us L1 exit latency instead of the <32us latency
> > your switch advertises.  Of course, I can't tell if it's exactly the
> > same switch.
> 
> Same chipset it seems
> 
> I'm running bios version:
> Version: 2206
> Release Date: 08/13/2020
> 
> And the latest is:
> Version 3003
> 2020/12/07
> 
> Will test upgrading that as well, but it could be that they report the
> incorrect latency of the 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-15 Thread Ian Kumlien
On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas  wrote:
>
> On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  wrote:
>
> > > If you're interested, you could probably unload the Realtek drivers,
> > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > in 02:04.0, e.g.,
> > >
> > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > >   # echo 1 > $RT/0000:04:00.0/remove
> > >   # echo 1 > $RT/0000:04:00.1/remove
> > >   # echo 1 > $RT/0000:04:00.2/remove
> > >   # echo 1 > $RT/0000:04:00.4/remove
> > >   # echo 1 > $RT/0000:04:00.7/remove
> > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > >
> > > That should take 04:00.x out of the picture.
> >
> > Didn't actually change the behaviour, I'm suspecting an errata for AMD 
> > pcie...
> >
> > So did this, with unpatched kernel:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> >
> > and:
> > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
>
> BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> pleased that it seems to be working as intended.

It was nice to find it for easy disabling :)

> > and:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> >
> > But this only confirms that the fix I experience is a side effect.
> >
> > The original code is still wrong :)
>
> What exactly is this machine?  Brand, model, config?  Maybe you could
> add this and a dmesg log to the bugzilla?  It seems like other people
> should be seeing the same problem, so I'm hoping to grub around on the
> web to see if there are similar reports involving these devices.

ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X

> https://bugzilla.kernel.org/show_bug.cgi?id=209725
>
> Here's one that is superficially similar:
> https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> here advertises <64us L1 exit latency instead of the <32us latency
> your switch advertises.  Of course, I can't tell if it's exactly the
> same switch.

Same chipset it seems

I'm running bios version:
Version: 2206
Release Date: 08/13/2020

And the latest is:
Version 3003
2020/12/07

Will test upgrading that as well, but it could be that they report the
incorrect latency of the switch - I don't know how many things AGESA
changes but... It's been updated twice since my upgrade.

> Bjorn


Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Bjorn Helgaas
On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  wrote:

> > If you're interested, you could probably unload the Realtek drivers,
> > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > in 02:04.0, e.g.,
> >
> >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> >   # echo 1 > $RT/0000:04:00.0/remove
> >   # echo 1 > $RT/0000:04:00.1/remove
> >   # echo 1 > $RT/0000:04:00.2/remove
> >   # echo 1 > $RT/0000:04:00.4/remove
> >   # echo 1 > $RT/0000:04:00.7/remove
> >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> >
> > That should take 04:00.x out of the picture.
> 
> Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> 
> So did this, with unpatched kernel:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> 
> and:
> echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm

BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
pleased that it seems to be working as intended.
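
[For anyone reproducing this: the file above is one of the per-device
ASPM controls under .../link/ that appear when CONFIG_PCIEASPM is
enabled.  A minimal userspace sketch, assuming Ian's sysfs path; writing
"0" disables L1 on the device's upstream link and "1" re-enables it.]

#include <stdio.h>

int main(void)
{
	const char *attr = "/sys/devices/pci0000:00/0000:00:01.2/"
			   "0000:01:00.0/link/l1_aspm";
	FILE *f = fopen(attr, "w");	/* needs root */

	if (!f) {
		perror(attr);
		return 1;
	}
	fputs("0\n", f);		/* "0" disables, "1" re-enables */
	return fclose(f) ? 1 : 0;
}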

> and:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                   receiver
> 
> But this only confirms that the fix I experience is a side effect.
> 
> The original code is still wrong :)

What exactly is this machine?  Brand, model, config?  Maybe you could
add this and a dmesg log to the bugzilla?  It seems like other people
should be seeing the same problem, so I'm hoping to grub around on the
web to see if there are similar reports involving these devices.

https://bugzilla.kernel.org/show_bug.cgi?id=209725

Here's one that is superficially similar:
https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
in that it has a RP -- switch -- I211 path.  Interestingly, the switch
here advertises <64us L1 exit latency instead of the <32us latency
your switch advertises.  Of course, I can't tell if it's exactly the
same switch.
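
[The "<32us"/"<64us" strings in these reports are the 3-bit L1 Exit
Latency field of each Port's Link Capabilities register (bits 17:15),
which lspci decodes.  A small standalone helper showing the decoding;
the sample register value below is made up.]

#include <stdint.h>
#include <stdio.h>

/* Decode PCI_EXP_LNKCAP_L1EL (Link Capabilities bits 17:15). */
static const char *l1_exit_latency(uint32_t lnkcap)
{
	static const char *const enc[8] = {
		"<1us", "<2us", "<4us", "<8us",
		"<16us", "<32us", "<64us", ">64us",
	};
	return enc[(lnkcap >> 15) & 0x7];
}

int main(void)
{
	uint32_t lnkcap = 0x00030000;	/* made-up sample, field = 110b */
	printf("L1 exit latency: %s\n", l1_exit_latency(lnkcap));
	return 0;
}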

Bjorn


Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Ian Kumlien
On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas  wrote:
>
> On Mon, Dec 14, 2020 at 04:47:32PM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas  wrote:
> > > On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > > > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas  
> > > > wrote:
> > > > >
> > > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com
> > > > >
> > > > > Short story: Ian has:
> > > > >
> > > > >   Root Port --- Switch --- I211 NIC
> > > > >                        \-- multifunction Realtek NIC, etc
> > > > >
> > > > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > > > in the path to it.  The patch here disables ASPM on the upstream link
> > > > > and fixes the performance, but AFAICT the devices in that path give us
> > > > > no reason to disable L1.  If I understand the spec correctly, the
> > > > > Realtek device should not be relevant to the I211 path.]
> > > > >
> > > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  
> > > > > > wrote:
> > > > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, 
> > > > > > > > specifically:
> > > > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > > > >
> > > > > > > > Which makes it clear that each switch is required to
> > > > > > > > initiate a transition within 1μs from receiving it,
> > > > > > > > accumulating this latency and then we have to wait for the
> > > > > > > > slowest link along the path before entering L0 state from
> > > > > > > > L1.
> > > > > > > > ...
> > > > > > >
> > > > > > > > On my specific system:
> > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit 
> > > > > > > > Network Connection (rev 03)
> > > > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., 
> > > > > > > > Ltd. Device 816e (rev 1a)
> > > > > > > >
> > > > > > > >             Exit latency       Acceptable latency
> > > > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > > > ----------  -------  --------  -------  ------
> > > > > > > > 00:01.2     <32 us   -
> > > > > > > > | 01:00.0   <32 us   -
> > > > > > > > |- 02:03.0  <32 us   -
> > > > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > > > |
> > > > > > > > \- 02:04.0  <32 us   -
> > > > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > > > >
> > > > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > > > we walk the path the first switch's startup latency will pass
> > > > > > > > the acceptable latency limit for the link, and as a
> > > > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > > > >
> > > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > > > back to a maximum of ~933 mbit/s.
> > > > > > >
> > > > > > > There are two paths here that share a Link:
> > > > > > >
> > > > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > > > >
> > > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > > > >connection between 01:00.0 and 02:03.0 is internal Switch 
> > > > > > > routing,
> > > > > > >not a Link).
> > > > > >
> > > > > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > > > ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit 
> > > > > > > at T,
> > > > > > >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may 
> > > > > > > see up
> > > > > > >to 1 + 32 = 33us of L1 exit latency.
> > > > > > >
> > > > > > >The NIC can tolerate up to 64us of L1 exit latency, so it is 
> > > > > > > safe
> > > > > > >to enable L1 for both Links.
> > > > > > >
> > > > > > > 2) The path to the Realtek device is similar except that the 
> > > > > > > Realtek
> > > > > > >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > > > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 
> > > > > > > 1,
> > > > > > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > > > >latency.
> > > > > > >
> > > > > > >The Realtek device can only tolerate 64us of latency, so it is 
> > > > > > > not
> > > > > > >safe to enable L1 for both Links.  It should be safe to enable 
> > > > > > > L1
> > > > > > >on the shared link because the exit latency for that link 
> > > > > > > would be
> > > > > > ><32us.
> > > > > >
> > > > > > 04:00.0:
> > > > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 
> > > > > > <64us
> > > > > > LnkCap: 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Bjorn Helgaas
On Mon, Dec 14, 2020 at 04:47:32PM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas  wrote:
> > On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas  wrote:
> > > >
> > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com
> > > >
> > > > Short story: Ian has:
> > > >
> > > >   Root Port --- Switch --- I211 NIC
> > > >                        \-- multifunction Realtek NIC, etc
> > > >
> > > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > > in the path to it.  The patch here disables ASPM on the upstream link
> > > > and fixes the performance, but AFAICT the devices in that path give us
> > > > no reason to disable L1.  If I understand the spec correctly, the
> > > > Realtek device should not be relevant to the I211 path.]
> > > >
> > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  
> > > > > wrote:
> > > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, 
> > > > > > > specifically:
> > > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > > >
> > > > > > > Which makes it clear that each switch is required to
> > > > > > > initiate a transition within 1μs from receiving it,
> > > > > > > accumulating this latency and then we have to wait for the
> > > > > > > slowest link along the path before entering L0 state from
> > > > > > > L1.
> > > > > > > ...
> > > > > >
> > > > > > > On my specific system:
> > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit 
> > > > > > > Network Connection (rev 03)
> > > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. 
> > > > > > > Device 816e (rev 1a)
> > > > > > >
> > > > > > >             Exit latency       Acceptable latency
> > > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > > ----------  -------  --------  -------  ------
> > > > > > > 00:01.2     <32 us   -
> > > > > > > | 01:00.0   <32 us   -
> > > > > > > |- 02:03.0  <32 us   -
> > > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > > |
> > > > > > > \- 02:04.0  <32 us   -
> > > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > > >
> > > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > > we walk the path the first switch's startup latency will pass
> > > > > > > the acceptable latency limit for the link, and as a
> > > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > > >
> > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > > back to a maximum of ~933 mbit/s.
> > > > > >
> > > > > > There are two paths here that share a Link:
> > > > > >
> > > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > > >
> > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > > >connection between 01:00.0 and 02:03.0 is internal Switch 
> > > > > > routing,
> > > > > >not a Link).
> > > > >
> > > > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > > ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at 
> > > > > > T,
> > > > > >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see 
> > > > > > up
> > > > > >to 1 + 32 = 33us of L1 exit latency.
> > > > > >
> > > > > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > > > >to enable L1 for both Links.
> > > > > >
> > > > > > 2) The path to the Realtek device is similar except that the Realtek
> > > > > >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > > > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > > >latency.
> > > > > >
> > > > > >The Realtek device can only tolerate 64us of latency, so it is 
> > > > > > not
> > > > > >safe to enable L1 for both Links.  It should be safe to enable L1
> > > > > >on the shared link because the exit latency for that link would 
> > > > > > be
> > > > > ><32us.
> > > > >
> > > > > 04:00.0:
> > > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 
> > > > > <64us
> > > > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > > > unlimited, L1 <64us
> > > > >
> > > > > So maximum latency for the entire link has to be <64 us
> > > > > For the device to leave L1 ASPM takes <64us
> > > > >
> > > > > So the device itself is the slowest entry along the link, which
> > > > > 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Ian Kumlien
On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas  wrote:
>
> On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas  wrote:
> > >
> > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com
> > >
> > > Short story: Ian has:
> > >
> > >   Root Port --- Switch --- I211 NIC
> > >                        \-- multifunction Realtek NIC, etc
> > >
> > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > in the path to it.  The patch here disables ASPM on the upstream link
> > > and fixes the performance, but AFAICT the devices in that path give us
> > > no reason to disable L1.  If I understand the spec correctly, the
> > > Realtek device should not be relevant to the I211 path.]
> > >
> > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  
> > > > wrote:
> > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, 
> > > > > > specifically:
> > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > >
> > > > > > Which makes it clear that each switch is required to
> > > > > > initiate a transition within 1μs from receiving it,
> > > > > > accumulating this latency and then we have to wait for the
> > > > > > slowest link along the path before entering L0 state from
> > > > > > L1.
> > > > > > ...
> > > > >
> > > > > > On my specific system:
> > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
> > > > > > Connection (rev 03)
> > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. 
> > > > > > Device 816e (rev 1a)
> > > > > >
> > > > > >             Exit latency       Acceptable latency
> > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > ----------  -------  --------  -------  ------
> > > > > > 00:01.2     <32 us   -
> > > > > > | 01:00.0   <32 us   -
> > > > > > |- 02:03.0  <32 us   -
> > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > |
> > > > > > \- 02:04.0  <32 us   -
> > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > >
> > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > we walk the path the first switch's startup latency will pass
> > > > > > the acceptable latency limit for the link, and as a
> > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > >
> > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > back to a maximum of ~933 mbit/s.
> > > > >
> > > > > There are two paths here that share a Link:
> > > > >
> > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > >
> > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > >connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > > >not a Link).
> > > >
> > > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > > >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > > >to 1 + 32 = 33us of L1 exit latency.
> > > > >
> > > > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > > >to enable L1 for both Links.
> > > > >
> > > > > 2) The path to the Realtek device is similar except that the Realtek
> > > > >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > >latency.
> > > > >
> > > > >The Realtek device can only tolerate 64us of latency, so it is not
> > > > >safe to enable L1 for both Links.  It should be safe to enable L1
> > > > >on the shared link because the exit latency for that link would be
> > > > ><32us.
> > > >
> > > > 04:00.0:
> > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > > unlimited, L1 <64us
> > > >
> > > > So maximum latency for the entire link has to be <64 us
> > > > For the device to leave L1 ASPM takes <64us
> > > >
> > > > So the device itself is the slowest entry along the link, which
> > > > means that nothing else along that path can have ASPM enabled
> > >
> > > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > > Links."  Unless I'm missing something, we agree on that.
> > >
> > > I also said that it should be safe to enable L1 on the shared Link
> > > (from 00:01.2 to 01:00.0) because if the 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Bjorn Helgaas
On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas  wrote:
> >
> > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > issue with I211 or Realtek NICs.  Beginning of thread:
> > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com
> >
> > Short story: Ian has:
> >
> >   Root Port --- Switch --- I211 NIC
> >                        \-- multifunction Realtek NIC, etc
> >
> > and the I211 performance is poor with ASPM L1 enabled on both links
> > in the path to it.  The patch here disables ASPM on the upstream link
> > and fixes the performance, but AFAICT the devices in that path give us
> > no reason to disable L1.  If I understand the spec correctly, the
> > Realtek device should not be relevant to the I211 path.]
> >
> > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  wrote:
> > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > >
> > > > > Which makes it clear that each switch is required to
> > > > > initiate a transition within 1μs from receiving it,
> > > > > accumulating this latency and then we have to wait for the
> > > > > slowest link along the path before entering L0 state from
> > > > > L1.
> > > > > ...
> > > >
> > > > > On my specific system:
> > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
> > > > > Connection (rev 03)
> > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. 
> > > > > Device 816e (rev 1a)
> > > > >
> > > > >             Exit latency       Acceptable latency
> > > > > Tree:       L1       L0s       L1       L0s
> > > > > ----------  -------  --------  -------  ------
> > > > > 00:01.2     <32 us   -
> > > > > | 01:00.0   <32 us   -
> > > > > |- 02:03.0  <32 us   -
> > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > |
> > > > > \- 02:04.0  <32 us   -
> > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > >
> > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > we walk the path the first switch's startup latency will pass
> > > > > the acceptable latency limit for the link, and as a
> > > > > side-effect it fixes my issues with 03:00.0.
> > > > >
> > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > back to a maximum of ~933 mbit/s.
> > > >
> > > > There are two paths here that share a Link:
> > > >
> > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > >
> > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > >connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > >not a Link).
> > >
> > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > >to 1 + 32 = 33us of L1 exit latency.
> > > >
> > > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > >to enable L1 for both Links.
> > > >
> > > > 2) The path to the Realtek device is similar except that the Realtek
> > > >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > >latency.
> > > >
> > > >The Realtek device can only tolerate 64us of latency, so it is not
> > > >safe to enable L1 for both Links.  It should be safe to enable L1
> > > >on the shared link because the exit latency for that link would be
> > > ><32us.
> > >
> > > 04:00.0:
> > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > unlimited, L1 <64us
> > >
> > > So maximum latency for the entire link has to be <64 us
> > > For the device to leave L1 ASPM takes <64us
> > >
> > > So the device itself is the slowest entry along the link, which
> > > means that nothing else along that path can have ASPM enabled
> >
> > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > Links."  Unless I'm missing something, we agree on that.
> >
> > I also said that it should be safe to enable L1 on the shared Link
> > (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> > L0, the exit latency of the shared Link should be <32us, and 04:00.x
> > can tolerate 64us.
> 
> Exit latency of the shared link would be the max of the links, i.e., 64 + L1-hops, not 32

I don't think this is true.  The path from 00:01.2 to 04:00.x includes
two 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-14 Thread Ian Kumlien
On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas  wrote:
>
> [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> issue with I211 or Realtek NICs.  Beginning of thread:
> https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com
>
> Short story: Ian has:
>
>   Root Port --- Switch --- I211 NIC
>                        \-- multifunction Realtek NIC, etc
>
> and the I211 performance is poor with ASPM L1 enabled on both links
> in the path to it.  The patch here disables ASPM on the upstream link
> and fixes the performance, but AFAICT the devices in that path give us
> no reason to disable L1.  If I understand the spec correctly, the
> Realtek device should not be relevant to the I211 path.]
>
> On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  wrote:
> > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > "5.4.1.2.2. Exit from the L1 State"
> > > >
> > > > Which makes it clear that each switch is required to initiate a
> > > > transition within 1μs from receiving it, accumulating this latency and
> > > > then we have to wait for the slowest link along the path before
> > > > entering L0 state from L1.
> > > > ...
> > >
> > > > On my specific system:
> > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
> > > > Connection (rev 03)
> > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 
> > > > 816e (rev 1a)
> > > >
> > > >             Exit latency       Acceptable latency
> > > > Tree:       L1       L0s       L1       L0s
> > > > ----------  -------  --------  -------  ------
> > > > 00:01.2     <32 us   -
> > > > | 01:00.0   <32 us   -
> > > > |- 02:03.0  <32 us   -
> > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > |
> > > > \- 02:04.0  <32 us   -
> > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > >
> > > > 04:00.0's latency is the same as the maximum it allows so as we walk 
> > > > the path
> > > > the first switch's startup latency will pass the acceptable latency limit
> > > > for the link, and as a side-effect it fixes my issues with 03:00.0.
> > > >
> > > > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> > > > links with 6 or more hops. With this patch I'm back to a maximum of ~933
> > > > mbit/s.
> > >
> > > There are two paths here that share a Link:
> > >
> > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > >
> > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > >connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > >not a Link).
> >
> > >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > >to 1 + 32 = 33us of L1 exit latency.
> > >
> > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > >to enable L1 for both Links.
> > >
> > > 2) The path to the Realtek device is similar except that the Realtek
> > >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > >latency.
> > >
> > >The Realtek device can only tolerate 64us of latency, so it is not
> > >safe to enable L1 for both Links.  It should be safe to enable L1
> > >on the shared link because the exit latency for that link would be
> > ><32us.
> >
> > 04:00.0:
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > unlimited, L1 <64us
> >
> > So maximum latency for the entire link has to be <64 us
> > For the device to leave L1 ASPM takes <64us
> >
> > So the device itself is the slowest entry along the link, which
> > means that nothing else along that path can have ASPM enabled
>
> Yes.  That's what I said above: "it is not safe to enable L1 for both
> Links."  Unless I'm missing something, we agree on that.
>
> I also said that it should be safe to enable L1 on the shared Link
> (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> L0, the exit latency of the shared Link should be <32us, and 04:00.x
> can tolerate 64us.

Exit latency of the shared link would be the max of the links, i.e., 64 + L1-hops, not 32

> > > > The original code path did:
> > > > 04:00.0-02:04.0 max latency 64    -> ok
> > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > >
> > > > And thus didn't see any L1 ASPM latency issues.
> > > >
> > > > The new code does:
> > > > 04:00.0-02:04.0 max latency 64    -> ok
> > > > 02:04.0-01:00.0 

Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

2020-12-13 Thread Bjorn Helgaas
[+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
issue with I211 or Realtek NICs.  Beginning of thread:
https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com

Short story: Ian has:

  Root Port --- Switch --- I211 NIC
                       \-- multifunction Realtek NIC, etc

and the I211 performance is poor with ASPM L1 enabled on both links
in the path to it.  The patch here disables ASPM on the upstream link
and fixes the performance, but AFAICT the devices in that path give us
no reason to disable L1.  If I understand the spec correctly, the
Realtek device should not be relevant to the I211 path.]

On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas  wrote:
> > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > "5.4.1.2.2. Exit from the L1 State"
> > >
> > > Which makes it clear that each switch is required to initiate a
> > > transition within 1μs from receiving it, accumulating this latency and
> > > then we have to wait for the slowest link along the path before
> > > entering L0 state from L1.
> > > ...
> >
> > > On my specific system:
> > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
> > > Connection (rev 03)
> > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 
> > > 816e (rev 1a)
> > >
> > >             Exit latency       Acceptable latency
> > > Tree:       L1       L0s       L1       L0s
> > > ----------  -------  --------  -------  ------
> > > 00:01.2     <32 us   -
> > > | 01:00.0   <32 us   -
> > > |- 02:03.0  <32 us   -
> > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > |
> > > \- 02:04.0  <32 us   -
> > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > >
> > > 04:00.0's latency is the same as the maximum it allows so as we walk the 
> > > path
> > > the first switch's startup latency will pass the acceptable latency limit
> > > for the link, and as a side-effect it fixes my issues with 03:00.0.
> > >
> > > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> > > links with 6 or more hops. With this patch I'm back to a maximum of ~933
> > > mbit/s.
> >
> > There are two paths here that share a Link:
> >
> >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> >
> > 1) The path to the I211 NIC includes four Ports and two Links (the
> >connection between 01:00.0 and 02:03.0 is internal Switch routing,
> >not a Link).
> 
> >The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> ><16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> >01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> >to 1 + 32 = 33us of L1 exit latency.
> >
> >The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> >to enable L1 for both Links.
> >
> > 2) The path to the Realtek device is similar except that the Realtek
> >L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> >latency.
> >
> >The Realtek device can only tolerate 64us of latency, so it is not
> >safe to enable L1 for both Links.  It should be safe to enable L1
> >on the shared link because the exit latency for that link would be
> ><32us.
> 
> 04:00.0:
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> unlimited, L1 <64us
> 
> So maximum latency for the entire link has to be <64 us
> For the device to leave L1 ASPM takes <64us
> 
> So the device itself is the slowest entry along the link, which
> means that nothing else along that path can have ASPM enabled

Yes.  That's what I said above: "it is not safe to enable L1 for both
Links."  Unless I'm missing something, we agree on that.

I also said that it should be safe to enable L1 on the shared Link
(from 00:01.2 to 01:00.0) because if the downstream Link is always in
L0, the exit latency of the shared Link should be <32us, and 04:00.x
can tolerate 64us.
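
[To make the disagreement concrete, a standalone model (not
pcie_aspm_check_latency() itself) of the check the patch implements:
the slowest link's L1 exit latency dominates the whole path, plus 1us
for every hop above the endpoint.  Values are the Realtek path from the
table above, and the output reproduces the trace quoted below.]

#include <stdio.h>

struct hop { const char *link; unsigned int l1_exit_us; };

int main(void)
{
	/* Walk upstream from the endpoint, as the ASPM code does. */
	static const struct hop path[] = {
		{ "04:00.0-02:04.0", 64 },
		{ "02:04.0-01:00.0", 32 },	/* internal switch routing */
		{ "01:00.0-00:01.2", 32 },	/* shared link */
	};
	const unsigned int acceptable = 64;	/* 04:00.0 DevCap L1 */
	unsigned int max_exit = 0;

	for (unsigned int i = 0; i < 3; i++) {
		if (path[i].l1_exit_us > max_exit)
			max_exit = path[i].l1_exit_us;
		printf("%s max latency %u +%u -> %s\n", path[i].link,
		       max_exit, i,
		       max_exit + i <= acceptable ? "ok" : "latency exceeded");
	}
	return 0;
}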

> > > The original code path did:
> > > 04:00.0-02:04.0 max latency 64    -> ok
> > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > >
> > > And thus didn't see any L1 ASPM latency issues.
> > >
> > > The new code does:
> > > 04:00.0-02:04.0 max latency 64    -> ok
> > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> >
> > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > because that's internal Switch routing, not a Link.  But even without
> > that extra microsecond, this path does exceed the acceptable latency
> > since 1 + 64 =