Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sven Van Asbroeck
Hi Christoph,

On Fri, Feb 5, 2021 at 4:31 AM Christoph Hellwig  wrote:
>
> This is a pattern we've seen in a few other net driver, so we should
> be ok.  It just is rather hairy and needs a good justification, which
> seems to be given here.

Thank you so much for taking the time to look into this.
That is certainly good news !!


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sergej Bauer
On Friday, February 5, 2021 5:07:22 PM MSK you wrote:
> Hi Sergej,
> 
> On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer  wrote:
> > Hi Sven
> > I can confirm great stability improvement after your patch
> > "lan743x: boost performance on cpu archs w/o dma cache snooping".
> > 
> > Test machine is Intel Pentium G4560 3.50GHz
> > lan743x with rejected virtual phy 'inside'
> 
> Interesting, so the speed boost patch seems to improve things even on
> Intel...
> 
> Would you be able to apply and test the multi-buffer patch as well?
> To do that, you can simply apply patches [2/6] and [3/6] on top of
> what you already have.
> 

Hi Sven
Tests after applying patches [2/6] and [3/6] are:
$ ifmtu eth7 500
$ sudo test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf
...
number of sent packets  = 100
number of received packets  = 713288
number of lost packets = 286712
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 286712
bit error rate = 0.286712
average speed: 427.8043 Mbit/s

$ ifmtu eth7 1500
$ sudo test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf
...
number of sent packets  = 100
number of received packets  = 707869
number of lost packets = 292131
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 292131
bit error rate = 0.292131
average speed: 431.0163 Mbit/s

$ sudo test_ber -l eth7 -c 1000 -n 100 -f1500 --no-conf
...
number of sent packets  = 100
number of received packets  = 100
number of lost packets = 0
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 0
bit error rate = 0
average speed: 646.4932 Mbit/s

> Keeping in mind that Bryan has identified an issue with the above
> patch, which will get fixed in v2. So YMMV.
I'll perform tests with v2 as well.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sven Van Asbroeck
Hi Sergej,

On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer  wrote:
>
> Tests after applying patches [2/6] and [3/6] are:
> $ ifmtu eth7 500
> $ sudo test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf

Thank you! Is there a way for me to run test_ber myself?
Is this a standard, or a bespoke testing tool?


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sergej Bauer
sOn Friday, February 5, 2021 7:39:40 PM MSK Sven Van Asbroeck wrote:
> Hi Sergej,
> 
> On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer  wrote:
> > Tests after applying patches [2/6] and [3/6] are:
> > $ ifmtu eth7 500
> > $ sudo test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf
> 
> Thank you! Is there a way for me to run test_ber myself?
> Is this a standard, or a bespoke testing tool?
It's kind of bespoke... A part of framework to assist HW guys in developing
PHY-device. But the project is finished, so I could ask for permission to send
the source to you.





Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sven Van Asbroeck
Hi Sergej,

On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer  wrote:
>
> Hi Sven
> I can confirm great stability improvement after your patch
> "lan743x: boost performance on cpu archs w/o dma cache snooping".
>
> Test machine is Intel Pentium G4560 3.50GHz
> lan743x with rejected virtual phy 'inside'

Interesting, so the speed boost patch seems to improve things even on Intel...

Would you be able to apply and test the multi-buffer patch as well?
To do that, you can simply apply patches [2/6] and [3/6] on top of
what you already have.

Keeping in mind that Bryan has identified an issue with the above
patch, which will get fixed in v2. So YMMV.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Sergej Bauer
Hi Sven
I can confirm great stability improvement after your patch
"lan743x: boost performance on cpu archs w/o dma cache snooping".
Please note, that test_ber opens raw sockets as
s = socket(AF_PACKET, SOCK_RAW, ETH_P_ALL)
and resulting 'average speed' is a average egress speed.

Test machine is Intel Pentium G4560 3.50GHz
lan743x with rejected virtual phy 'inside'

What I had before:
$ ifmtu eth7 500
$ test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf
...
number of sent packets  = 100
number of received packets  = 289017
number of lost packets = 710983
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 710983
bit error rate = 0.710983
average speed: 429.3799 Mbit/s

$ ifmtu eth7 1500
$ sudo test_ber -l eth7 -c 1000 -n 100 -f1500 --no-conf
...
number of sent packets  = 100
number of received packets  = 577194
number of lost packets = 422806
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 422806
bit error rate = 0.422806
average speed: 644.6557 Mbit/s
---

and what I had after your patch:
$ ifmtu eth7 500
$ test_ber -l eth7 -c 1000 -n 100 -f500 --no-conf
...
number of sent packets  = 100
number of received packets  = 711329
number of lost packets = 288671
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 288671
bit error rate = 0.288671
average speed: 429.2263 Mbit/s

$ ifmtu eth7 1500
$ test_ber -l eth7 -c 1000 -n 100 -f1500 --no-conf
...
number of sent packets  = 100
number of received packets  = 100
number of lost packets = 0
number of out of order packets = 0
number of bit errors   = 0
total errors detected  = 0
bit error rate = 0
average speed: 644.5405 Mbit/s


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-05 Thread Christoph Hellwig
This is a pattern we've seen in a few other net driver, so we should
be ok.  It just is rather hairy and needs a good justification, which
seems to be given here.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-30 Thread Sven Van Asbroeck
On Sat, Jan 30, 2021 at 5:11 PM  wrote:
>
> It appears you moved this packet_length assignment from just below the 
> following if block, however  you left out the le32_to_cpu.See next comment

PS this merge snafu is removed completely by the next patch in the set.
So this will not prevent you from reviewing/testing the multi-buffer support,
should you want to.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-30 Thread Sven Van Asbroeck
Hi Bryan, thank you so much for reviewing, I really appreciate it.

On Sat, Jan 30, 2021 at 5:11 PM  wrote:
>
> > /* unmap from dma */
> > +   packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> > +   (descriptor->data0);
> It appears you moved this packet_length assignment from just below the 
> following if block, however  you left out the le32_to_cpu.See next comment
>

Correct on both counts. This is a merge snafu that crept in when I
rebased to Alexey's very recent little/big endian patch, at the last
minute.
My tests didn't catch it, because I'm running on a little-endian cpu,
where le32_to_cpu() compiles to a nop.

Had I had the good sense to run sparse on every patch, like Jakub has,
I would have caught it...


RE: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-30 Thread Bryan.Whitehead
Sven, see below comments

> @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct
> lan743x_rx *rx)
> descriptor = &rx->ring_cpu_ptr[first_index];
> 
> /* unmap from dma */
> +   packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> +   (descriptor->data0);
It appears you moved this packet_length assignment from just below the 
following if block, however  you left out the le32_to_cpu.See next comment

> if (buffer_info->dma_ptr) {
> -   dma_unmap_single(&rx->adapter->pdev->dev,
> -buffer_info->dma_ptr,
> -buffer_info->buffer_length,
> -DMA_FROM_DEVICE);
> +   
> dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
> +   buffer_info->dma_ptr,
> +   packet_length,
> +   DMA_FROM_DEVICE);
> +   
> dma_unmap_single_attrs(&rx->adapter->pdev->dev,
> +  buffer_info->dma_ptr,
> +  
> buffer_info->buffer_length,
> +  DMA_FROM_DEVICE,
> +
> + DMA_ATTR_SKIP_CPU_SYNC);
> buffer_info->dma_ptr = 0;
> buffer_info->buffer_length = 0;
> }
Just below here is the following line
packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
(le32_to_cpu(descriptor->data0));
This line was moved above the previous if block, but the le32_to_cpu was 
removed. Is that intentional?
Also I don't see any mention of this packet_length assignment (after the if 
block) being removed.
Since packet_length already contains this value, it seems unnecessary to keep 
this assignment.

> @@ -2167,8 +2175,8 @@ static int lan743x_rx_process_packet(struct
> lan743x_rx *rx)
> int index = first_index;
> 
> /* multi buffer packet not supported */
> -   /* this should not happen since
> -* buffers are allocated to be at least jumbo size
> +   /* this should not happen since buffers are allocated
> +* to be at least the mtu size configured in the mac.
>  */
> 
> /* clean up buffers */ @@ -2628,6 +2636,9 @@ static 
> int
> lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
> struct lan743x_adapter *adapter = netdev_priv(netdev);
> int ret = 0;
> 
> +   if (netif_running(netdev))
> +   return -EBUSY;
> +
> ret = lan743x_mac_set_mtu(adapter, new_mtu);
> if (!ret)
> netdev->mtu = new_mtu;
> --
> 2.17.1



Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-29 Thread Sven Van Asbroeck
Hi Andrew, thank you so much for looking at this patch !

On Fri, Jan 29, 2021 at 3:36 PM Andrew Lunn  wrote:
>
> So this patch appears to contain two different changes
> 1) You only allocate a receive buffer as big as the MTU plus overheads
> 2) You change the cache operations to operate on the received length.
>
> The first change should be completely safe, and i guess, is giving
> most of the benefits. The second one is where interesting things might
> happen. So please split this patch into two.  If it does break, we can
> git bisect, and probably end up on the second patch.
>

Yes, I tested this extensively on arm7, but you're right, it might behave
differently on other platforms. I will split into two, as you suggested.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-29 Thread Sven Van Asbroeck
On Fri, Jan 29, 2021 at 5:01 PM Jakub Kicinski  wrote:
>
> You may need to rebase to see this:
>
> drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted 
> __le32 degrades to integer

Good catch. The problem goes away with the next commit in the set.
This is probably because I rebased to the little endian/big endian patch at
the last minute. I'll fix it up in v2.


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-29 Thread Jakub Kicinski
On Fri, 29 Jan 2021 14:52:35 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck 
> 
> The buffers in the lan743x driver's receive ring are always 9K,
> even when the largest packet that can be received (the mtu) is
> much smaller. This performs particularly badly on cpu archs
> without dma cache snooping (such as ARM): each received packet
> results in a 9K dma_{map|unmap} operation, which is very expensive
> because cpu caches need to be invalidated.
> 
> Careful measurement of the driver rx path on armv7 reveals that
> the cpu spends the majority of its time waiting for cache
> invalidation.
> 
> Optimize as follows:
> 
> 1. set rx ring buffer size equal to the mtu. this limits the
>amount of cache that needs to be invalidated per dma_map().
> 
> 2. when dma_unmap()ping, skip cpu sync. Sync only the packet data
>actually received, the size of which the chip will indicate in
>its rx ring descriptors. this limits the amount of cache that
>needs to be invalidated per dma_unmap().
> 
> These optimizations double the rx performance on armv7.
> Third parties report 3x rx speedup on armv8.
> 
> Performance on dma cache snooping architectures (such as x86)
> is expected to stay the same.
> 
> Tested with iperf3 on a freescale imx6qp + lan7430, both sides
> set to mtu 1500 bytes, measure rx performance:
> 
> Before:
> [ ID] Interval   Transfer Bandwidth   Retr
> [  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec0
> After:
> [ ID] Interval   Transfer Bandwidth   Retr
> [  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec0
> 
> Test by Anders Roenningen (and...@ronningen.priv.no) on armv8,
> rx iperf3:
> Before 102 Mbits/sec
> After  279 Mbits/sec
> 
> Signed-off-by: Sven Van Asbroeck 

You may need to rebase to see this:

drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted 
__le32 degrades to integer


Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-29 Thread Andrew Lunn
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.c 
> b/drivers/net/ethernet/microchip/lan743x_main.c
> index f1f6eba4ace4..f485320e5784 100644
> --- a/drivers/net/ethernet/microchip/lan743x_main.c
> +++ b/drivers/net/ethernet/microchip/lan743x_main.c
> @@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx 
> *rx, int index)
>  
>  static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
>  {
> - int length = 0;
> + struct net_device *netdev = rx->adapter->netdev;
>  
> - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> - return __netdev_alloc_skb(rx->adapter->netdev,
> -   length, GFP_ATOMIC | GFP_DMA);
> + return __netdev_alloc_skb(netdev,
> +   netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
> +   GFP_ATOMIC | GFP_DMA);
>  }
>  
>  static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index)
> @@ -1977,9 +1977,10 @@ static int lan743x_rx_init_ring_element(struct 
> lan743x_rx *rx, int index,
>  {
>   struct lan743x_rx_buffer_info *buffer_info;
>   struct lan743x_rx_descriptor *descriptor;
> - int length = 0;
> + struct net_device *netdev = rx->adapter->netdev;
> + int length;

Please keep to reverse christmass tree.
>  
> - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
> + length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
>   descriptor = &rx->ring_cpu_ptr[index];
>   buffer_info = &rx->buffer_info[index];
>   buffer_info->skb = skb;
> @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct 
> lan743x_rx *rx)
>   descriptor = &rx->ring_cpu_ptr[first_index];
>  
>   /* unmap from dma */
> + packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> + (descriptor->data0);
>   if (buffer_info->dma_ptr) {
> - dma_unmap_single(&rx->adapter->pdev->dev,
> -  buffer_info->dma_ptr,
> -  buffer_info->buffer_length,
> -  DMA_FROM_DEVICE);
> + dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
> + buffer_info->dma_ptr,
> + packet_length,
> + DMA_FROM_DEVICE);
> + dma_unmap_single_attrs(&rx->adapter->pdev->dev,
> +buffer_info->dma_ptr,
> +
> buffer_info->buffer_length,
> +DMA_FROM_DEVICE,
> +DMA_ATTR_SKIP_CPU_SYNC);

So this patch appears to contain two different changes
1) You only allocate a receive buffer as big as the MTU plus overheads
2) You change the cache operations to operate on the received length.

The first change should be completely safe, and i guess, is giving
most of the benefits. The second one is where interesting things might
happen. So please split this patch into two.  If it does break, we can
git bisect, and probably end up on the second patch.

Thanks
Andrew