Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer
On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov  wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov  wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX.
> > > > Every time you call the stack code, you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX code, which is a costly operation.
> > >   
> > [...]  
> > > To make further progress in this discussion, can we talk about
> > > the use case you have in mind instead? Then the solution will
> > > be much clearer, I hope.
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Let's say I want to implement a DDoS mitigation facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination MAC to be that of my monitor
> > machine and then XDP_TX'ing the packet.
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from the affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the ToR? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load, I would either
forward out another interface (which is not supported yet) or simply
sample the packets being forwarded to the monitor host.
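
To make the monitor idea concrete, here is a minimal sketch of such an XDP
program (illustrative only: the is_malicious() classifier, the monitor MAC
address and the 1-in-256 sample rate are placeholders, not something from
this thread):

/* Illustrative sketch: sample classified-bad packets towards a monitor
 * host by rewriting the destination MAC and returning XDP_TX. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

static __always_inline int is_malicious(void *data, void *data_end)
{
    /* placeholder classifier; real logic would parse IP/UDP/TCP here */
    return 0;
}

SEC("xdp")
int ddos_monitor(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    /* placeholder monitor-host MAC address */
    unsigned char monitor_mac[ETH_ALEN] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (!is_malicious(data, data_end))
        return XDP_PASS;            /* clean traffic goes to the stack */

    /* sample roughly 1 in 256 of the bad packets for the monitor host */
    if (bpf_get_prandom_u32() & 0xff)
        return XDP_DROP;

    __builtin_memcpy(eth->h_dest, monitor_mac, ETH_ALEN);
    return XDP_TX;                  /* bounce out the same interface */
}

char _license[] SEC("license") = "GPL";

This is only meant to make the discussion concrete; the actual classifier
and sampling policy are beside the point here.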

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (the monitor machine can
> > help diagnose false positives).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, the efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading an XDP/eBPF
> > program with 100% delivery to the stack actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working on removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one is arguing against improving the page allocator.
> 
> >  It is actually an excellent argument for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov  wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Imagine you have packets intermixed towards the stack and XDP_TX.
> > > Every time you call the stack code, you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX code, which is a costly operation.
> > 
> [...]
> > To make further progress in this discussion, can we talk about
> > the use case you have in mind instead? Then the solution will
> > be much clearer, I hope.
> 
> The DDoS use-case _is_ affected by this "hidden" bulking design.
> 
> Let's say I want to implement a DDoS mitigation facility. Instead of just
> dropping the malicious packets, I want to see the bad packets.  I
> implement this by rewriting the destination MAC to be that of my monitor
> machine and then XDP_TX'ing the packet.

not following the use case. you want to implement a DDoS generator?
Or just forward all bad packets from the affected host to another host
in the same rack? so two servers will be spammed with traffic and
even more load on the ToR? I really don't see how this is useful
for anything but stress testing.

> In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> of the traffic is delivered to the stack. (See note 1)

hmm. DoS prevention use case is when 99% of the traffic is dropped.

> Once the DDoS attack starts, the traffic pattern changes, and XDP
> should (hopefully only) catch the malicious traffic (the monitor machine can
> help diagnose false positives).  Now, due to interleaving the DDoS
> traffic with the clean traffic, the efficiency of XDP_TX is reduced due to
> more icache misses...
> 
> 
> 
> Note(1): Notice I have already demonstrated that loading an XDP/eBPF
> program with 100% delivery to the stack actually slows down the
> normal stack.  This is due to hitting a bottleneck in the page
> allocator.  I'm working on removing that bottleneck with page_pool, and
> that solution is orthogonal to this problem.

sure. no one is arguing against improving the page allocator.

>  It is actually an excellent argument for why you would want to run a
> DDoS XDP filter only on a restricted number of RX queues.

no. it's the opposite. If the host is under DoS there is no way
the host can tell in advance which rx queue will be seeing bad packets.



Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov  wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > > >  Let's do bundling/bulking from the start!
> > > > 
> > > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > > does bulking as well.
> > > > See nothing wrong about it. RX side processes the packets and
> > > > when it's done it tells TX to xmit whatever it collected.  
> > > 
> > > This is doing "hidden" bulking and not really taking advantage of using
> > > the icache more efficiently.
> > > 
> > > Let me explain the problem I see a little more clearly, so you
> > > hopefully see where I'm going.
> > > 
> > > Imagine you have packets intermixed towards the stack and XDP_TX.
> > > Every time you call the stack code, you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX code, which is a costly operation.
> > 
> > correct. And why is that a problem?
> 
> It is good that you can see and acknowledge the I-cache problem.
> 
> XDP is all about performance.  What I hear is that you are arguing
> against a model that will yield better performance, which does not make
> sense to me.  Let me explain this again, in another way.

I'm arguing against your proposal, which I think will be more complex and
lower-performing than what Saeed and the team already implemented.
Therefore I don't think it's fair to block the patch and ask them to
reimplement it just to test an idea that may or may not improve performance.

Getting maximum performance is tricky. Good is better than perfect.
It's important to argue about user space visible bits upfront, but
on the kernel performance side we should build/test incrementally.
This particular patch 11/11 is simple, easy to review and provides
good performance. What's not to like?



Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov  wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
> > Imagine you have packets intermixed towards the stack and XDP_TX.
> > Every time you call the stack code, you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX code, which is a costly operation.
> 
[...]
> To make further progress in this discussion, can we talk about
> the use case you have in mind instead? Then the solution will
> be much clearer, I hope.

The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS mitigation facility. Instead of just
dropping the malicious packets, I want to see the bad packets.  I
implement this by rewriting the destination MAC to be that of my monitor
machine and then XDP_TX'ing the packet.

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully only) catch the malicious traffic (the monitor machine can
help diagnose false positives).  Now, due to interleaving the DDoS
traffic with the clean traffic, the efficiency of XDP_TX is reduced due to
more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack.  This is due to hitting a bottleneck in the page
allocator.  I'm working on removing that bottleneck with page_pool, and
that solution is orthogonal to this problem.
 It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov  wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > >  Let's do bundling/bulking from the start!
> > > 
> > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > does bulking as well.
> > > See nothing wrong about it. RX side processes the packets and
> > > when it's done it tells TX to xmit whatever it collected.  
> > 
> > This is doing "hidden" bulking and not really taking advantage of using
> > the icache more efficiently.
> > 
> > Let me explain the problem I see a little more clearly, so you
> > hopefully see where I'm going.
> > 
> > Imagine you have packets intermixed towards the stack and XDP_TX.
> > Every time you call the stack code, you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX code, which is a costly operation.
> 
> correct. And why is that a problem?

It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance.  What I hear is that you are arguing
against a model that will yield better performance, which does not make
sense to me.  Let me explain this again, in another way.

This is a mental model switch.  Stop seeing the lowest driver RX as
something that works on a per-packet basis.  Maybe it is easier to
understand if we instead see this as vector processing?  This is about
having a vector of packets, to which we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle (directly measurable with perf, while I-cache
is not directly measurable).


Let's assume everything fits into the I-cache (XDP + driver code). The
CPU frontend still has to decode the instructions from the I-cache
into micro-ops.  The next level of optimization is to reuse the
decoded icache by running it on all elements in the packet vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that
each hot code block should be less than about 500 instructions.  Thus,
the different "stages" working on the packet vector need to be rather
small and compact (a sketch follows the references below).

[1] 
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] 
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
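
A rough sketch of what such staged processing of a packet vector could look
like in a driver poll loop (illustrative only; struct rx_ring, pull_rx_frame(),
run_xdp(), post_xdp_tx(), ring_tx_doorbell(), recycle_frame() and
deliver_to_stack() are hypothetical placeholders, not an existing API):

/* Illustrative: run each small "stage" over the whole packet vector, so
 * the decoded instructions for that stage are reused across all packets. */
#include <linux/bpf.h>          /* XDP_TX / XDP_PASS action values */

#define PKT_VEC_SIZE 64

struct rx_ring;                 /* opaque placeholder for the driver ring */

struct pkt_vec {
    void *pkt[PKT_VEC_SIZE];
    int   cnt;
};

static void rx_poll_vector(struct rx_ring *ring, int budget)
{
    struct pkt_vec rx_vec  = { .cnt = 0 };   /* all received frames  */
    struct pkt_vec tx_vec  = { .cnt = 0 };   /* frames for XDP_TX    */
    struct pkt_vec stk_vec = { .cnt = 0 };   /* frames for the stack */
    int i;

    /* Stage 1: pull descriptors only, no per-packet decisions yet */
    while (rx_vec.cnt < budget && rx_vec.cnt < PKT_VEC_SIZE) {
        if (!pull_rx_frame(ring, &rx_vec))
            break;
    }

    /* Stage 2: run the XDP program across the whole vector */
    for (i = 0; i < rx_vec.cnt; i++) {
        switch (run_xdp(rx_vec.pkt[i])) {
        case XDP_TX:
            tx_vec.pkt[tx_vec.cnt++] = rx_vec.pkt[i];
            break;
        case XDP_PASS:
            stk_vec.pkt[stk_vec.cnt++] = rx_vec.pkt[i];
            break;
        default:                /* XDP_DROP / XDP_ABORTED */
            recycle_frame(ring, rx_vec.pkt[i]);
            break;
        }
    }

    /* Stage 3: post all XDP_TX frames, then ring one doorbell */
    for (i = 0; i < tx_vec.cnt; i++)
        post_xdp_tx(ring, tx_vec.pkt[i]);
    if (tx_vec.cnt)
        ring_tx_doorbell(ring);

    /* Stage 4: hand the remaining bundle to the netstack */
    for (i = 0; i < stk_vec.cnt; i++)
        deliver_to_stack(ring, stk_vec.pkt[i]);
}

Each stage's loop body stays small (well under the ~500-instruction guideline
above), and the icache for each stage is loaded once per vector instead of
once per packet.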



Notice: The same mental model switch applies to delivering packets to
the regular netstack.  I've brought this up before[3].  Instead of
flushing the driver's I-cache for every packet by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack (a naive sketch follows the references below).  I showed a 10%
speedup with a naive implementation of this approach.  Edward Cree
also showed[4] a 10% performance boost, and went further into the
stack, showing a 25% increase.

A goal is also to make optimizing netstack code size independent of
the driver code size, by separating the netstack's I-cache usage from
the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89
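
A naive sketch of that kind of bundling, roughly along the lines of [3]
(illustrative; BUNDLE_SIZE and the struct skb_bundle helpers are made up for
this example, only netif_receive_skb() is a real kernel API):

/* Illustrative: collect skbs during the driver RX loop and call into the
 * (icache-heavy) netstack only when the bundle is full or the poll ends. */
#include <linux/skbuff.h>
#include <linux/netdevice.h>

#define BUNDLE_SIZE 32

struct skb_bundle {
    struct sk_buff *skbs[BUNDLE_SIZE];
    int cnt;
};

static void skb_bundle_flush(struct skb_bundle *b)
{
    int i;

    for (i = 0; i < b->cnt; i++)
        netif_receive_skb(b->skbs[i]);  /* stack code runs back-to-back */
    b->cnt = 0;
}

static void skb_bundle_add(struct skb_bundle *b, struct sk_buff *skb)
{
    b->skbs[b->cnt++] = skb;
    if (b->cnt == BUNDLE_SIZE)
        skb_bundle_flush(b);
}

netif_receive_skb() is the only real kernel call here; the bundle structure
itself is just to illustrate keeping the stack's icache footprint separated
from the driver's RX loop.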
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-09 Thread Tom Herbert
On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer
 wrote:
> On Thu, 8 Sep 2016 20:22:04 -0700
> Alexei Starovoitov  wrote:
>
>> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >
>> > I'm sorry but I have a problem with this patch!
>>
>> is it because the variable is called 'xdp_doorbell'?
>> Frankly I see nothing scary in this patch.
>> It extends existing code by adding a flag to ring the doorbell or not.
>> The end of rx napi is used as an obvious heuristic to flush the pipe.
>> Looks pretty generic to me.
>> The same code can be used for non-xdp as well once we figure out
>> a good algorithm for xmit_more in the stack.
>
> What I'm proposing can also be used by the normal stack.
>
>> > Looking at this patch, I want to bring up a fundamental architectural
>> > concern with the development direction of XDP transmit.
>> >
>> >
>> > What you are trying to implement, with delaying the doorbell, is
>> > basically TX bulking for TX_XDP.
>> >
>> >  Why not implement a TX bulking interface directly instead?!?
>> >
>> > Yes, the tailptr/doorbell is the most costly operation, but why not
>> > also take advantage of the benefits of bulking for other parts of the
>> > code? (the benefit is smaller, but every cycle counts in this area)
>> >
>> > This whole XDP exercise is about avoiding a transaction cost per
>> > packet; read: "bulking" or "bundling" of packets, where possible.
>> >
>> >  Let's do bundling/bulking from the start!
>>
>> mlx4 already does bulking and this proposed mlx5 set of patches
>> does bulking as well.
>> See nothing wrong about it. RX side processes the packets and
>> when it's done it tells TX to xmit whatever it collected.
>
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more efficiently.
>
> Let me explain the problem I see a little more clearly, so you
> hopefully see where I'm going.
>
> Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX code, which is a costly operation.
>
>
>> > The reason behind the xmit_more API is that we could not change the
>> > API of all the drivers.  And we found that calling an explicit NDO
>> > flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
>> > would hit the common single-packet use-case.
>> >
>> > It should be really easy to build a bundle of packets that need the XDP_TX
>> > action, especially given you only have a single destination "port".
>> > And then you send this bundle via XDP_TX before mlx5_cqwq_update_db_record.
>>
>> not sure what you are proposing here?
>> Sounds like you want to extend it to multi-port in the future?
>> Sure. The proposed code is easily extendable.
>>
>> Or do you want to see something like a linked list of packets
>> or an array of packets that the RX side is preparing and then
>> send the whole array/list to the TX port?
>> I don't think that would be efficient, since it would mean
>> unnecessary copying of pointers.
>
> I just explained that it will be more efficient due to better use of the icache.
>
>
>> > In the future, XDP needs to support XDP_FWD forwarding of packets/pages
>> > out other interfaces.  I also want bulk transmit from day-1 here.  It
>> > is slightly more tricky to sort packets for multiple outgoing
>> > interfaces efficiently in the poll loop.
>>
>> I don't think so. Multi-port is a natural extension to this set of patches.
>> With multi-port the end of RX will tell multiple ports (that were
>> used to tx) to ring the bell. Pretty trivial and doesn't involve any
>> extra arrays or linked lists.
>
> So, have you solved the problem of exclusive access to a TX ring of a
> remote/different net_device when sending?
>
> In your design you assume there exist many TX rings available for other
> devices to access.  In my design I also want to support devices that
> don't have this HW capability and e.g. only have one TX queue.
>
Right, but segregating TX queues used by the stack from those used
by XDP is pretty fundamental to the design. If we start mixing them,
then we need to pull in several features (such as BQL, which seems like
what you're proposing) into the XDP path. If this starts to slow
things down, or we need to reinvent a bunch of existing features to not
use skbuffs, that seems to run contrary to the "as simple as possible"
model for XDP -- may as well use the regular stack at that point
maybe...

Tom

>
>> > But the mSwitch[1] article actually already solved this destination
>> > sorting.  Please read section 3.3 "Switch Fabric Algorithm" in [1] to
>> > understand the next steps toward a smarter data structure, once we
>> > start to have more TX "ports".  And perhaps align your single
>> > XDP_TX destination data structure to this future development.
>> >
>> > [1] 

Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-09 Thread Alexei Starovoitov
On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > >  Let's do bundling/bulking from the start!
> > 
> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.
> > See nothing wrong about it. RX side processes the packets and
> > when it's done it tells TX to xmit whatever it collected.
> 
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more efficiently.
> 
> Let me explain the problem I see a little more clearly, so you
> hopefully see where I'm going.
> 
> Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX code, which is a costly operation.

correct. And why is that a problem?
As we discussed numerous times before, XDP is deliberately not trying
to address the case where it only handles 10% of the traffic. If most of
the traffic is going into the stack there is no reason to use XDP. We
have tc and netfilter to deal with it. The cases where most of the
traffic needs an skb should not use XDP. If we try to add such use cases
to XDP we will only hurt XDP performance, increase complexity and gain
nothing back.

Let's say a user wants to send 50% into the stack->tcp->socket->user and
another 50% via XDP_TX. The performance is going to be dominated by the stack.
So everything that XDP does to receive and/or transmit is irrelevant.
If we try to optimize XDP for that, we gain nothing in performance.
The user could have used netfilter just as well in such a scenario.
The performance would have been the same.
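
To put rough, purely illustrative numbers on that (these are assumptions,
not measurements): if the stack/TCP/socket path costs on the order of 2000
cycles per packet and the XDP_TX path around 200, a 50/50 mix averages
(2000 + 200) / 2 = 1100 cycles per packet.  Even making XDP_TX twice as
fast only brings that to (2000 + 100) / 2 = 1050 cycles, i.e. under a 5%
gain, because the stack side dominates.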

XDP only makes sense when it's servicing most of the traffic,
like the L4 load balancer, ILA router or DoS prevention use cases.
Sorry for the broken record. XDP is not a solution for every
networking use case. It only makes sense for packet in and packet out.
When a packet goes to the host it has to go through an skb, and
optimizing that path is a task that is orthogonal to the XDP patches.

To make further progress in this discussion, can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.



Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Jesper Dangaard Brouer
On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov  wrote:

> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I'm sorry but I have a problem with this patch!  
> 
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring the doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> a good algorithm for xmit_more in the stack.

What I'm proposing can also be used by the normal stack.
 
> > Looking at this patch, I want to bring up a fundamental architectural
> > concern with the development direction of XDP transmit.
> > 
> > 
> > What you are trying to implement, with delaying the doorbell, is
> > basically TX bulking for TX_XDP.
> > 
> >  Why not implement a TX bulking interface directly instead?!?
> > 
> > Yes, the tailptr/doorbell is the most costly operation, but why not
> > also take advantage of the benefits of bulking for other parts of the
> > code? (the benefit is smaller, but every cycle counts in this area)
> > 
> > This whole XDP exercise is about avoiding a transaction cost per
> > packet; read: "bulking" or "bundling" of packets, where possible.
> > 
> >  Let's do bundling/bulking from the start!
> 
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.

This is doing "hidden" bulking and not really taking advantage of using
the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache.  When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX code, which is a costly operation.

 
> > The reason behind the xmit_more API is that we could not change the
> > API of all the drivers.  And we found that calling an explicit NDO
> > flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
> > would hit the common single-packet use-case.
> > 
> > It should be really easy to build a bundle of packets that need the XDP_TX
> > action, especially given you only have a single destination "port".
> > And then you send this bundle via XDP_TX before mlx5_cqwq_update_db_record.
> 
> not sure what you are proposing here?
> Sounds like you want to extend it to multi-port in the future?
> Sure. The proposed code is easily extendable.
> 
> Or do you want to see something like a linked list of packets
> or an array of packets that the RX side is preparing and then
> send the whole array/list to the TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copying of pointers.

I just explained that it will be more efficient due to better use of the icache.

 
> > In the future, XDP needs to support XDP_FWD forwarding of packets/pages
> > out other interfaces.  I also want bulk transmit from day-1 here.  It
> > is slightly more tricky to sort packets for multiple outgoing
> > interfaces efficiently in the poll loop.  
> 
> I don't think so. Multi-port is a natural extension to this set of patches.
> With multi-port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or linked lists.

So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for other
devices to access.  In my design I also want to support devices that
don't have this HW capability and e.g. only have one TX queue.
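
For illustration, the kind of per-destination bundling being argued for here
could look like the sketch below (everything in it -- struct xdp_tx_bundle,
xdp_fwd_enqueue(), xdp_fwd_flush_port() and the size limits -- is a
hypothetical placeholder, not an existing API); each destination's bundle is
flushed once per poll loop, under whatever TX-ring locking that destination
device requires:

/* Illustrative: sort XDP-forwarded frames into one bundle per destination
 * port during the poll loop; flush each bundle, and ring each destination's
 * doorbell, only once per poll (or when a bundle fills up). */
#define MAX_TX_PORTS  8
#define BUNDLE_SIZE  64

struct xdp_tx_bundle {
    void *frames[BUNDLE_SIZE];
    int   cnt;
};

/* one set of bundles per RX queue, so no cross-CPU contention here */
static struct xdp_tx_bundle tx_bundle[MAX_TX_PORTS];

static void xdp_fwd_flush_port(int port);   /* placeholder flush helper */

static void xdp_fwd_enqueue(int port, void *frame)
{
    struct xdp_tx_bundle *b = &tx_bundle[port];

    b->frames[b->cnt++] = frame;
    if (b->cnt == BUNDLE_SIZE)
        xdp_fwd_flush_port(port);           /* bundle full: flush early */
}

static void xdp_fwd_flush_all(void)
{
    int port;

    for (port = 0; port < MAX_TX_PORTS; port++) {
        if (!tx_bundle[port].cnt)
            continue;
        /* xdp_fwd_flush_port() would take the destination's TX ring lock
         * (needed if that device has only one TX queue), post all frames,
         * and ring one doorbell. */
        xdp_fwd_flush_port(port);
    }
}

Whether this is worth the extra bookkeeping, compared to simply telling each
port to ring its bell at the end of RX, is exactly the trade-off being
debated here.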


> > But the mSwitch[1] article actually already solved this destination
> > sorting.  Please read section 3.3 "Switch Fabric Algorithm" in [1] to
> > understand the next steps toward a smarter data structure, once we
> > start to have more TX "ports".  And perhaps align your single
> > XDP_TX destination data structure to this future development.
> > 
> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf  
> 
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
> 
> > --Jesper
> > (top post)  
> 
> since when is it ok to top post?

What a kneejerk reaction.  When writing something general we often
reply to the top of the email, and then often delete the rest (which
makes it hard for late-comers to follow).  I was bcc'ing some people
who needed the context, so it was a service note to you that I didn't
write anything below.

 
> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  
> > wrote:
> >   
> > > Previously we rang XDP 

Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Alexei Starovoitov
On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> 
> I'm sorry but I have a problem with this patch!

is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring the doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out
a good algorithm for xmit_more in the stack.
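
As a rough sketch of the pattern being described (not the actual mlx5 code;
the rx_desc type and the helper functions are placeholders for whatever the
driver really uses):

/* Illustrative: queue XDP_TX packets during the RX napi poll and ring the
 * TX doorbell once at the end, instead of once per forwarded packet. */
#include <linux/bpf.h>          /* XDP_TX / XDP_PASS action values */
#include <linux/netdevice.h>    /* struct napi_struct */

struct rx_desc;                 /* opaque placeholder descriptor type */

static int rx_napi_poll(struct napi_struct *napi, int budget)
{
    bool xdp_doorbell = false;
    int work_done = 0;

    while (work_done < budget) {
        struct rx_desc *rx = next_rx_desc(napi);

        if (!rx)
            break;

        switch (run_xdp_prog(rx)) {
        case XDP_TX:
            post_xdp_tx_desc(rx);       /* post TX descriptor only */
            xdp_doorbell = true;        /* defer the doorbell      */
            break;
        case XDP_PASS:
            pass_to_stack(rx);
            break;
        default:
            recycle_rx_page(rx);
            break;
        }
        work_done++;
    }

    if (xdp_doorbell)
        ring_xdp_tx_doorbell();         /* one MMIO write per poll */

    return work_done;
}

The real patch obviously differs in the details; the point is just that the
doorbell (the expensive MMIO write) happens once per poll rather than once
per forwarded packet.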

> Looking at this patch, I want to bring up a fundamental architectural
> concern with the development direction of XDP transmit.
> 
> 
> What you are trying to implement, with delaying the doorbell, is
> basically TX bulking for TX_XDP.
> 
>  Why not implement a TX bulking interface directly instead?!?
> 
> Yes, the tailptr/doorbell is the most costly operation, but why not
> also take advantage of the benefits of bulking for other parts of the
> code? (the benefit is smaller, but every cycle counts in this area)
> 
> This whole XDP exercise is about avoiding a transaction cost per
> packet; read: "bulking" or "bundling" of packets, where possible.
> 
>  Let's do bundling/bulking from the start!

mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.

> The reason behind the xmit_more API is that we could not change the
> API of all the drivers.  And we found that calling an explicit NDO
> flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
> would hit the common single-packet use-case.
> 
> It should be really easy to build a bundle of packets that need the XDP_TX
> action, especially given you only have a single destination "port".
> And then you send this bundle via XDP_TX before mlx5_cqwq_update_db_record.

not sure what you are proposing here?
Sounds like you want to extend it to multi-port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets
or an array of packets that the RX side is preparing and then
send the whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.

> In the future, XDP needs to support XDP_FWD forwarding of packets/pages
> out other interfaces.  I also want bulk transmit from day-1 here.  It
> is slightly more tricky to sort packets for multiple outgoing
> interfaces efficiently in the poll loop.

I don't think so. Multi-port is a natural extension to this set of patches.
With multi-port the end of RX will tell multiple ports (that were
used to tx) to ring the bell. Pretty trivial and doesn't involve any
extra arrays or linked lists.

> But the mSwitch[1] article actually already solved this destination
> sorting.  Please read section 3.3 "Switch Fabric Algorithm" in [1] to
> understand the next steps toward a smarter data structure, once we
> start to have more TX "ports".  And perhaps align your single
> XDP_TX destination data structure to this future development.
> 
> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf

I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

> --Jesper
> (top post)

since when is it ok to top post?

> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  wrote:
> 
> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > 
> > Here we introduce an xmit_more-like mechanism that will queue up more
> > than one packet into the SQ (up to the RX napi budget) w/o notifying the hardware.
> > 
> > Once the RX napi budget is consumed and we exit the napi RX loop, we will
> > flush (ring the doorbell for) all XDP looped packets, in case there are any.
> > 
> > XDP forward packet rate:
> > 
> > Comparing XDP with and w/o xmit more (bulk transmit):
> > 
> > Streams XDP TX   XDP TX (xmit more)
> > ---
> > 1   4.90Mpps  7.50Mpps
> > 2   9.50Mpps  14.8Mpps
> > 4   16.5Mpps  25.1Mpps
> > 8   21.5Mpps  27.5Mpps*
> > 16  24.1Mpps  27.5Mpps*
> > 
> > *It seems we hit a wall at 27.5Mpps for 8 and 16 streams;
> > we will be working on the analysis and will publish the conclusions
> > later.
> > 
> > Signed-off-by: Saeed Mahameed 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en.h|  9 ++--
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++--
> >  2 files changed, 49 insertions(+), 17 deletions(-)
...
> > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> > mlx5e_read_mini_arr_slot(cq, cqcc);
> >  
> > mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >  
> > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */

Saeed,
please make sure to remove