Re: multiple interface input queues

2017-11-14 Thread David Gwynne

> On 14 Nov 2017, at 23:17, Martin Pieuchot  wrote:
> 
> On 14/11/17(Tue) 14:42, David Gwynne wrote:
>> this replaces the single mbuf_queue and task in struct ifnet with
>> a new ifiqueue structure modelled on ifqueues.
> 
> The name is confusing, should we rename ifqueues 'ifoqueue' then?

yes.

> 
>> the main motivation behind this was to show mpsafe input counters.
> 
> I like that you're sharing the whole direction you're going in.
> Nice to know we don't need to address input counters :)
> 
>> ifiqueues, like ifqueues, allow a driver to configure multiple
>> queues. this in turn allows a driver with multiple rx rings to
>> have them queue and account for packets independently.
> 
> How does a driver decide to use multiple queues?  Aren't we missing
> the interrupt on !CPU0 bits?  Once that's done how do we split traffic?

i dont have a good answer for how a driver decides to use multiple queues,
but i can show how it uses them once the decision is made. something like this:

	unsigned int i;

	if_attach_queues(ifp, sc->num_queues);
	if_attach_iqueues(ifp, sc->num_queues);
	for (i = 0; i < sc->num_queues; i++) {
		struct ifqueue *ifq = ifp->if_ifqs[i];
		struct tx_ring *txr = &sc->tx_rings[i];
		struct ifiqueue *ifiq = ifp->if_iqs[i];
		struct rx_ring *rxr = &sc->rx_rings[i];

		ifq->ifq_softc = txr;
		txr->ifq = ifq;

		ifiq->ifiq_softc = rxr;
		rxr->ifiq = ifiq;
	}
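
the point being that if_attach_iqueues deliberately mirrors
if_attach_queues, so the rx ring glue in a driver ends up looking just
like the tx side.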


> 
>> ifiq_input generally replaces if_input. if_input now simply queues
>> on ifnets builtin ifiqueue (ifp->if_rcv) to support the current set
>> of drivers that only configure a single rx context/ring.
> 
> So we always use at least `ifp->if_rcv'.  Can't we add an argument to
> if_attach_common() rather than adding a new interface?

do you want to make if_rcv optional rather than unconditional?

> 
>> ifiq counters are updated and read using the same semantics as percpu
>> counters, ie, there's a generation number updated before and after
>> a counter update. this means writers dont block, but a reader may
>> loop a couple of times waiting for a writer to finish.
> 
> Why do we need yet another copy of counters_enter()/leave()?  Something
> feels wrong.  I *know* it is not per-CPU memory, but I wish we could do
> better.  Otherwise we will end up with a per-data-structure counter API.

i got rid of this in the next diff i sent.

if you're saying it would be nice to factor generation-based counter
updates/reads out, i agree. apart from this there hasn't been a need for it
though.
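
for the record, the generation dance being talked about has roughly
this shape (a minimal sketch with made-up names, not the actual ifiq
code; membar_producer/membar_consumer and yield are the kernel's):

	struct gen_counters {
		uint64_t	g_gen;		/* odd while an update is in flight */
		uint64_t	g_packets;
		uint64_t	g_bytes;
	};

	/* writer: never blocks, just brackets the update with gen bumps */
	void
	gen_add(struct gen_counters *gc, uint64_t packets, uint64_t bytes)
	{
		gc->g_gen++;			/* now odd: update in progress */
		membar_producer();
		gc->g_packets += packets;
		gc->g_bytes += bytes;
		membar_producer();
		gc->g_gen++;			/* even again: update finished */
	}

	/* reader: loops until it sees a stable, even generation */
	void
	gen_read(struct gen_counters *gc, uint64_t *packets, uint64_t *bytes)
	{
		uint64_t gen;

		for (;;) {
			gen = gc->g_gen;
			membar_consumer();
			*packets = gc->g_packets;
			*bytes = gc->g_bytes;
			membar_consumer();
			if ((gen & 1) == 0 && gen == gc->g_gen)
				break;		/* snapshot is consistent */
			yield();		/* writer was active, try again */
		}
	}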

> 
>> loop a couple of times waiting for a writer to finish. readers call
>> yield(), which causes splasserts to fire if if_getdata is still
>> holding the kernel lock.
> 
> You mean the NET_LOCK()?  Well, this should be easily removed.  tb@ has a
> diff that should go in then you can remove the NET_LOCK()/UNLOCK() dance
> around if_getdata().

it's moving to^W^R^R^Red to RLOCK?

> 
>> ifiq_input is set up to interact with the interface rx ring moderation
>> code (ie, if_rxring code). you pass what the current rxr watermark
>> is, and it will look at the backlog of packets on that ifiq to
>> determine if the ring is producing too fast. if it is, it'll return
>> 1 to slow down packet reception. i have a later diff that adds
>> functionality to the if_rxring bits so a driver can say a ring
>> is livelocked, rather than relying on the system to detect it. so
>> drv_rxeof would have the following at the end of it:
>> 
>> 	if (ifiq_input(rxr->ifiq, &ml, if_rxr_cwm(&rxr->rx_ring)))
>> 		if_rxr_livelocked(&rxr->rx_ring);
>> 	drv_rx_refill(rxr);
> 
> This magic is worth discussing in a diff doing only that :)

i'll send if_rxr_livelocked and if_rxr_cwm out today.
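
to give a rough idea of their shape ahead of that mail, something like
the below (a guess at the implementation, not the actual diff; the
if_rxring field names are assumptions):

	/* expose the ring's current watermark */
	unsigned int
	if_rxr_cwm(struct if_rxring *rxr)
	{
		return (rxr->rxr_cwm);
	}

	/*
	 * the driver says the ring is livelocked, so back the current
	 * watermark off toward the low watermark, at most once a tick.
	 */
	void
	if_rxr_livelocked(struct if_rxring *rxr)
	{
		extern int ticks;

		if (ticks - rxr->rxr_adjusted >= 1) {
			if (rxr->rxr_cwm > rxr->rxr_lwm)
				rxr->rxr_cwm--;

			rxr->rxr_adjusted = ticks;
		}
	}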

> 
>> ive run with that on a hacked up ix(4) that runs with 8 rx rings,
>> and this works pretty well. if you're doing a single stream, you
>> see one rx ring grow the number of descs it will handle, but if you
>> flood it with a range of traffic you'll see that one ring scale
>> down and balance out with the rest of the rings. it turns out you
>> can still get reasonable throughput even if the ifiqs are dynamically
>> scaling themselves to only 100 packets. however, the interactivity
>> of the system improves a lot.
> 
> So you're now introducing MCLGETI(9) back with support for multiple
> rings.  Nice.

yes :D :D

> 
>> currently if one interface is being DoSed, it'll end up with 8192
>> packets on if_inputqueue. that takes a while to process those packets,
>> which blocks the processing of packets from the other interface.
>> by scaling the input queues to relatively small counts, softnet can
>> service packets from other interfaces sooner.
> 
> I see a lot of good stuff in this diff.  I like your direction.  I'm
> not sure how/when you plan to add support for multiple input rings.
> 
> Now I'm afraid of a single diff doing refactoring + iqueues + counter +
> MCLGETI :o)

Re: multiple interface input queues

2017-11-14 Thread Martin Pieuchot
On 14/11/17(Tue) 14:42, David Gwynne wrote:
> this replaces the single mbuf_queue and task in struct ifnet with
> a new ifiqueue structure modelled on ifqueues.

The name is confusing, should we rename ifqueues 'ifoqueue' then? 

> the main motivation behind this was to show mpsafe input counters.

I like that you're sharing the whole direction you're going in.
Nice to know we don't need to address input counters :)

> ifiqueues, like ifqueues, allow a driver to configure multiple
> queues. this in turn allows a driver with multiple rx rings to
> have them queue and account for packets independently.

How does a driver decide to use multiple queues?  Aren't we missing
the interrupt on !CPU0 bits?  Once that's done how do we split traffic?

> ifiq_input generally replaces if_input. if_input now simply queues
> on ifnets builtin ifiqueue (ifp->if_rcv) to support the current set
> of drivers that only configure a single rx context/ring.

So we always use at least `ifp->if_rcv'.  Can't we add an argument to
if_attach_common() rather than adding a new interface?

> ifiq counters are updated and read using the same semantics as percpu
> counters, ie, there's a generation number updated before and after
> a counter update. this means writers dont block, but a reader may
> loop a couple of times waiting for a writer to finish.

Why do we need yet another copy of counters_enter()/leave()?  Something
feels wrong.  I *know* it is not per-CPU memory, but I wish we could do
better.  Otherwise we will end up with a per-data-structure counter API.

> loop a couple of times waiting for a writer to finish. readers call
> yield(), which causes splasserts to fire if if_getdata is still
> holding the kernel lock.

You mean the NET_LOCK()?  Well, this should be easily removed.  tb@ has a
diff that should go in then you can remove the NET_LOCK()/UNLOCK() dance
around if_getdata().

> ifiq_input is set up to interact with the interface rx ring moderation
> code (ie, if_rxring code). you pass what the current rxr watermark
> is, and it will look at the backlog of packets on that ifiq to
> determine if the ring is producing too fast. if it is, it'll return
> 1 to slow down packet reception. i have a later diff that adds
> functionality to the if_rxring bits so a driver can say a ring
> is livelocked, rather than relying on the system to detect it. so
> drv_rxeof would have the following at the end of it:
> 
> 	if (ifiq_input(rxr->ifiq, &ml, if_rxr_cwm(&rxr->rx_ring)))
> 		if_rxr_livelocked(&rxr->rx_ring);
> 	drv_rx_refill(rxr);

This magic is worth discussing in a diff doing only that :)

> ive run with that on a hacked up ix(4) that runs with 8 rx rings,
> and this works pretty well. if you're doing a single stream, you
> see one rx ring grow the number of descs it will handle, but if you
> flood it with a range of traffic you'll see that one ring scale
> down and balance out with the rest of the rings. it turns out you
> can still get reasonable throughput even if the ifiqs are dynamically
> scaling themselves to only 100 packets. however, the interactivity
> of the system improves a lot.

So you're now introducing MCLGETI(9) back with support for multiple
rings.  Nice.

> currently if one interface is being DoSed, it'll end up with 8192
> packets on if_inputqueue. that takes a while to process those packets,
> which blocks the processing of packets from the other interface.
> by scaling the input queues to relatively small counts, softnet can
> service packets from other interfaces sooner.

I see a lot of good stuff in this diff.  I like your direction.  I'm
not sure how/when you plan to add support for multiple input rings.

Now I'm afraid of a single diff doing refactoring + iqueues + counter +
MCLGETI :o)

> 
> Index: if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.527
> diff -u -p -r1.527 if.c
> --- if.c  14 Nov 2017 04:08:11 -  1.527
> +++ if.c  14 Nov 2017 04:18:41 -
> @@ -156,7 +156,6 @@ int   if_group_egress_build(void);
>  
>  void if_watchdog_task(void *);
>  
> -void if_input_process(void *);
>  void if_netisr(void *);
>  
>  #ifdef DDB
> @@ -437,8 +436,6 @@ if_attachsetup(struct ifnet *ifp)
>  
>   ifidx = ifp->if_index;
>  
> - mq_init(&ifp->if_inputqueue, 8192, IPL_NET);
> - task_set(ifp->if_inputtask, if_input_process, (void *)ifidx);
>   task_set(ifp->if_watchdogtask, if_watchdog_task, (void *)ifidx);
>   task_set(ifp->if_linkstatetask, if_linkstate_task, (void *)ifidx);
>  
> @@ -563,6 +560,30 @@ if_attach_queues(struct ifnet *ifp, unsi
>  }
>  
>  void
> +if_attach_iqueues(struct ifnet *ifp, unsigned int niqs)
> +{
> + struct ifiqueue **map;
> + struct ifiqueue *ifiq;
> + unsigned int i;
> +
> + KASSERT(niqs != 0);
> +
> + map = mallocarray(niqs, sizeof(*map), M_DEVBUF, M_WAITOK);
> +
> + ifp->if_rcv.ifiq_softc = NULL;
> + map[0] = &

Re: multiple interface input queues

2017-11-14 Thread David Gwynne
On Tue, Nov 14, 2017 at 02:42:30PM +1000, David Gwynne wrote:
> this replaces the single mbuf_queue and task in struct ifnet with
> a new ifiqueue structure modelled on ifqueues.
> 
> the main motivation behind this was to show mpsafe input counters.
> 
> ifiqueues, like ifqueues, allow a driver to configure multiple
> queues. this in turn allows a driver with multiple rx rings to
> have them queue and account for packets independently.
> 
> ifiq_input generally replaces if_input. if_input now simply queues
> on ifnets builtin ifiqueue (ifp->if_rcv) to support the current set
> of drivers that only configure a single rx context/ring.
> 
> it is up to the driver to ensure that an ifiqueue is not used
> concurrently. in practice this should be easy cos nics generally
> only rx packets by processing a single ring from a single interrupt
> source.
> 
> ifiq counters are updated and read using the same semantics as percpu
> counters, ie, there's a generation number updated before and after
> a counter update. this means writers dont block, but a reader may
> loop a couple of times waiting for a writer to finish. readers call
> yield(), which causes splasserts to fire if if_getdata is still
> holding the kernel lock.
> 
> ifiq_input is set up to interact with the interface rx ring moderation
> code (ie, if_rxring code). you pass what the current rxr watermark
> is, and it will look at the backlog of packets on that ifiq to
> determine if the ring is producing too fast. if it is, it'll return
> 1 to slow down packet reception. i have a later diff that adds
> functionality to the if_rxring bits so a driver can say a ring
> is livelocked, rather than relying on the system to detect it. so
> drv_rxeof would have the following at the end of it:
> 
> 	if (ifiq_input(rxr->ifiq, &ml, if_rxr_cwm(&rxr->rx_ring)))
> 		if_rxr_livelocked(&rxr->rx_ring);
> 	drv_rx_refill(rxr);
> 
> ive run with that on a hacked up ix(4) that runs with 8 rx rings,
> and this works pretty well. if you're doing a single stream, you
> see one rx ring grow the number of descs it will handle, but if you
> flood it with a range of traffic you'll see that one ring scale
> down and balance out with the rest of the rings. it turns out you
> can still get reasonable throughput even if the ifiqs are dynamically
> scaling themselves to only 100 packets. however, the interactivity
> of the system improves a lot.
> 
> currently if one interface is being DoSed, it'll end up with 8192
> packets on if_inputqueue. that takes a while to process those packets,
> which blocks the processing of packets from the other interface.
> by scaling the input queues to relatively small counts, softnet can
> service packets from other interfaces sooner.

two things occurred to me while driving home today.

firstly, while it is true that hardware rings are the only source
of packets for an input queue, this is not true for pseudo drivers,
eg, vlan, gre, gif, etc. packets can spontaneously come in from
anywhere. the result of this is that the ifiq_gen dance wont work,
particularly if it gets stuck with an odd value, which will block
readers indefinitely.

secondly, the ifiq_gen stuff exists to avoid atomic ops, but we
already use a mutex to protect the list of mbufs.

the diff below gets rid of the ifiq_gen stuff and uses the ifiq_mutex
to protect the counters as well as the mbuf list.
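
the enqueue path ends up roughly like this (a simplified sketch, not
the exact code in the diff below; the backpressure threshold and some
of the field names are guesses):

	int
	ifiq_input(struct ifiqueue *ifiq, struct mbuf_list *ml,
	    unsigned int cwm)
	{
		unsigned int backlog;

		mtx_enter(&ifiq->ifiq_mtx);
		/* the real code also sums bytes while walking the list */
		ifiq->ifiq_packets += ml_len(ml);
		ml_enlist(&ifiq->ifiq_ml, ml);	/* counters and list together */
		backlog = ml_len(&ifiq->ifiq_ml);
		mtx_leave(&ifiq->ifiq_mtx);

		task_add(softnettq, &ifiq->ifiq_task);

		/*
		 * tell the driver to slow down if the backlog has grown
		 * well past the ring's current watermark (the multiplier
		 * here is made up).
		 */
		return (backlog >= cwm * 4);
	}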

Index: if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.527
diff -u -p -r1.527 if.c
--- if.c14 Nov 2017 04:08:11 -  1.527
+++ if.c14 Nov 2017 10:51:24 -
@@ -156,7 +156,6 @@ int if_group_egress_build(void);
 
 void   if_watchdog_task(void *);
 
-void   if_input_process(void *);
 void   if_netisr(void *);
 
 #ifdef DDB
@@ -437,8 +436,6 @@ if_attachsetup(struct ifnet *ifp)
 
ifidx = ifp->if_index;
 
-   mq_init(&ifp->if_inputqueue, 8192, IPL_NET);
-   task_set(ifp->if_inputtask, if_input_process, (void *)ifidx);
task_set(ifp->if_watchdogtask, if_watchdog_task, (void *)ifidx);
task_set(ifp->if_linkstatetask, if_linkstate_task, (void *)ifidx);
 
@@ -563,6 +560,30 @@ if_attach_queues(struct ifnet *ifp, unsi
 }
 
 void
+if_attach_iqueues(struct ifnet *ifp, unsigned int niqs)
+{
+   struct ifiqueue **map;
+   struct ifiqueue *ifiq;
+   unsigned int i;
+
+   KASSERT(niqs != 0);
+
+   map = mallocarray(niqs, sizeof(*map), M_DEVBUF, M_WAITOK);
+
+   ifp->if_rcv.ifiq_softc = NULL;
+   map[0] = &ifp->if_rcv;
+
+   for (i = 1; i < niqs; i++) {
+   ifiq = malloc(sizeof(*ifiq), M_DEVBUF, M_WAITOK|M_ZERO);
+   ifiq_init(ifiq, ifp, i);
+   map[i] = ifiq;
+   }
+
+   ifp->if_iqs = map;
+   ifp->if_niqs = niqs;
+}
+
+void
 if_attach_common(struct ifnet *ifp)
 {
KASSERT(ifp->if_ioctl != NULL);
@@ -587,6 +608,12 @@ if_attach_common(struct ifnet *ifp)
ifp->if_ifqs = ifp->i

Re: multiple interface input queues

2017-11-13 Thread Hrvoje Popovski
On 14.11.2017. 5:42, David Gwynne wrote:
> this replaces the single mbuf_queue and task in struct ifnet with
> a new ifiqueue structure modelled on ifqueues.


i've tested this diff with ix, myx, em and bge with down/up interfaces
and everything is working fine ...



multiple interface input queues

2017-11-13 Thread David Gwynne
this replaces the single mbuf_queue and task in struct ifnet with
a new ifiqueue structure modelled on ifqueues.

the main motivation behind this was to show mpsafe input counters.

ifiqueues, like ifqueues, allow a driver to configure multiple
queues. this in turn allows a driver with multiple rx rings to
have them queue and account for packets independently.

ifiq_input generally replaces if_input. if_input now simply queues
on ifnets builtin ifiqueue (ifp->if_rcv) to support the current set
of drivers that only configure a single rx context/ring.

it is up to the driver to ensure that an ifiqueue is not used
concurrently. in practice this should be easy cos nics generally
only rx packets by processing a single ring from a single interrupt
source.

ifiq counters are updated and read using the same semantics as percpu
counters, ie, there's a generation number updated before and after
a counter update. this means writers dont block, but a reader may
loop a couple of times waiting for a writer to finish. readers call
yield(), which causes splasserts to fire if if_getdata is still
holding the kernel lock.

ifiq_input is set up to interact with the interface rx ring moderation
code (ie, if_rxring code). you pass what the current rxr watermark
is, and it will look at the backlog of packets on that ifiq to
determine if the ring is producing too fast. if it is, it'll return
1 to slow down packet reception. i have a later diff that adds
functionality to the if_rxring bits so a driver can say a ring
is livelocked, rather than relying on the system to detect it. so
drv_rxeof would have the following at the end of it:

	if (ifiq_input(rxr->ifiq, &ml, if_rxr_cwm(&rxr->rx_ring)))
		if_rxr_livelocked(&rxr->rx_ring);
	drv_rx_refill(rxr);

ive run with that on a hacked up ix(4) that runs with 8 rx rings,
and this works pretty well. if you're doing a single stream, you
see one rx ring grow the number of descs it will handle, but if you
flood it with a range of traffic you'll see that one ring scale
down and balance out with the rest of the rings. it turns out you
can still get reasonable throughput even if the ifiqs are dynamically
scaling themselves to only 100 packets. however, the interactivity
of the system improves a lot.

currently if one interface is being DoSed, it'll end up with 8192
packets on if_inputqueue. that takes a while to process those packets,
which blocks the processing of packets from the other interface.
by scaling the input queues to relatively small counts, softnet can
service packets from other interfaces sooner.

Index: if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.527
diff -u -p -r1.527 if.c
--- if.c14 Nov 2017 04:08:11 -  1.527
+++ if.c14 Nov 2017 04:18:41 -
@@ -156,7 +156,6 @@ int if_group_egress_build(void);
 
 void   if_watchdog_task(void *);
 
-void   if_input_process(void *);
 void   if_netisr(void *);
 
 #ifdef DDB
@@ -437,8 +436,6 @@ if_attachsetup(struct ifnet *ifp)
 
ifidx = ifp->if_index;
 
-   mq_init(&ifp->if_inputqueue, 8192, IPL_NET);
-   task_set(ifp->if_inputtask, if_input_process, (void *)ifidx);
task_set(ifp->if_watchdogtask, if_watchdog_task, (void *)ifidx);
task_set(ifp->if_linkstatetask, if_linkstate_task, (void *)ifidx);
 
@@ -563,6 +560,30 @@ if_attach_queues(struct ifnet *ifp, unsi
 }
 
 void
+if_attach_iqueues(struct ifnet *ifp, unsigned int niqs)
+{
+   struct ifiqueue **map;
+   struct ifiqueue *ifiq;
+   unsigned int i;
+
+   KASSERT(niqs != 0);
+
+   map = mallocarray(niqs, sizeof(*map), M_DEVBUF, M_WAITOK);
+
+   ifp->if_rcv.ifiq_softc = NULL;
+   map[0] = &ifp->if_rcv;
+
+   for (i = 1; i < niqs; i++) {
+   ifiq = malloc(sizeof(*ifiq), M_DEVBUF, M_WAITOK|M_ZERO);
+   ifiq_init(ifiq, ifp, i);
+   map[i] = ifiq;
+   }
+
+   ifp->if_iqs = map;
+   ifp->if_niqs = niqs;
+}
+
+void
 if_attach_common(struct ifnet *ifp)
 {
KASSERT(ifp->if_ioctl != NULL);
@@ -587,6 +608,12 @@ if_attach_common(struct ifnet *ifp)
ifp->if_ifqs = ifp->if_snd.ifq_ifqs;
ifp->if_nifqs = 1;
 
+   ifiq_init(&ifp->if_rcv, ifp, 0);
+
+   ifp->if_rcv.ifiq_ifiqs[0] = &ifp->if_rcv;
+   ifp->if_iqs = ifp->if_rcv.ifiq_ifiqs;
+   ifp->if_niqs = 1;
+
ifp->if_addrhooks = malloc(sizeof(*ifp->if_addrhooks),
M_TEMP, M_WAITOK);
TAILQ_INIT(ifp->if_addrhooks);
@@ -605,8 +632,6 @@ if_attach_common(struct ifnet *ifp)
M_TEMP, M_WAITOK|M_ZERO);
ifp->if_linkstatetask = malloc(sizeof(*ifp->if_linkstatetask),
M_TEMP, M_WAITOK|M_ZERO);
-   ifp->if_inputtask = malloc(sizeof(*ifp->if_inputtask),
-   M_TEMP, M_WAITOK|M_ZERO);
ifp->if_llprio = IFQ_DEFPRIO;
 
SRPL_INIT(&ifp->if_inputs);
@@ -694,47 +719,7 @@ if_enqueue(struct ifnet *ifp, struct mbu
 voi