Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 9, 2016 at 8:26 PM, John Fastabend wrote:
> On 16-09-09 08:12 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov wrote:
>>> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend wrote:
> On 16-09-09 06:04 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend wrote:
>>> On 16-09-09 04:44 PM, Tom Herbert wrote:
On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
Hi John,

How does this interact with BQL on e1000?

Tom
>>>
>>> Let me check if I have the API correct. When we enqueue a packet to
>>> be sent we must issue a netdev_sent_queue() call and then on actual
>>> transmission issue a netdev_completed_queue().
>>>
>>> The patch attached here missed a few things though.
>>>
>>> But it looks like I just need to call netdev_sent_queue() from the
>>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>>> kick in which will call netdev_completed_queue() correctly.
>>>
>>> I'll need to add a check for the queue state as well. So if I do these
>>> three things,
>>>
>>> check __QUEUE_STATE_XOFF before sending
>>> netdev_sent_queue() -> on XDP_TX
>>> netdev_completed_queue()
>>>
>>> It should work, agree? Now should we do this even when XDP owns the
>>> queue? Or is this purely an issue with sharing the queue between
>>> XDP and stack.
>>>
>> But what is the action for XDP_TX if the queue is stopped? There is no
>> qdisc to back pressure in the XDP path. Would we just start dropping
>> packets then?
>
> Yep that is what the patch does if there is any sort of error: packets
> get dropped on the floor. I don't think there is anything else that
> can be done.
>
That probably means that the stack will always win out under load. Trying to use the same queue where half of the packets are well managed by a qdisc and half aren't is going to leave someone unhappy. Maybe in this case where we have to share the qdisc we can allocate the skb on returning XDP_TX and send through the normal qdisc for the device.
>>>
>>> I wouldn't go to such extremes for e1k.
>>> The only reason to have xdp in e1k is to use it for testing
>>> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
>>
>> I imagine someone may want this for the non-forwarding use cases like
>> early drop for DOS mitigation. Regardless of the use case, I don't
>> think we can break the fundamental assumptions made for qdiscs or the
>> rest of the transmit path. If XDP must transmit on a queue shared with
>> the stack we need to abide by the stack's rules for transmitting on
>> the queue-- which would mean alloc skbuff and go through qdisc (which
>
> If we require XDP_TX to go up to qdisc layer its best not to implement
> it at all and just handle it in normal ingress path. That said I think
> users have to expect that XDP will interfere with qdisc schemes. Even
> with its own tx queue its going to interfere at the hardware level with
> bandwidth as the hardware round robins through the queues or uses
> whatever hardware strategy it is configured to use. Additionally it
> will bypass things like BQL, etc.
>
Right, but not all use cases involve XDP_TX (like DOS mitigation as I pointed out). Since you've already done 95% of the work, can you take a look at creating the skbuff and injecting into the stack for XDP_TX so we can evaluate the performance and impact of that :-)

With separate TX queues it's explicit which queues are managed by the stack. This is no different than what kernel bypass gives us, we are relying on HW to do something reasonable in scheduling MQ.
>> really shouldn't be difficult to implement). Emulating various
>> functions of the stack in the XDP TX path, like this patch seems to be
>> doing for XMIT_MORE, potentially gets us into a whack-a-mole situation
>> trying to keep things coherent.
>
> I think bundling tx xmits is fair game as an internal optimization and
> doesn't need to be exposed at the XDP layer. Drivers already do this
> type of optimizations for
Re: [PATCH] ATM-iphase: Use kmalloc_array() in tx_init()
From: SF Markus Elfring
Date: Fri, 9 Sep 2016 20:42:16 +0200

> From: Markus Elfring
> Date: Fri, 9 Sep 2016 20:40:16 +0200
>
> * Multiplications for the size determination of memory allocations
> indicated that array data structures should be processed.
> Thus use the corresponding function "kmalloc_array".
>
> This issue was detected by using the Coccinelle software.
>
> * Replace the specification of data types by pointer dereferences
> to make the corresponding size determination a bit safer according to
> the Linux coding style convention.
>
> Signed-off-by: Markus Elfring

Applied.
Re: [PATCH v3 0/9] net-next: ethernet: add sun8i-emac driver
From: Corentin Labbe
Date: Fri, 9 Sep 2016 14:45:08 +0200

> This patch series adds the driver for sun8i-emac which handles the
> Ethernet MAC present on Allwinner H3/A83T/A64 SoCs.

Please don't post a patch series with some subset of the series marked
as "RFC". I will simply toss the entire series when you do this.

Thank you.
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 09, 2016 at 08:12:52PM -0700, Tom Herbert wrote:
> >> That probably means that the stack will always win out under load.
> >> Trying to use the same queue where half of the packets are well
> >> managed by a qdisc and half aren't is going to leave someone unhappy.
> >> Maybe in this case where we have to share the qdisc we can
> >> allocate the skb on returning XDP_TX and send through the normal
> >> qdisc for the device.
> >
> > I wouldn't go to such extremes for e1k.
> > The only reason to have xdp in e1k is to use it for testing
> > of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
>
> I imagine someone may want this for the non-forwarding use cases like
> early drop for DOS mitigation.

Sure, and they will be doing it on the NICs that they have in their servers. e1k is not that nic. xdp e1k is for debugging xdp programs in KVM only. Performance of such xdp programs on e1k is irrelevant. There is absolutely no need to complicate the driver and the patches. All other drivers are a different story.
Re: [PATCH net-next 0/4] alx: add msi-x support
From: Tobias Regnery
Date: Fri, 9 Sep 2016 12:19:51 +0200

> This patchset adds msi-x support to the alx driver. It is a preparatory
> series for multi queue support, which I am currently working on. As there
> is no advantage over msi interrupts without multi queue support, msi-x
> interrupts are disabled by default. In order to test for regressions, a
> new module parameter is added to enable msi-x interrupts.
>
> Based on information of the downstream driver at github.com/qca/alx

Series applied, thanks.
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On 16-09-09 08:12 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov wrote:
>> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
>>> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend wrote:
On 16-09-09 06:04 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend wrote:
>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote:
e1000 supports a single TX queue so it is being shared with the stack when XDP runs XDP_TX action. This requires taking the xmit lock to ensure we don't corrupt the tx ring. To avoid taking and dropping the lock per packet this patch adds a bundling implementation to submit a bundle of packets to the xmit routine. I tested this patch running e1000 in a VM using KVM over a tap device using pktgen to generate traffic along with 'ping -f -l 100'.
>>> Hi John,
>>>
>>> How does this interact with BQL on e1000?
>>>
>>> Tom
>>>
>>
>> Let me check if I have the API correct. When we enqueue a packet to
>> be sent we must issue a netdev_sent_queue() call and then on actual
>> transmission issue a netdev_completed_queue().
>>
>> The patch attached here missed a few things though.
>>
>> But it looks like I just need to call netdev_sent_queue() from the
>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>> kick in which will call netdev_completed_queue() correctly.
>>
>> I'll need to add a check for the queue state as well. So if I do these
>> three things,
>>
>> check __QUEUE_STATE_XOFF before sending
>> netdev_sent_queue() -> on XDP_TX
>> netdev_completed_queue()
>>
>> It should work, agree? Now should we do this even when XDP owns the
>> queue? Or is this purely an issue with sharing the queue between
>> XDP and stack.
>>
> But what is the action for XDP_TX if the queue is stopped? There is no
> qdisc to back pressure in the XDP path. Would we just start dropping
> packets then?

Yep that is what the patch does if there is any sort of error: packets get dropped on the floor. I don't think there is anything else that can be done.
>>> That probably means that the stack will always win out under load.
>>> Trying to use the same queue where half of the packets are well
>>> managed by a qdisc and half aren't is going to leave someone unhappy.
>>> Maybe in this case where we have to share the qdisc we can
>>> allocate the skb on returning XDP_TX and send through the normal
>>> qdisc for the device.
>>
>> I wouldn't go to such extremes for e1k.
>> The only reason to have xdp in e1k is to use it for testing
>> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
>
> I imagine someone may want this for the non-forwarding use cases like
> early drop for DOS mitigation. Regardless of the use case, I don't
> think we can break the fundamental assumptions made for qdiscs or the
> rest of the transmit path. If XDP must transmit on a queue shared with
> the stack we need to abide by the stack's rules for transmitting on
> the queue-- which would mean alloc skbuff and go through qdisc (which

If we require XDP_TX to go up to qdisc layer its best not to implement it at all and just handle it in normal ingress path. That said I think users have to expect that XDP will interfere with qdisc schemes. Even with its own tx queue its going to interfere at the hardware level with bandwidth as the hardware round robins through the queues or uses whatever hardware strategy it is configured to use. Additionally it will bypass things like BQL, etc.
> really shouldn't be difficult to implement). Emulating various
> functions of the stack in the XDP TX path, like this patch seems to be
> doing for XMIT_MORE, potentially gets us into a whack-a-mole situation
> trying to keep things coherent.

I think bundling tx xmits is fair game as an internal optimization and doesn't need to be exposed at the XDP layer. Drivers already do this type of optimization for allocating buffers. It likely doesn't matter much at the e1k level, but my gut feeling is that doing a tail update on every pkt with the 40gbps drivers will be noticeable.
>
>> Existing stack with skb is perfectly fine as it is.
>> No need to do recycling, batching or any other complex things.
>> xdp for e1k cannot be used as an example for other drivers either,
>> since there is only one tx ring and any high performance adapter
>> has more which makes the driver support quite different.
>>
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov wrote:
> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend wrote:
>> > On 16-09-09 06:04 PM, Tom Herbert wrote:
>> >> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend wrote:
>> >>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote:
>> > e1000 supports a single TX queue so it is being shared with the stack
>> > when XDP runs XDP_TX action. This requires taking the xmit lock to
>> > ensure we don't corrupt the tx ring. To avoid taking and dropping the
>> > lock per packet this patch adds a bundling implementation to submit
>> > a bundle of packets to the xmit routine.
>> >
>> > I tested this patch running e1000 in a VM using KVM over a tap
>> > device using pktgen to generate traffic along with 'ping -f -l 100'.
>> >
>> Hi John,
>>
>> How does this interact with BQL on e1000?
>>
>> Tom
>>
>> >>>
>> >>> Let me check if I have the API correct. When we enqueue a packet to
>> >>> be sent we must issue a netdev_sent_queue() call and then on actual
>> >>> transmission issue a netdev_completed_queue().
>> >>>
>> >>> The patch attached here missed a few things though.
>> >>>
>> >>> But it looks like I just need to call netdev_sent_queue() from the
>> >>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>> >>> kick in which will call netdev_completed_queue() correctly.
>> >>>
>> >>> I'll need to add a check for the queue state as well. So if I do these
>> >>> three things,
>> >>>
>> >>> check __QUEUE_STATE_XOFF before sending
>> >>> netdev_sent_queue() -> on XDP_TX
>> >>> netdev_completed_queue()
>> >>>
>> >>> It should work, agree? Now should we do this even when XDP owns the
>> >>> queue? Or is this purely an issue with sharing the queue between
>> >>> XDP and stack.
>> >>>
>> >> But what is the action for XDP_TX if the queue is stopped? There is no
>> >> qdisc to back pressure in the XDP path. Would we just start dropping
>> >> packets then?
>> >
>> > Yep that is what the patch does if there is any sort of error: packets
>> > get dropped on the floor. I don't think there is anything else that
>> > can be done.
>> >
>> That probably means that the stack will always win out under load.
>> Trying to use the same queue where half of the packets are well
>> managed by a qdisc and half aren't is going to leave someone unhappy.
>> Maybe in this case where we have to share the qdisc we can
>> allocate the skb on returning XDP_TX and send through the normal
>> qdisc for the device.
>
> I wouldn't go to such extremes for e1k.
> The only reason to have xdp in e1k is to use it for testing
> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.

I imagine someone may want this for the non-forwarding use cases like early drop for DOS mitigation. Regardless of the use case, I don't think we can break the fundamental assumptions made for qdiscs or the rest of the transmit path. If XDP must transmit on a queue shared with the stack we need to abide by the stack's rules for transmitting on the queue-- which would mean alloc skbuff and go through qdisc (which really shouldn't be difficult to implement). Emulating various functions of the stack in the XDP TX path, like this patch seems to be doing for XMIT_MORE, potentially gets us into a whack-a-mole situation trying to keep things coherent.
> Existing stack with skb is perfectly fine as it is.
> No need to do recycling, batching or any other complex things.
> xdp for e1k cannot be used as an example for other drivers either,
> since there is only one tx ring and any high performance adapter
> has more which makes the driver support quite different.
>
Re: [PATCH net-next 0/4] Some BPF helper cleanups
From: Daniel Borkmann
Date: Fri, 9 Sep 2016 02:45:27 +0200

> This series contains a couple of misc cleanups and improvements
> for BPF helpers. For details please see individual patches. We
> let this also sit for a few days with Fengguang's kbuild test
> robot, and there were no issues seen (besides one false positive,
> see last one for details).

Series applied, thanks Daniel.
Re: [PATCH net-next] ip_tunnel: do not clear l4 hashes
From: Eric Dumazet
Date: Thu, 08 Sep 2016 15:40:48 -0700

> From: Eric Dumazet
>
> If skb has a valid l4 hash, there is no point clearing the hash and forcing
> a further flow dissection when a tunnel encapsulation is added.
>
> Signed-off-by: Eric Dumazet

Applied.
Re: [PATCH] drivers: net: phy: mdio-xgene: Add hardware dependency
From: Jean Delvare
Date: Thu, 8 Sep 2016 16:25:15 +0200

> The mdio-xgene driver is only useful on X-Gene SoC.
>
> Signed-off-by: Jean Delvare

Applied.
Re: [PATCH] ATM-ForeRunnerHE: Use kmalloc_array() in he_init_group()
From: SF Markus Elfring
Date: Thu, 8 Sep 2016 15:50:05 +0200

> From: Markus Elfring
> Date: Thu, 8 Sep 2016 15:43:37 +0200
>
> * Multiplications for the size determination of memory allocations
> indicated that array data structures should be processed.
> Thus use the corresponding function "kmalloc_array".
>
> This issue was detected by using the Coccinelle software.
>
> * Replace the specification of data types by pointer dereferences
> to make the corresponding size determination a bit safer according to
> the Linux coding style convention.
>
> Signed-off-by: Markus Elfring

Applied.
Re: [PATCH] ATM-ENI: Use kmalloc_array() in eni_start()
From: SF Markus Elfring
Date: Thu, 8 Sep 2016 14:40:06 +0200

> From: Markus Elfring
> Date: Thu, 8 Sep 2016 14:20:17 +0200
>
> * A multiplication for the size determination of a memory allocation
> indicated that an array data structure should be processed.
> Thus use the corresponding function "kmalloc_array".
>
> This issue was detected by using the Coccinelle software.
>
> * Replace the specification of a data structure by a pointer dereference
> to make the corresponding size determination a bit safer according to
> the Linux coding style convention.
>
> Signed-off-by: Markus Elfring

Applied to net-next.
Re: [PATCH net-next 0/7] rxrpc: Rewrite data and ack handling
From: David Howells
Date: Thu, 08 Sep 2016 12:43:28 +0100

> This patch set constitutes the main portion of the AF_RXRPC rewrite. It
> consists of five fix/helper patches:
 ...
> And then there are two patches that form the main part:
 ...
> With this, the majority of the AF_RXRPC rewrite is complete.
 ...
> Tagged thusly:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> rxrpc-rewrite-20160908

Pulled, however I personally would have tried to split patch #7 up a
bit, it was really huge and hard to audit/review in any meaningful way.
Re: pull-request: wireless-drivers 2016-09-08
From: Kalle Valo
Date: Thu, 08 Sep 2016 14:31:56 +0300

> The following changes since commit bb87f02b7e4ccdb614a83cbf840524de81e9b321:
>
> Merge ath-current from ath.git (2016-08-29 21:39:04 +0300)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git
> tags/wireless-drivers-for-davem-2016-09-08

Pulled, thanks Kalle.
Re: [PATCH net] dwc_eth_qos: do not register semi-initialized device
From: Lars Persson
Date: Thu, 8 Sep 2016 13:24:21 +0200

> We move register_netdev() to the end of dwceqos_probe() to close any
> races where the netdev callbacks are called before the initialization
> has finished.
>
> Reported-by: Pavel Andrianov
> Signed-off-by: Lars Persson

Applied.
Re: [PATCH net] sctp: identify chunks that need to be fragmented at IP level
From: Xin Long
Date: Thu, 8 Sep 2016 17:54:11 +0800

> From: Marcelo Ricardo Leitner
>
> Previously, without GSO, it was easy to identify it: if the chunk didn't
> fit and there was no data chunk in the packet yet, we could fragment at
> IP level. So if there was an auth chunk and we were bundling a big data
> chunk, it would fragment regardless of the size of the auth chunk. This
> also works for the context of PMTU reductions.
>
> But with GSO, we cannot distinguish such PMTU events anymore, as the
> packet is allowed to exceed PMTU.
>
> So we need another check: to ensure that the chunk that we are adding
> actually fits the current PMTU. If it doesn't, trigger a flush and let
> it be fragmented at IP level in the next round.
>
> Signed-off-by: Marcelo Ricardo Leitner

Applied.
Re: [PATCH net] sctp: hold the transport before using it in sctp_hash_cmp
From: Xin Long
Date: Thu, 8 Sep 2016 17:49:04 +0800

> Now sctp uses the transport without holding it in sctp_hash_cmp,
> which can cause a use-after-free panic: after it gets the transport
> from the hashtable, another CPU may free it, and then the members it
> accesses may be unavailable memory.
>
> This patch is to use sctp_transport_hold, which checks the
> refcnt first and holds it only if it's not 0.
>
> Signed-off-by: Xin Long

Please add more detail to the commit message and add a proper "Fixes: "
tag right before your signoff.

Thanks.
Re: [PATCH] softirq: fix tasklet_kill() and its users
Ping !!

On 8/24/2016 6:52 PM, Santosh Shilimkar wrote:

Semantically the expectation from the tasklet init/kill API should be as below:

tasklet_init() == Init and enable scheduling
tasklet_kill() == Disable scheduling and destroy

tasklet_init() exhibits the above behavior but tasklet_kill() does not. The tasklet handler can still get scheduled and run even after tasklet_kill(). There are a few places where drivers are working around this issue by calling tasklet_disable(), which adds a usecount and thereby avoids the handler being called. tasklet_enable()/tasklet_disable() is a paired API and is expected to be used together. Using tasklet_disable() *just* to work around tasklet scheduling after kill is probably not the correct and intended use of the API.

We also happened to see a similar issue where in the shutdown path the tasklet handler was getting called even after tasklet_kill(). We fix this by making sure tasklet_kill() does the right thing, thereby ensuring the tasklet handler won't run after tasklet_kill(), with a very simple change. The patch fixes the tasklet code and also a few driver workarounds.

Cc: Greg Kroah-Hartman
Cc: Andrew Morton
Cc: Thomas Gleixner
Cc: Tadeusz Struk
Cc: Herbert Xu
Cc: "David S. Miller"
Cc: Paul Bolle
Cc: Giovanni Cabiddu
Cc: Salvatore Benedetto
Cc: Karsten Keil
Cc: "Peter Zijlstra (Intel)"
Signed-off-by: Santosh Shilimkar
---
Removed RFC tag from last post and dropped atmel serial driver which
seems to have been fixed in 4.8 https://lkml.org/lkml/2016/8/7/7

 drivers/crypto/qat/qat_common/adf_isr.c    | 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c           | 1 -
 kernel/softirq.c                           | 7 ---
 5 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 	int i;

 	for (i = 0; i < hw_data->num_banks; i++) {
-		tasklet_disable(&priv_data->banks[i].resp_handler);
 		tasklet_kill(&priv_data->banks[i].resp_handler);
 	}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c b/drivers/crypto/qat/qat_common/adf_sriov.c
index 9320ae1..bc7c2fa 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
 	}

 	for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-		tasklet_disable(&vf->vf2pf_bh_tasklet);
 		tasklet_kill(&vf->vf2pf_bh_tasklet);
 		mutex_destroy(&vf->pf2vf_lock);
 	}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index bf99e11..6e38bff 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-	tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
 	tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
 	mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
 	struct adf_etr_data *priv_data = accel_dev->transport;
-	tasklet_disable(&priv_data->banks[0].resp_handler);
 	tasklet_kill(&priv_data->banks[0].resp_handler);
 }
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
 	if (!drv->have_tty)
 		return;
-	tasklet_disable(&cs->if_wake_tasklet);
 	tasklet_kill(&cs->if_wake_tasklet);
 	cs->tty_dev = NULL;
 	tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
 		list = list->next;

 		if (tasklet_trylock(t)) {
-			if (!atomic_read(&t->count)) {
+			if (atomic_read(&t->count) == 1) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend wrote:
> > On 16-09-09 06:04 PM, Tom Herbert wrote:
> >> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend wrote:
> >>> On 16-09-09 04:44 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote:
> > e1000 supports a single TX queue so it is being shared with the stack
> > when XDP runs XDP_TX action. This requires taking the xmit lock to
> > ensure we don't corrupt the tx ring. To avoid taking and dropping the
> > lock per packet this patch adds a bundling implementation to submit
> > a bundle of packets to the xmit routine.
> >
> > I tested this patch running e1000 in a VM using KVM over a tap
> > device using pktgen to generate traffic along with 'ping -f -l 100'.
> >
> Hi John,
>
> How does this interact with BQL on e1000?
>
> Tom
>
> >>>
> >>> Let me check if I have the API correct. When we enqueue a packet to
> >>> be sent we must issue a netdev_sent_queue() call and then on actual
> >>> transmission issue a netdev_completed_queue().
> >>>
> >>> The patch attached here missed a few things though.
> >>>
> >>> But it looks like I just need to call netdev_sent_queue() from the
> >>> e1000_xmit_raw_frame() routine and then let the tx completion logic
> >>> kick in which will call netdev_completed_queue() correctly.
> >>>
> >>> I'll need to add a check for the queue state as well. So if I do these
> >>> three things,
> >>>
> >>> check __QUEUE_STATE_XOFF before sending
> >>> netdev_sent_queue() -> on XDP_TX
> >>> netdev_completed_queue()
> >>>
> >>> It should work, agree? Now should we do this even when XDP owns the
> >>> queue? Or is this purely an issue with sharing the queue between
> >>> XDP and stack.
> >>>
> >> But what is the action for XDP_TX if the queue is stopped? There is no
> >> qdisc to back pressure in the XDP path. Would we just start dropping
> >> packets then?
> >
> > Yep that is what the patch does if there is any sort of error: packets
> > get dropped on the floor. I don't think there is anything else that
> > can be done.
> >
> That probably means that the stack will always win out under load.
> Trying to use the same queue where half of the packets are well
> managed by a qdisc and half aren't is going to leave someone unhappy.
> Maybe in this case where we have to share the qdisc we can
> allocate the skb on returning XDP_TX and send through the normal
> qdisc for the device.

I wouldn't go to such extremes for e1k. The only reason to have xdp in e1k is to use it for testing of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.

Existing stack with skb is perfectly fine as it is. No need to do recycling, batching or any other complex things. xdp for e1k cannot be used as an example for other drivers either, since there is only one tx ring and any high performance adapter has more which makes the driver support quite different.
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend wrote:
> On 16-09-09 06:04 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend wrote:
>>> On 16-09-09 04:44 PM, Tom Herbert wrote:
On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
Hi John,

How does this interact with BQL on e1000?

Tom
>>>
>>> Let me check if I have the API correct. When we enqueue a packet to
>>> be sent we must issue a netdev_sent_queue() call and then on actual
>>> transmission issue a netdev_completed_queue().
>>>
>>> The patch attached here missed a few things though.
>>>
>>> But it looks like I just need to call netdev_sent_queue() from the
>>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>>> kick in which will call netdev_completed_queue() correctly.
>>>
>>> I'll need to add a check for the queue state as well. So if I do these
>>> three things,
>>>
>>> check __QUEUE_STATE_XOFF before sending
>>> netdev_sent_queue() -> on XDP_TX
>>> netdev_completed_queue()
>>>
>>> It should work, agree? Now should we do this even when XDP owns the
>>> queue? Or is this purely an issue with sharing the queue between
>>> XDP and stack.
>>>
>> But what is the action for XDP_TX if the queue is stopped? There is no
>> qdisc to back pressure in the XDP path. Would we just start dropping
>> packets then?
>
> Yep that is what the patch does if there is any sort of error: packets
> get dropped on the floor. I don't think there is anything else that
> can be done.
>
That probably means that the stack will always win out under load. Trying to use the same queue where half of the packets are well managed by a qdisc and half aren't is going to leave someone unhappy. Maybe in this case where we have to share the qdisc we can allocate the skb on returning XDP_TX and send through the normal qdisc for the device.

Tom
>>
>> Tom
>>
>>> .John
>>>
>
Re: [PATCH] via-velocity: remove null pointer check on array tdinfo->skb_dma
From: Colin King
Date: Thu, 8 Sep 2016 10:04:24 +0100

> From: Colin Ian King
>
> tdinfo->skb_dma is a 7 element array of dma_addr_t hence cannot be
> null, so the null pointer check on tdinfo->skb_dma is redundant.
> Remove it.
>
> Signed-off-by: Colin Ian King

Applied, thanks Colin.
Re: [PATCH] qede: mark qede_set_features() static
From: Baoyou Xie
Date: Thu, 8 Sep 2016 16:43:23 +0800

> We get 1 warning when building the kernel with W=1:
> drivers/net/ethernet/qlogic/qede/qede_main.c:2113:5: warning: no previous
> prototype for 'qede_set_features' [-Wmissing-prototypes]
>
> In fact, this function is only used in the file in which it is
> declared and doesn't need a declaration, but can be made static.
> So this patch marks this function with 'static'.
>
> Signed-off-by: Baoyou Xie

Applied.
Re: [PATCH net-next 1/1] net: phy: Fixed checkpatch errors for Microsemi PHYs.
From: Raju Lakkaraju
Date: Thu, 8 Sep 2016 14:09:31 +0530

> From: Raju Lakkaraju
>
> The existing VSC85xx PHY driver did not follow the coding style and caused
> "checkpatch" to complain. This commit fixes this.
>
> Signed-off-by: Raju Lakkaraju

Applied.
Re: [PATCH] net: x25: remove null checks on arrays calling_ae and called_ae
From: Colin King
Date: Thu, 8 Sep 2016 08:42:06 +0100

> From: Colin Ian King
>
> dtefacs.calling_ae and called_ae are both 20 element __u8 arrays and
> cannot be null, hence the checks are redundant. Remove them.
>
> Signed-off-by: Colin Ian King

Indeed, and if they were pointers they would be in userspace and would
need proper uaccess handling.

Applied to net-next, thanks.
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On 16-09-09 06:04 PM, Tom Herbert wrote: > On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend> wrote: >> On 16-09-09 04:44 PM, Tom Herbert wrote: >>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend >>> wrote: e1000 supports a single TX queue so it is being shared with the stack when XDP runs XDP_TX action. This requires taking the xmit lock to ensure we don't corrupt the tx ring. To avoid taking and dropping the lock per packet this patch adds a bundling implementation to submit a bundle of packets to the xmit routine. I tested this patch running e1000 in a VM using KVM over a tap device using pktgen to generate traffic along with 'ping -f -l 100'. >>> Hi John, >>> >>> How does this interact with BQL on e1000? >>> >>> Tom >>> >> >> Let me check if I have the API correct. When we enqueue a packet to >> be sent we must issue a netdev_sent_queue() call and then on actual >> transmission issue a netdev_completed_queue(). >> >> The patch attached here missed a few things though. >> >> But it looks like I just need to call netdev_sent_queue() from the >> e1000_xmit_raw_frame() routine and then let the tx completion logic >> kick in which will call netdev_completed_queue() correctly. >> >> I'll need to add a check for the queue state as well. So if I do these >> three things, >> >> check __QUEUE_STATE_XOFF before sending >> netdev_sent_queue() -> on XDP_TX >> netdev_completed_queue() >> >> It should work agree? Now should we do this even when XDP owns the >> queue? Or is this purely an issue with sharing the queue between >> XDP and stack. >> > But what is the action for XDP_TX if the queue is stopped? There is no > qdisc to back pressure in the XDP path. Would we just start dropping > packets then? Yep that is what the patch does if there is any sort of error packets get dropped on the floor. I don't think there is anything else that can be done. > > Tom > >> .John >>
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 9, 2016 at 5:01 PM, John Fastabendwrote: > On 16-09-09 04:44 PM, Tom Herbert wrote: >> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend >> wrote: >>> e1000 supports a single TX queue so it is being shared with the stack >>> when XDP runs XDP_TX action. This requires taking the xmit lock to >>> ensure we don't corrupt the tx ring. To avoid taking and dropping the >>> lock per packet this patch adds a bundling implementation to submit >>> a bundle of packets to the xmit routine. >>> >>> I tested this patch running e1000 in a VM using KVM over a tap >>> device using pktgen to generate traffic along with 'ping -f -l 100'. >>> >> Hi John, >> >> How does this interact with BQL on e1000? >> >> Tom >> > > Let me check if I have the API correct. When we enqueue a packet to > be sent we must issue a netdev_sent_queue() call and then on actual > transmission issue a netdev_completed_queue(). > > The patch attached here missed a few things though. > > But it looks like I just need to call netdev_sent_queue() from the > e1000_xmit_raw_frame() routine and then let the tx completion logic > kick in which will call netdev_completed_queue() correctly. > > I'll need to add a check for the queue state as well. So if I do these > three things, > > check __QUEUE_STATE_XOFF before sending > netdev_sent_queue() -> on XDP_TX > netdev_completed_queue() > > It should work agree? Now should we do this even when XDP owns the > queue? Or is this purely an issue with sharing the queue between > XDP and stack. > But what is the action for XDP_TX if the queue is stopped? There is no qdisc to back pressure in the XDP path. Would we just start dropping packets then? Tom > .John >
[PATCH -next] tipc: fix possible memory leak in tipc_udp_enable()
From: Wei Yongjun

'ub' is malloced in tipc_udp_enable() and should be freed before leaving from the error handling cases, otherwise it will cause a memory leak. Fixes: ba5aa84a2d22 ("tipc: split UDP nl address parsing") Signed-off-by: Wei Yongjun --- net/tipc/udp_media.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index dd27468..d80cd3f 100644 --- a/net/tipc/udp_media.c +++ b/net/tipc/udp_media.c @@ -665,7 +665,8 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, if (!opts[TIPC_NLA_UDP_LOCAL] || !opts[TIPC_NLA_UDP_REMOTE]) { pr_err("Invalid UDP bearer configuration"); - return -EINVAL; + err = -EINVAL; + goto err; } err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_LOCAL], ,
Re: [PATCH 8/9] selftests: move vDSO tests from Documentation/vDSO
Hi Shuah, [auto build test ERROR on linus/master] [also build test ERROR on v4.8-rc5 next-20160909] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538 config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> scripts/Makefile.build:44: Documentation/vDSO/Makefile: No such file or >> directory >> make[3]: *** No rule to make target 'Documentation/vDSO/Makefile'. make[3]: Failed to remake makefile 'Documentation/vDSO/Makefile'. vim +44 scripts/Makefile.build f77bf0142 Sam Ravnborg 2007-10-15 28 ldflags-y := d72e5edbf Sam Ravnborg 2007-05-28 29 720097d89 Sam Ravnborg 2009-04-19 30 subdir-asflags-y := 720097d89 Sam Ravnborg 2009-04-19 31 subdir-ccflags-y := 720097d89 Sam Ravnborg 2009-04-19 32 3156fd052 Robert P. J. Day 2008-02-18 33 # Read auto.conf if it exists, otherwise ignore c955ccafc Roman Zippel 2006-06-08 34 -include include/config/auto.conf ^1da177e4 Linus Torvalds 2005-04-16 35 20a468b51 Sam Ravnborg 2006-01-22 36 include scripts/Kbuild.include 20a468b51 Sam Ravnborg 2006-01-22 37 3156fd052 Robert P. J. 
Day 2008-02-18 38 # For backward compatibility check that these variables do not change 0c53c8e6e Sam Ravnborg 2007-10-14 39 save-cflags := $(CFLAGS) 0c53c8e6e Sam Ravnborg 2007-10-14 40 2a6914703 Sam Ravnborg 2005-07-25 41 # The filename Kbuild has precedence over Makefile db8c1a7b2 Sam Ravnborg 2005-07-27 42 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) 0c53c8e6e Sam Ravnborg 2007-10-14 43 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) 0c53c8e6e Sam Ravnborg 2007-10-14 @44 include $(kbuild-file) ^1da177e4 Linus Torvalds 2005-04-16 45 0c53c8e6e Sam Ravnborg 2007-10-14 46 # If the save-* variables changed error out 0c53c8e6e Sam Ravnborg 2007-10-14 47 ifeq ($(KBUILD_NOPEDANTIC),) 0c53c8e6e Sam Ravnborg 2007-10-14 48 ifneq ("$(save-cflags)","$(CFLAGS)") 49c57d254 Arnaud Lacombe 2011-08-15 49 $(error CFLAGS was changed in "$(kbuild-file)". Fix it to use ccflags-y) 0c53c8e6e Sam Ravnborg 2007-10-14 50 endif 0c53c8e6e Sam Ravnborg 2007-10-14 51 endif 4a5838ad9 Borislav Petkov 2011-03-01 52 :: The code at line 44 was first introduced by commit :: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS :: TO: Sam Ravnborg <sam@neptun.(none)> :: CC: Sam Ravnborg <sam@neptun.(none)> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH 6/9] selftests: move ptp tests from Documentation/ptp
Hi Shuah, [auto build test ERROR on linus/master] [also build test ERROR on v4.8-rc5 next-20160909] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538 config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> scripts/Makefile.build:44: Documentation/ptp/Makefile: No such file or >> directory >> make[3]: *** No rule to make target 'Documentation/ptp/Makefile'. make[3]: Failed to remake makefile 'Documentation/ptp/Makefile'. vim +44 scripts/Makefile.build 3156fd052 Robert P. J. 
Day 2008-02-18 38 # For backward compatibility check that these variables do not change 0c53c8e6e Sam Ravnborg 2007-10-14 39 save-cflags := $(CFLAGS) 0c53c8e6e Sam Ravnborg 2007-10-14 40 2a6914703 Sam Ravnborg 2005-07-25 41 # The filename Kbuild has precedence over Makefile db8c1a7b2 Sam Ravnborg 2005-07-27 42 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) 0c53c8e6e Sam Ravnborg 2007-10-14 43 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) 0c53c8e6e Sam Ravnborg 2007-10-14 @44 include $(kbuild-file) ^1da177e4 Linus Torvalds 2005-04-16 45 0c53c8e6e Sam Ravnborg 2007-10-14 46 # If the save-* variables changed error out 0c53c8e6e Sam Ravnborg 2007-10-14 47 ifeq ($(KBUILD_NOPEDANTIC),) :: The code at line 44 was first introduced by commit :: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS :: TO: Sam Ravnborg <sam@neptun.(none)> :: CC: Sam Ravnborg <sam@neptun.(none)> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH 4/9] selftests: move prctl tests from Documentation/prctl
Hi Shuah, [auto build test ERROR on linus/master] [also build test ERROR on v4.8-rc5 next-20160909] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538 config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> scripts/Makefile.build:44: Documentation/prctl/Makefile: No such file or >> directory make[3]: *** No rule to make target 'Documentation/prctl/Makefile'. make[3]: Failed to remake makefile 'Documentation/prctl/Makefile'. vim +44 scripts/Makefile.build 3156fd052 Robert P. J. 
Day 2008-02-18 38 # For backward compatibility check that these variables do not change 0c53c8e6e Sam Ravnborg 2007-10-14 39 save-cflags := $(CFLAGS) 0c53c8e6e Sam Ravnborg 2007-10-14 40 2a6914703 Sam Ravnborg 2005-07-25 41 # The filename Kbuild has precedence over Makefile db8c1a7b2 Sam Ravnborg 2005-07-27 42 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) 0c53c8e6e Sam Ravnborg 2007-10-14 43 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) 0c53c8e6e Sam Ravnborg 2007-10-14 @44 include $(kbuild-file) ^1da177e4 Linus Torvalds 2005-04-16 45 0c53c8e6e Sam Ravnborg 2007-10-14 46 # If the save-* variables changed error out 0c53c8e6e Sam Ravnborg 2007-10-14 47 ifeq ($(KBUILD_NOPEDANTIC),) :: The code at line 44 was first introduced by commit :: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS :: TO: Sam Ravnborg <sam@neptun.(none)> :: CC: Sam Ravnborg <sam@neptun.(none)> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH 1/9] selftests: move dnotify_test from Documentation/filesystems
Hi Shuah, [auto build test ERROR on linus/master] [also build test ERROR on v4.8-rc5 next-20160909] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538 config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> scripts/Makefile.build:44: Documentation/filesystems/Makefile: No such file >> or directory >> make[3]: *** No rule to make target 'Documentation/filesystems/Makefile'. make[3]: Failed to remake makefile 'Documentation/filesystems/Makefile'. vim +44 scripts/Makefile.build f77bf0142 Sam Ravnborg 2007-10-15 28 ldflags-y := d72e5edbf Sam Ravnborg 2007-05-28 29 720097d89 Sam Ravnborg 2009-04-19 30 subdir-asflags-y := 720097d89 Sam Ravnborg 2009-04-19 31 subdir-ccflags-y := 720097d89 Sam Ravnborg 2009-04-19 32 3156fd052 Robert P. J. Day 2008-02-18 33 # Read auto.conf if it exists, otherwise ignore c955ccafc Roman Zippel 2006-06-08 34 -include include/config/auto.conf ^1da177e4 Linus Torvalds 2005-04-16 35 20a468b51 Sam Ravnborg 2006-01-22 36 include scripts/Kbuild.include 20a468b51 Sam Ravnborg 2006-01-22 37 3156fd052 Robert P. J. 
Day 2008-02-18 38 # For backward compatibility check that these variables do not change 0c53c8e6e Sam Ravnborg 2007-10-14 39 save-cflags := $(CFLAGS) 0c53c8e6e Sam Ravnborg 2007-10-14 40 2a6914703 Sam Ravnborg 2005-07-25 41 # The filename Kbuild has precedence over Makefile db8c1a7b2 Sam Ravnborg 2005-07-27 42 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) 0c53c8e6e Sam Ravnborg 2007-10-14 43 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) 0c53c8e6e Sam Ravnborg 2007-10-14 @44 include $(kbuild-file) ^1da177e4 Linus Torvalds 2005-04-16 45 0c53c8e6e Sam Ravnborg 2007-10-14 46 # If the save-* variables changed error out 0c53c8e6e Sam Ravnborg 2007-10-14 47 ifeq ($(KBUILD_NOPEDANTIC),) 0c53c8e6e Sam Ravnborg 2007-10-14 48 ifneq ("$(save-cflags)","$(CFLAGS)") 49c57d254 Arnaud Lacombe 2011-08-15 49 $(error CFLAGS was changed in "$(kbuild-file)". Fix it to use ccflags-y) 0c53c8e6e Sam Ravnborg 2007-10-14 50 endif 0c53c8e6e Sam Ravnborg 2007-10-14 51 endif 4a5838ad9 Borislav Petkov 2011-03-01 52 :: The code at line 44 was first introduced by commit :: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS :: TO: Sam Ravnborg <sam@neptun.(none)> :: CC: Sam Ravnborg <sam@neptun.(none)> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On 16-09-09 04:44 PM, Tom Herbert wrote: > On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend wrote: >> e1000 supports a single TX queue so it is being shared with the stack >> when XDP runs XDP_TX action. This requires taking the xmit lock to >> ensure we don't corrupt the tx ring. To avoid taking and dropping the >> lock per packet this patch adds a bundling implementation to submit >> a bundle of packets to the xmit routine. >> >> I tested this patch running e1000 in a VM using KVM over a tap >> device using pktgen to generate traffic along with 'ping -f -l 100'. >> > Hi John, > > How does this interact with BQL on e1000? > > Tom > Let me check if I have the API correct. When we enqueue a packet to be sent we must issue a netdev_sent_queue() call and then on actual transmission issue a netdev_completed_queue(). The patch attached here missed a few things though. But it looks like I just need to call netdev_sent_queue() from the e1000_xmit_raw_frame() routine and then let the tx completion logic kick in which will call netdev_completed_queue() correctly. I'll need to add a check for the queue state as well. So if I do these three things, check __QUEUE_STATE_XOFF before sending netdev_sent_queue() -> on XDP_TX netdev_completed_queue() It should work, agree? Now should we do this even when XDP owns the queue? Or is this purely an issue with sharing the queue between XDP and stack. .John
Re: [PATCH 4/9] selftests: move prctl tests from Documentation/prctl
Hi Shuah, [auto build test ERROR on linus/master] [also build test ERROR on v4.8-rc5 next-20160909] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538 config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): scripts/Makefile.clean:14: Documentation/prctl/Makefile: No such file or directory >> make[3]: *** No rule to make target 'Documentation/prctl/Makefile'. make[3]: Failed to remake makefile 'Documentation/prctl/Makefile'. make[2]: *** [Documentation/prctl] Error 2 scripts/Makefile.clean:14: Documentation/filesystems/Makefile: No such file or directory make[3]: *** No rule to make target 'Documentation/filesystems/Makefile'. make[3]: Failed to remake makefile 'Documentation/filesystems/Makefile'. make[2]: *** [Documentation/filesystems] Error 2 make[2]: Target '__clean' not remade because of errors. make[1]: *** [_clean_Documentation] Error 2 make[1]: Target 'distclean' not remade because of errors. make: *** [sub-make] Error 2 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [patch net 0/2] mlxsw: couple of fixes
From: Jiri Pirko Date: Thu, 8 Sep 2016 08:16:00 +0200 > Couple of fixes from Ido and myself. Series applied, thanks Jiri.
Re: [RFC] bridge: MAC learning uevents
On 09/09/2016 01:51 AM, D. Herrendoerfer wrote: >> just like neighbor table modifications, it should be possible to listen for >> events with netlink. Doing it through uevent is the wrong model. > > I agree partially - but consider: > we plug hardware - we get an event > we remove hardware - we get an event > we add a virtual interface - we get an event > we add a bridge - event > we add an interface to that bridge - event > a kvm guest starts using the interface on that bridge - we need to monitor > netlink, poll brforward, capture traffic Yes, because now there is network activity going on, so why not ask the networking stack to get these events? > > It seems inconsistent, bridge is already emitting events. It does not seem particularly inconsistent, all networking events are already emitted using rt netlink, why should bridge be different here? (Yes, uevent is netlink too, just a special family). -- Florian
Re: [PATCH net-next] macsec: set network devtype
From: Stephen Hemminger Date: Wed, 7 Sep 2016 14:07:32 -0700 > The netdevice type structure for macsec was being defined but never used. > To set the network device type the macro SET_NETDEV_DEVTYPE must be called. > Compile tested only, I don't use macsec. > > Signed-off-by: Stephen Hemminger Applied.
Re: [PATCH net-next] rtnetlink: remove unused ifla_stats_policy
From: Stephen Hemminger Date: Wed, 7 Sep 2016 13:57:36 -0700 > This structure is defined but never used. Flagged with W=1 > > Signed-off-by: Stephen Hemminger Applied.
Re: [PATCH net 0/2] ip: fix creation flags reported in RTM_NEWROUTE events
From: Guillaume Nault Date: Wed, 7 Sep 2016 17:18:50 +0200 > Netlink messages sent to user-space upon RTM_NEWROUTE events have their > nlmsg_flags field inconsistently set. While the NLM_F_REPLACE and > NLM_F_APPEND bits are correctly handled, NLM_F_CREATE and NLM_F_EXCL > are always 0. > > This series sets the NLM_F_CREATE and NLM_F_EXCL bits when applicable, > for IPv4 and IPv6. > > Since IPv6 ignores the NLM_F_APPEND flag in requests, this flag isn't > reported in RTM_NEWROUTE IPv6 events. This keeps IPv6 internal > consistency (same flag semantics for user requests and kernel events) at > the cost of bringing different flag interpretations for IPv4 and IPv6. I'm applying this series to net-next so that it has time to cook and expose anything in userland that might break due to these changes. I briefly considered applying this to net, but I think that is premature, at least for the time being. Thanks.
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On Fri, Sep 9, 2016 at 2:29 PM, John Fastabendwrote: > e1000 supports a single TX queue so it is being shared with the stack > when XDP runs XDP_TX action. This requires taking the xmit lock to > ensure we don't corrupt the tx ring. To avoid taking and dropping the > lock per packet this patch adds a bundling implementation to submit > a bundle of packets to the xmit routine. > > I tested this patch running e1000 in a VM using KVM over a tap > device using pktgen to generate traffic along with 'ping -f -l 100'. > Hi John, How does this interact with BQL on e1000? Tom > Suggested-by: Jesper Dangaard Brouer > Signed-off-by: John Fastabend > --- > drivers/net/ethernet/intel/e1000/e1000.h | 10 +++ > drivers/net/ethernet/intel/e1000/e1000_main.c | 81 > +++-- > 2 files changed, 71 insertions(+), 20 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h > b/drivers/net/ethernet/intel/e1000/e1000.h > index 5cf8a0a..877b377 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000.h > +++ b/drivers/net/ethernet/intel/e1000/e1000.h > @@ -133,6 +133,8 @@ struct e1000_adapter; > #define E1000_TX_QUEUE_WAKE16 > /* How many Rx Buffers do we bundle into one write to the hardware ? 
*/ > #define E1000_RX_BUFFER_WRITE 16 /* Must be power of 2 */ > +/* How many XDP XMIT buffers to bundle into one xmit transaction */ > +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE > > #define AUTO_ALL_MODES 0 > #define E1000_EEPROM_82544_APM 0x0004 > @@ -168,6 +170,11 @@ struct e1000_rx_buffer { > dma_addr_t dma; > }; > > +struct e1000_rx_buffer_bundle { > + struct e1000_rx_buffer *buffer; > + u32 length; > +}; > + > struct e1000_tx_ring { > /* pointer to the descriptor ring memory */ > void *desc; > @@ -206,6 +213,9 @@ struct e1000_rx_ring { > struct e1000_rx_buffer *buffer_info; > struct sk_buff *rx_skb_top; > > + /* array of XDP buffer information structs */ > + struct e1000_rx_buffer_bundle *xdp_buffer; > + > /* cpu for rx queue */ > int cpu; > > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c > b/drivers/net/ethernet/intel/e1000/e1000_main.c > index 91d5c87..b985271 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c > @@ -1738,10 +1738,18 @@ static int e1000_setup_rx_resources(struct > e1000_adapter *adapter, > struct pci_dev *pdev = adapter->pdev; > int size, desc_len; > > + size = sizeof(struct e1000_rx_buffer_bundle) * > + E1000_XDP_XMIT_BUNDLE_MAX; > + rxdr->xdp_buffer = vzalloc(size); > + if (!rxdr->xdp_buffer) > + return -ENOMEM; > + > size = sizeof(struct e1000_rx_buffer) * rxdr->count; > rxdr->buffer_info = vzalloc(size); > - if (!rxdr->buffer_info) > + if (!rxdr->buffer_info) { > + vfree(rxdr->xdp_buffer); > return -ENOMEM; > + } > > desc_len = sizeof(struct e1000_rx_desc); > > @@ -1754,6 +1762,7 @@ static int e1000_setup_rx_resources(struct > e1000_adapter *adapter, > GFP_KERNEL); > if (!rxdr->desc) { > setup_rx_desc_die: > + vfree(rxdr->xdp_buffer); > vfree(rxdr->buffer_info); > return -ENOMEM; > } > @@ -2087,6 +2096,9 @@ static void e1000_free_rx_resources(struct > e1000_adapter *adapter, > > e1000_clean_rx_ring(adapter, rx_ring); > > + 
vfree(rx_ring->xdp_buffer); > + rx_ring->xdp_buffer = NULL; > + > vfree(rx_ring->buffer_info); > rx_ring->buffer_info = NULL; > > @@ -3369,33 +3381,52 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring > *tx_ring, > } > > static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, > -unsigned int len, > -struct net_device *netdev, > -struct e1000_adapter *adapter) > +__u32 len, > +struct e1000_adapter *adapter, > +struct e1000_tx_ring *tx_ring) > { > - struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > - struct e1000_hw *hw = >hw; > - struct e1000_tx_ring *tx_ring; > - > if (len > E1000_MAX_DATA_PER_TXD) > return; > > + if (E1000_DESC_UNUSED(tx_ring) < 2) > + return; > + > + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); > + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); > +} > + > +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info, > + struct net_device *netdev, > + struct e1000_adapter *adapter) > +{ > +
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On 16-09-09 02:29 PM, John Fastabend wrote: > e1000 supports a single TX queue so it is being shared with the stack > when XDP runs XDP_TX action. This requires taking the xmit lock to > ensure we don't corrupt the tx ring. To avoid taking and dropping the > lock per packet this patch adds a bundling implementation to submit > a bundle of packets to the xmit routine. > > I tested this patch running e1000 in a VM using KVM over a tap > device using pktgen to generate traffic along with 'ping -f -l 100'. > > Suggested-by: Jesper Dangaard Brouer > Signed-off-by: John Fastabend > --- This patch is a bit bogus in a few spots as well... > - > - if (E1000_DESC_UNUSED(tx_ring) < 2) { > - HARD_TX_UNLOCK(netdev, txq); > - return; > + for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) { > + e1000_xmit_raw_frame(buffer_info[i].buffer, > + buffer_info[i].length, > + adapter, tx_ring); > + buffer_info[i].buffer->rxbuf.page = NULL; > + buffer_info[i].buffer = NULL; > + buffer_info[i].length = 0; > + i++;
Re: [net-next PATCH v2 1/2] e1000: add initial XDP support
On 16-09-09 03:04 PM, Eric Dumazet wrote: > On Fri, 2016-09-09 at 14:29 -0700, John Fastabend wrote: >> From: Alexei Starovoitov>> > > > So it looks like e1000_xmit_raw_frame() can return early, > say if there is no available descriptor. > >> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, >> + unsigned int len, >> + struct net_device *netdev, >> + struct e1000_adapter *adapter) >> +{ >> +struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); >> +struct e1000_hw *hw = >hw; >> +struct e1000_tx_ring *tx_ring; >> + >> +if (len > E1000_MAX_DATA_PER_TXD) >> +return; >> + >> +/* e1000 only support a single txq at the moment so the queue is being >> + * shared with stack. To support this requires locking to ensure the >> + * stack and XDP are not running at the same time. Devices with >> + * multiple queues should allocate a separate queue space. >> + */ >> +HARD_TX_LOCK(netdev, txq, smp_processor_id()); >> + >> +tx_ring = adapter->tx_ring; >> + >> +if (E1000_DESC_UNUSED(tx_ring) < 2) { >> +HARD_TX_UNLOCK(netdev, txq); >> +return; >> +} >> + >> +e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); >> +e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); >> + >> +writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt); >> +mmiowb(); >> + >> +HARD_TX_UNLOCK(netdev, txq); >> +} >> + >> #define NUM_REGS 38 /* 1 based count */ >> static void e1000_regdump(struct e1000_adapter *adapter) >> { >> @@ -4142,6 +4247,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct >> e1000_adapter *adapter, >> return skb; >> } >> +act = e1000_call_bpf(prog, page_address(p), length); >> +switch (act) { >> +case XDP_PASS: >> +break; >> +case XDP_TX: >> +dma_sync_single_for_device(>dev, >> + dma, >> + length, >> + DMA_TO_DEVICE); >> +e1000_xmit_raw_frame(buffer_info, length, >> + netdev, adapter); >> +buffer_info->rxbuf.page = NULL; > > > So I am trying to understand how pages are not leaked ? > > Pages are being leaked thanks! v3 coming soon.
[PATCH RFC 3/6] ila: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now call library function alloc_bucket_spinlocks. Signed-off-by: Tom Herbert --- net/ipv6/ila/ila_xlat.c | 36 +--- 1 file changed, 5 insertions(+), 31 deletions(-) diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c index e604013..7d1c34b 100644 --- a/net/ipv6/ila/ila_xlat.c +++ b/net/ipv6/ila/ila_xlat.c @@ -30,34 +30,6 @@ struct ila_net { bool hooks_registered; }; -#define LOCKS_PER_CPU 10 - -static int alloc_ila_locks(struct ila_net *ilan) -{ - unsigned int i, size; - unsigned int nr_pcpus = num_possible_cpus(); - - nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL); - size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU); - - if (sizeof(spinlock_t) != 0) { -#ifdef CONFIG_NUMA - if (size * sizeof(spinlock_t) > PAGE_SIZE) - ilan->locks = vmalloc(size * sizeof(spinlock_t)); - else -#endif - ilan->locks = kmalloc_array(size, sizeof(spinlock_t), - GFP_KERNEL); - if (!ilan->locks) - return -ENOMEM; - for (i = 0; i < size; i++) - spin_lock_init(&ilan->locks[i]); - } - ilan->locks_mask = size - 1; - - return 0; -} - static u32 hashrnd __read_mostly; static __always_inline void __ila_hash_secret_init(void) { @@ -561,14 +533,16 @@ static const struct genl_ops ila_nl_ops[] = { }, }; -#define ILA_HASH_TABLE_SIZE 1024 +#define LOCKS_PER_CPU 10 +#define MAX_LOCKS 1024 static __net_init int ila_init_net(struct net *net) { int err; struct ila_net *ilan = net_generic(net, ila_net_id); - err = alloc_ila_locks(ilan); + err = alloc_bucket_spinlocks(&ilan->locks, &ilan->locks_mask, +MAX_LOCKS, LOCKS_PER_CPU, GFP_KERNEL); if (err) return err; @@ -583,7 +557,7 @@ static __net_exit void ila_exit_net(struct net *net) rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL); - kvfree(ilan->locks); + free_bucket_spinlocks(ilan->locks); if (ilan->hooks_registered) nf_unregister_net_hooks(net, ila_nf_hook_ops, -- 2.8.0.rc2
[PATCH RFC 6/6] ila: Resolver mechanism
Implement an ILA resolver. This uses LWT to implement the hook to a userspace resolver and tracks pending unresolved address using the backend net resolver. The idea is that the kernel sets an ILA resolver route to the SIR prefix, something like: ip route add ::/64 encap ila-resolve \ via 2401:db00:20:911a::27:0 dev eth0 When a packet hits the route the address is looked up in a resolver table. If the entry is created (no entry with the address already exists) then an rtnl message is generated with group RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace daemon can listen for such messages and perform an ILA resolution protocol to determine the ILA mapping. If the mapping is resolved then a /128 ila encap router is set so that host can perform ILA translation and send directly to destination. Signed-off-by: Tom Herbert--- include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 5 ++ net/ipv6/Kconfig | 1 + net/ipv6/ila/Makefile | 2 +- net/ipv6/ila/ila.h | 16 net/ipv6/ila/ila_common.c | 7 ++ net/ipv6/ila/ila_lwt.c | 9 ++ net/ipv6/ila/ila_resolver.c| 192 + net/ipv6/ila/ila_xlat.c| 15 ++-- 9 files changed, 239 insertions(+), 9 deletions(-) create mode 100644 net/ipv6/ila/ila_resolver.c diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h index a478fe8..d880e49 100644 --- a/include/uapi/linux/lwtunnel.h +++ b/include/uapi/linux/lwtunnel.h @@ -9,6 +9,7 @@ enum lwtunnel_encap_types { LWTUNNEL_ENCAP_IP, LWTUNNEL_ENCAP_ILA, LWTUNNEL_ENCAP_IP6, + LWTUNNEL_ENCAP_ILA_NOTIFY, __LWTUNNEL_ENCAP_MAX, }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 262f037..271215f 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -144,6 +144,9 @@ enum { RTM_GETSTATS = 94, #define RTM_GETSTATS RTM_GETSTATS + RTM_ADDR_RESOLVE = 95, +#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; @@ -656,6 +659,8 @@ enum rtnetlink_groups { #define 
RTNLGRP_MPLS_ROUTE	RTNLGRP_MPLS_ROUTE
 	RTNLGRP_NSID,
 #define RTNLGRP_NSID	RTNLGRP_NSID
+	RTNLGRP_ILA_NOTIFY,
+#define RTNLGRP_ILA_NOTIFY	RTNLGRP_ILA_NOTIFY
 	__RTNLGRP_MAX
 };

 #define RTNLGRP_MAX	(__RTNLGRP_MAX - 1)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 2343e4f..cf3ea8e 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -97,6 +97,7 @@ config IPV6_ILA
 	tristate "IPv6: Identifier Locator Addressing (ILA)"
 	depends on NETFILTER
 	select LWTUNNEL
+	select NET_EXT_RESOLVER
 	---help---
 	  Support for IPv6 Identifier Locator Addressing (ILA).

diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile
index 4b32e59..f2aadc3 100644
--- a/net/ipv6/ila/Makefile
+++ b/net/ipv6/ila/Makefile
@@ -4,4 +4,4 @@
 obj-$(CONFIG_IPV6_ILA) += ila.o

-ila-objs := ila_common.o ila_lwt.o ila_xlat.o
+ila-objs := ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o

diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index e0170f6..e369611 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -23,6 +24,16 @@
 #include
 #include

+extern unsigned int ila_net_id;
+
+struct ila_net {
+	struct rhashtable rhash_table;
+	spinlock_t *locks; /* Bucket locks for entry manipulation */
+	unsigned int locks_mask;
+	bool hooks_registered;
+	struct net_rslv *nrslv;
+};
+
 struct ila_locator {
 	union {
 		__u8	v8[8];
@@ -114,9 +125,14 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct ila_params *p,

 void ila_init_saved_csum(struct ila_params *p);

+void ila_rslv_resolved(struct ila_net *ilan, struct ila_addr *iaddr);
 int ila_lwt_init(void);
 void ila_lwt_fini(void);
 int ila_xlat_init(void);
 void ila_xlat_fini(void);
+int ila_rslv_init(void);
+void ila_rslv_fini(void);
+int ila_init_resolver_net(struct ila_net *ilan);
+void ila_exit_resolver_net(struct ila_net *ilan);

 #endif /* __ILA_H */

diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index aba0998..83c7d4a 100644
--- a/net/ipv6/ila/ila_common.c
+++
b/net/ipv6/ila/ila_common.c
@@ -157,7 +157,13 @@ static int __init ila_init(void)
 	if (ret)
 		goto fail_xlat;

+	ret = ila_rslv_init();
+	if (ret)
+		goto fail_rslv;
+
 	return 0;

+fail_rslv:
+	ila_xlat_fini();
 fail_xlat:
 	ila_lwt_fini();
 fail_lwt:
@@ -168,6 +174,7 @@ static void __exit ila_fini(void)
 {
 	ila_xlat_fini();
 	ila_lwt_fini();
+	ila_rslv_fini();
 }

 module_init(ila_init);
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index
[PATCH RFC 5/6] net: Generic resolver backend
This patch implements the backend of a resolver; specifically, it provides a means to track unresolved addresses and to time them out. The resolver is mostly a frontend to an rhashtable where the key of the table is whatever address type or object is tracked. A resolver instance is created by net_rslv_create. A resolver is destroyed by net_rslv_destroy. There are two functions that are used to manipulate entries in the table: net_rslv_lookup_and_create and net_rslv_resolved.

net_rslv_lookup_and_create is called with an unresolved address as the argument. It returns a structure of type net_rslv_ent. When called, a lookup is performed to see if an entry for the address is already in the table; if it is, the entry is returned and false is returned in the "new" bool pointer argument to indicate that the entry was preexisting. If an entry is not found, one is created and true is returned in the "new" pointer argument. It is expected that when an entry is new the address resolution protocol is initiated (for instance, an RTM_ADDR_RESOLVE message may be sent to a userspace daemon, as we will do in ILA). If net_rslv_lookup_and_create returns NULL then presumably the hash table has reached the limit of the number of outstanding unresolved addresses; the caller should take appropriate action to avoid spamming the resolution protocol.

net_rslv_resolved is called when resolution is complete (e.g. an ILA locator mapping was instantiated for a locator). The entry is removed from the hash table.

An argument to net_rslv_create indicates a timeout for pending resolutions in milliseconds. If the timer fires before resolution then the entry is removed from the table. Subsequently, another attempt to resolve the same address will result in a new entry in the table.

net_rslv_lookup_and_create allocates a net_rslv_ent struct and includes allocating related user data. This is the object[] field in the structure. The key (unresolved address) is always the first field in the object.
Following that, the caller may add its own private data. The key length and the size of the user object (including the key) are specified in net_rslv_create.

There are three callback functions that can be set as arguments in net_rslv_create:

- cmp_fn: Compare function for the hash table. Arguments are the key and an object in the table. If this is NULL then the default memcmp of rhashtable is used.

- init_fn: Initialize a new net_rslv_ent structure. This allows initialization of the user portion of the structure (the object[]).

- destroy_fn: Called right before a net_rslv_ent is freed. This allows cleanup of user data associated with the entry.

Note that the resolver backend only tracks unresolved addresses; it is up to the caller to perform the actual mechanism of resolution. This includes the possibility of queuing packets awaiting resolution, which can be accomplished, for instance, by maintaining an skbuff queue in the net_rslv_ent user object[] data.

DOS mitigation is done by limiting the number of entries in the resolver table (the max_size argument of net_rslv_create) and setting a timeout. If the timeout is set then the maximum rate of new resolution requests is max_table_size / timeout. For instance, with a maximum size of 1000 entries and a timeout of 100 msecs the maximum rate of resolution requests is 10,000/s.
Signed-off-by: Tom Herbert--- include/net/resolver.h | 58 +++ net/Kconfig| 4 + net/core/Makefile | 1 + net/core/resolver.c| 267 + 4 files changed, 330 insertions(+) create mode 100644 include/net/resolver.h create mode 100644 net/core/resolver.c diff --git a/include/net/resolver.h b/include/net/resolver.h new file mode 100644 index 000..8f73b5c --- /dev/null +++ b/include/net/resolver.h @@ -0,0 +1,58 @@ +#ifndef __NET_RESOLVER_H +#define __NET_RESOLVER_H + +#include + +struct net_rslv; +struct net_rslv_ent; + +typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key, + const void *object); +typedef void (*net_rslv_initfn)(struct net_rslv *nrslv, void *object); +typedef void (*net_rslv_destroyfn)(struct net_rslv_ent *nrent); + +struct net_rslv { + struct rhashtable rhash_table; + struct rhashtable_params params; + net_rslv_cmpfn rslv_cmp; + net_rslv_initfn rslv_init; + net_rslv_destroyfn rslv_destroy; + size_t obj_size; + spinlock_t *locks; + unsigned int locks_mask; + unsigned int hash_rnd; + long timeout; +}; + +struct net_rslv_ent { + struct rcu_head rcu; + union { + /* Fields set when entry is in hash table */ + struct { + struct rhash_head node; + struct delayed_work timeout_work; + struct net_rslv *nrslv; + }; + + /* Fields set when rcu
[PATCH RFC 4/6] rhashtable: abstract out function to get hash
Split out most of rht_key_hashfn which is calculating the hash into its own function. This way the hash function can be called separately to get the hash value. Signed-off-by: Tom Herbert--- include/linux/rhashtable.h | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h index fd82584..e398a62 100644 --- a/include/linux/rhashtable.h +++ b/include/linux/rhashtable.h @@ -208,34 +208,42 @@ static inline unsigned int rht_bucket_index(const struct bucket_table *tbl, return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1); } -static inline unsigned int rht_key_hashfn( - struct rhashtable *ht, const struct bucket_table *tbl, - const void *key, const struct rhashtable_params params) +static inline unsigned int rht_key_get_hash(struct rhashtable *ht, + const void *key, const struct rhashtable_params params, + unsigned int hash_rnd) { unsigned int hash; /* params must be equal to ht->p if it isn't constant. */ if (!__builtin_constant_p(params.key_len)) - hash = ht->p.hashfn(key, ht->key_len, tbl->hash_rnd); + hash = ht->p.hashfn(key, ht->key_len, hash_rnd); else if (params.key_len) { unsigned int key_len = params.key_len; if (params.hashfn) - hash = params.hashfn(key, key_len, tbl->hash_rnd); + hash = params.hashfn(key, key_len, hash_rnd); else if (key_len & (sizeof(u32) - 1)) - hash = jhash(key, key_len, tbl->hash_rnd); + hash = jhash(key, key_len, hash_rnd); else - hash = jhash2(key, key_len / sizeof(u32), - tbl->hash_rnd); + hash = jhash2(key, key_len / sizeof(u32), hash_rnd); } else { unsigned int key_len = ht->p.key_len; if (params.hashfn) - hash = params.hashfn(key, key_len, tbl->hash_rnd); + hash = params.hashfn(key, key_len, hash_rnd); else - hash = jhash(key, key_len, tbl->hash_rnd); + hash = jhash(key, key_len, hash_rnd); } + return hash; +} + +static inline unsigned int rht_key_hashfn( + struct rhashtable *ht, const struct bucket_table *tbl, + const void *key, const struct 
rhashtable_params params) +{ + unsigned int hash = rht_key_get_hash(ht, key, params, tbl->hash_rnd); + return rht_bucket_index(tbl, hash); } -- 2.8.0.rc2
[PATCH RFC 0/6] net: ILA resolver and generic resolver backend
This patch set implements an ILA host-side resolver. This uses LWT to implement the hook to a userspace resolver and tracks pending unresolved addresses using the backend net resolver.

This patch set contains:

- A new library function to allocate an array of spinlocks for use with locking hash buckets.

- Make the hash function in rhashtable directly callable.

- A generic resolver backend infrastructure. This primarily does two things: tracks unresolved addresses and implements a timeout for resolutions that do not happen. These mechanisms provide rate-limiting control over resolution requests (for instance, in ILA it is used to rate limit requests to userspace to resolve addresses).

- The ILA resolver. This implements the path from the kernel ILA implementation to a userspace daemon, for signaling that an identifier address needs to be resolved.

Tom Herbert (6):
  spinlock: Add library function to allocate spinlock buckets array
  rhashtable: Call library function alloc_bucket_locks
  ila: Call library function alloc_bucket_locks
  rhashtable: abstract out function to get hash
  net: Generic resolver backend
  ila: Resolver mechanism

 include/linux/rhashtable.h     |  28 +++--
 include/linux/spinlock.h       |   6 +
 include/net/resolver.h         |  58 +
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   5 +
 lib/Makefile                   |   2 +-
 lib/bucket_locks.c             |  63 ++
 lib/rhashtable.c               |  46 +--
 net/Kconfig                    |   4 +
 net/core/Makefile              |   1 +
 net/core/resolver.c            | 267 +
 net/ipv6/Kconfig               |   1 +
 net/ipv6/ila/Makefile          |   2 +-
 net/ipv6/ila/ila.h             |  16 +++
 net/ipv6/ila/ila_common.c      |   7 ++
 net/ipv6/ila/ila_lwt.c         |   9 ++
 net/ipv6/ila/ila_resolver.c    | 192 +
 net/ipv6/ila/ila_xlat.c        |  51 ++--
 18 files changed, 666 insertions(+), 93 deletions(-)
 create mode 100644 include/net/resolver.h
 create mode 100644 lib/bucket_locks.c
 create mode 100644 net/core/resolver.c
 create mode 100644 net/ipv6/ila/ila_resolver.c

-- 
2.8.0.rc2
[PATCH RFC 2/6] rhashtable: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now call library function alloc_bucket_spinlocks. This function is based on the old alloc_bucket_locks in rhashtable and should produce the same effect. Signed-off-by: Tom Herbert--- lib/rhashtable.c | 46 -- 1 file changed, 4 insertions(+), 42 deletions(-) diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 06c2872..5b53304 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -59,50 +59,10 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held); #define ASSERT_RHT_MUTEX(HT) #endif - -static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl, - gfp_t gfp) -{ - unsigned int i, size; -#if defined(CONFIG_PROVE_LOCKING) - unsigned int nr_pcpus = 2; -#else - unsigned int nr_pcpus = num_possible_cpus(); -#endif - - nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL); - size = roundup_pow_of_two(nr_pcpus * ht->p.locks_mul); - - /* Never allocate more than 0.5 locks per bucket */ - size = min_t(unsigned int, size, tbl->size >> 1); - - if (sizeof(spinlock_t) != 0) { - tbl->locks = NULL; -#ifdef CONFIG_NUMA - if (size * sizeof(spinlock_t) > PAGE_SIZE && - gfp == GFP_KERNEL) - tbl->locks = vmalloc(size * sizeof(spinlock_t)); -#endif - if (gfp != GFP_KERNEL) - gfp |= __GFP_NOWARN | __GFP_NORETRY; - - if (!tbl->locks) - tbl->locks = kmalloc_array(size, sizeof(spinlock_t), - gfp); - if (!tbl->locks) - return -ENOMEM; - for (i = 0; i < size; i++) - spin_lock_init(>locks[i]); - } - tbl->locks_mask = size - 1; - - return 0; -} - static void bucket_table_free(const struct bucket_table *tbl) { if (tbl) - kvfree(tbl->locks); + free_bucket_spinlocks(tbl->locks); kvfree(tbl); } @@ -131,7 +91,9 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, tbl->size = nbuckets; - if (alloc_bucket_locks(ht, tbl, gfp) < 0) { + /* Never allocate more than 0.5 locks per bucket */ + if (alloc_bucket_spinlocks(>locks, >locks_mask, + tbl->size >> 1, ht->p.locks_mul, gfp)) { bucket_table_free(tbl); return NULL; } 
-- 2.8.0.rc2
[PATCH RFC 1/6] spinlock: Add library function to allocate spinlock buckets array
Add two new library functions, alloc_bucket_spinlocks and free_bucket_spinlocks. These are used to allocate and free an array of spinlocks that are useful as locks for hash buckets. The interface specifies the maximum number of spinlocks in the array as well as a CPU multiplier to derive the number of spinlocks to allocate. The number to allocate is rounded up to a power of two to make the array amenable to hash lookup.

Signed-off-by: Tom Herbert
---
 include/linux/spinlock.h | 6 + lib/Makefile | 2 +- lib/bucket_locks.c | 63 3 files changed, 70 insertions(+), 1 deletion(-) create mode 100644 lib/bucket_locks.c diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h index 47dd0ce..4ebdfbf 100644 --- a/include/linux/spinlock.h +++ b/include/linux/spinlock.h @@ -416,4 +416,10 @@ extern int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); #define atomic_dec_and_lock(atomic, lock) \ __cond_lock(lock, _atomic_dec_and_lock(atomic, lock)) +int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *lock_mask, + unsigned int max_size, unsigned int cpu_mult, + gfp_t gfp); + +void free_bucket_spinlocks(spinlock_t *locks); + #endif /* __LINUX_SPINLOCK_H */ diff --git a/lib/Makefile b/lib/Makefile index cfa68eb..a1dedf1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -37,7 +37,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ -once.o +once.o bucket_locks.o obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/bucket_locks.c b/lib/bucket_locks.c new file mode 100644 index 000..bb9bf11 --- /dev/null +++ b/lib/bucket_locks.c @@ -0,0 +1,63 @@ +#include +#include +#include +#include +#include + +/* Allocate an array of spinlocks to be accessed by a hash.
Two arguments + * indicate the number of elements to allocate in the array. max_size + * gives the maximum number of elements to allocate. cpu_mult gives + * the number of locks per CPU to allocate. The size is rounded up + * to a power of 2 to be suitable as a hash table. + */ +int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *locks_mask, + unsigned int max_size, unsigned int cpu_mult, + gfp_t gfp) +{ + unsigned int i, size; +#if defined(CONFIG_PROVE_LOCKING) + unsigned int nr_pcpus = 2; +#else + unsigned int nr_pcpus = num_possible_cpus(); +#endif + spinlock_t *tlocks = NULL; + + if (cpu_mult) { + nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL); + size = min_t(unsigned int, nr_pcpus * cpu_mult, max_size); + } else { + size = max_size; + } + size = roundup_pow_of_two(size); + + if (!size) + return -EINVAL; + + if (sizeof(spinlock_t) != 0) { +#ifdef CONFIG_NUMA + if (size * sizeof(spinlock_t) > PAGE_SIZE && + gfp == GFP_KERNEL) + tlocks = vmalloc(size * sizeof(spinlock_t)); +#endif + if (gfp != GFP_KERNEL) + gfp |= __GFP_NOWARN | __GFP_NORETRY; + + if (!tlocks) + tlocks = kmalloc_array(size, sizeof(spinlock_t), gfp); + if (!tlocks) + return -ENOMEM; + for (i = 0; i < size; i++) + spin_lock_init([i]); + } + *locks = tlocks; + *locks_mask = size - 1; + + return 0; +} +EXPORT_SYMBOL(alloc_bucket_spinlocks); + +void free_bucket_spinlocks(spinlock_t *locks) +{ + kvfree(locks); +} +EXPORT_SYMBOL(free_bucket_spinlocks); -- 2.8.0.rc2
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On Fri, Sep 9, 2016 at 3:26 PM, David Ahern wrote:
> On 9/9/16 3:53 PM, Mahesh Bandewar wrote:
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index 0c5415b05ea9..95edd1737ab5 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -149,6 +149,7 @@ config IPVLAN
>>      tristate "IP-VLAN support"
>>      depends on INET
>>      depends on IPV6
>> +    select NET_L3_MASTER_DEV
>
> depends on instead of select?

kbuild/kconfig-language.txt suggests that with "depends on" the option _must_ already be selected, otherwise menuconfig won't even present the dependent option, while "select" positively sets the option. INET and IPv6 are well understood and almost all configs select them. L3_MASTER is very new and not well understood, so the chances of someone _not_ putting them (IPvlan and L3_MASTER) in the same context are very high.
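As a minimal Kconfig sketch of the distinction under discussion (symbols from the patch; comments illustrative):

```
config IPVLAN
	tristate "IP-VLAN support"
	depends on INET
	depends on IPV6
	# "select" silently enables NET_L3_MASTER_DEV whenever IPVLAN is set
	select NET_L3_MASTER_DEV

# With "depends on NET_L3_MASTER_DEV" instead of "select", IPVLAN would
# stay hidden in menuconfig until the user had found and enabled
# NET_L3_MASTER_DEV on their own.
```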
[PATCH 5/9] selftests: Update prctl Makefile to work under selftests
Update prctl Makefile to work under selftests. prctl will not be run as part of the selftests suite and will not be included in install targets. They can be built separately for now.

Signed-off-by: Shuah Khan
---
 tools/testing/selftests/prctl/Makefile | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile index 44de308..35aa1c8 100644 --- a/tools/testing/selftests/prctl/Makefile +++ b/tools/testing/selftests/prctl/Makefile @@ -1,10 +1,15 @@ ifndef CROSS_COMPILE -# List of programs to build -hostprogs-$(CONFIG_X86) := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test disable-tsc-test -# Tell kbuild to always build the programs -always := $(hostprogs-y) +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/) -HOSTCFLAGS_disable-tsc-ctxt-sw-stress-test.o += -I$(objtree)/usr/include -HOSTCFLAGS_disable-tsc-on-off-stress-test.o += -I$(objtree)/usr/include -HOSTCFLAGS_disable-tsc-test.o += -I$(objtree)/usr/include +ifeq ($(ARCH),x86) +TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \ + disable-tsc-test +all: $(TEST_PROGS) + +include ../lib.mk + +clean: + rm -fr $(TEST_PROGS) +endif endif -- 2.7.4
[PATCH 0/9] Move runnable code (tests) from Documentation to selftests
Move runnable code (tests) from Documentation to selftests and update Makefiles to work under selftests. Jon Corbet and I discussed this in an email thread and as per that discussion, this patch series moves all the tests that are under the Documentation directory to selftests. There is more runnable code in the form of examples and utils and that is going to be another patch series. I moved just the tests and left the documentation files as is. Checkpatch isn't happy with a few of the patches as some of the renamed files have existing checkpatch errors and warnings. I am working another patch series that will address those. Shuah Khan (9): selftests: move dnotify_test from Documentation/filesystems selftests: update filesystems Makefile to work under selftests selftests: move .gitignore from Documentation/filesystems selftests: move prctl tests from Documentation/prctl selftests: Update prctl Makefile to work under selftests selftests: move ptp tests from Documentation/ptp selftests: Update ptp Makefile to work under selftests selftests: move vDSO tests from Documentation/vDSO selftests: Update vDSO Makefile to work under selftests Documentation/filesystems/.gitignore | 1 - Documentation/filesystems/Makefile | 5 - Documentation/filesystems/dnotify_test.c | 34 -- Documentation/prctl/.gitignore | 3 - Documentation/prctl/Makefile | 10 - .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 .../prctl/disable-tsc-on-off-stress-test.c | 96 Documentation/prctl/disable-tsc-test.c | 95 Documentation/ptp/.gitignore | 1 - Documentation/ptp/Makefile | 8 - Documentation/ptp/testptp.c| 523 - Documentation/ptp/testptp.mk | 33 -- Documentation/vDSO/.gitignore | 2 - Documentation/vDSO/Makefile| 17 - Documentation/vDSO/parse_vdso.c| 269 --- Documentation/vDSO/vdso_standalone_test_x86.c | 128 - Documentation/vDSO/vdso_test.c | 52 -- tools/testing/selftests/filesystems/.gitignore | 1 + tools/testing/selftests/filesystems/Makefile | 7 + tools/testing/selftests/filesystems/dnotify_test.c | 
34 ++ tools/testing/selftests/prctl/.gitignore | 3 + tools/testing/selftests/prctl/Makefile | 15 + .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 .../prctl/disable-tsc-on-off-stress-test.c | 96 tools/testing/selftests/prctl/disable-tsc-test.c | 95 tools/testing/selftests/ptp/.gitignore | 1 + tools/testing/selftests/ptp/Makefile | 8 + tools/testing/selftests/ptp/testptp.c | 523 + tools/testing/selftests/ptp/testptp.mk | 33 ++ tools/testing/selftests/vDSO/.gitignore| 2 + tools/testing/selftests/vDSO/Makefile | 20 + tools/testing/selftests/vDSO/parse_vdso.c | 269 +++ .../selftests/vDSO/vdso_standalone_test_x86.c | 128 + tools/testing/selftests/vDSO/vdso_test.c | 52 ++ 34 files changed, 1384 insertions(+), 1374 deletions(-) delete mode 100644 Documentation/filesystems/.gitignore delete mode 100644 Documentation/filesystems/Makefile delete mode 100644 Documentation/filesystems/dnotify_test.c delete mode 100644 Documentation/prctl/.gitignore delete mode 100644 Documentation/prctl/Makefile delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c delete mode 100644 Documentation/prctl/disable-tsc-test.c delete mode 100644 Documentation/ptp/.gitignore delete mode 100644 Documentation/ptp/Makefile delete mode 100644 Documentation/ptp/testptp.c delete mode 100644 Documentation/ptp/testptp.mk delete mode 100644 Documentation/vDSO/.gitignore delete mode 100644 Documentation/vDSO/Makefile delete mode 100644 Documentation/vDSO/parse_vdso.c delete mode 100644 Documentation/vDSO/vdso_standalone_test_x86.c delete mode 100644 Documentation/vDSO/vdso_test.c create mode 100644 tools/testing/selftests/filesystems/.gitignore create mode 100644 tools/testing/selftests/filesystems/Makefile create mode 100644 tools/testing/selftests/filesystems/dnotify_test.c create mode 100644 tools/testing/selftests/prctl/.gitignore create mode 100644 tools/testing/selftests/prctl/Makefile create mode 100644 
tools/testing/selftests/prctl/disable-tsc-ctxt-sw-stress-test.c create mode 100644 tools/testing/selftests/prctl/disable-tsc-on-off-stress-test.c create mode 100644 tools/testing/selftests/prctl/disable-tsc-test.c create mode 100644 tools/testing/selftests/ptp/.gitignore create mode 100644 tools/testing/selftests/ptp/Makefile create mode 100644
[PATCH 3/9] selftests: move .gitignore from Documentation/filesystems
Move .gitignore for dnotify_test from Documentation/filesystems to selftests/filesystems. Signed-off-by: Shuah Khan--- Documentation/filesystems/.gitignore | 1 - tools/testing/selftests/filesystems/.gitignore | 1 + 2 files changed, 1 insertion(+), 1 deletion(-) delete mode 100644 Documentation/filesystems/.gitignore create mode 100644 tools/testing/selftests/filesystems/.gitignore diff --git a/Documentation/filesystems/.gitignore b/Documentation/filesystems/.gitignore deleted file mode 100644 index 31d6e42..000 --- a/Documentation/filesystems/.gitignore +++ /dev/null @@ -1 +0,0 @@ -dnotify_test diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore new file mode 100644 index 000..31d6e42 --- /dev/null +++ b/tools/testing/selftests/filesystems/.gitignore @@ -0,0 +1 @@ +dnotify_test -- 2.7.4
[PATCH 4/9] selftests: move prctl tests from Documentation/prctl
Move prctl tests from Documentation/prctl to selftests/prctl. Signed-off-by: Shuah Khan--- Documentation/prctl/.gitignore | 3 - Documentation/prctl/Makefile | 10 --- .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 -- .../prctl/disable-tsc-on-off-stress-test.c | 96 - Documentation/prctl/disable-tsc-test.c | 95 - tools/testing/selftests/prctl/.gitignore | 3 + tools/testing/selftests/prctl/Makefile | 10 +++ .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 ++ .../prctl/disable-tsc-on-off-stress-test.c | 96 + tools/testing/selftests/prctl/disable-tsc-test.c | 95 + 10 files changed, 301 insertions(+), 301 deletions(-) delete mode 100644 Documentation/prctl/.gitignore delete mode 100644 Documentation/prctl/Makefile delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c delete mode 100644 Documentation/prctl/disable-tsc-test.c create mode 100644 tools/testing/selftests/prctl/.gitignore create mode 100644 tools/testing/selftests/prctl/Makefile create mode 100644 tools/testing/selftests/prctl/disable-tsc-ctxt-sw-stress-test.c create mode 100644 tools/testing/selftests/prctl/disable-tsc-on-off-stress-test.c create mode 100644 tools/testing/selftests/prctl/disable-tsc-test.c diff --git a/Documentation/prctl/.gitignore b/Documentation/prctl/.gitignore deleted file mode 100644 index 0b5c274..000 --- a/Documentation/prctl/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -disable-tsc-ctxt-sw-stress-test -disable-tsc-on-off-stress-test -disable-tsc-test diff --git a/Documentation/prctl/Makefile b/Documentation/prctl/Makefile deleted file mode 100644 index 44de308..000 --- a/Documentation/prctl/Makefile +++ /dev/null @@ -1,10 +0,0 @@ -ifndef CROSS_COMPILE -# List of programs to build -hostprogs-$(CONFIG_X86) := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test disable-tsc-test -# Tell kbuild to always build the programs -always := $(hostprogs-y) - 
-HOSTCFLAGS_disable-tsc-ctxt-sw-stress-test.o += -I$(objtree)/usr/include -HOSTCFLAGS_disable-tsc-on-off-stress-test.o += -I$(objtree)/usr/include -HOSTCFLAGS_disable-tsc-test.o += -I$(objtree)/usr/include -endif diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c deleted file mode 100644 index f7499d1..000 --- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c +++ /dev/null @@ -1,97 +0,0 @@ -/* - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...) - * - * Tests if the control register is updated correctly - * at context switches - * - * Warning: this test will cause a very high load for a few seconds - * - */ - -#include -#include -#include -#include -#include -#include - - -#include -#include - -/* Get/set the process' ability to use the timestamp counter instruction */ -#ifndef PR_GET_TSC -#define PR_GET_TSC 25 -#define PR_SET_TSC 26 -# define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */ -# define PR_TSC_SIGSEGV2 /* throw a SIGSEGV instead of reading the TSC */ -#endif - -static uint64_t rdtsc(void) -{ -uint32_t lo, hi; -/* We cannot use "=A", since this would use %rax on x86_64 */ -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); -return (uint64_t)hi << 32 | lo; -} - -static void sigsegv_expect(int sig) -{ - /* */ -} - -static void segvtask(void) -{ - if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0) - { - perror("prctl"); - exit(0); - } - signal(SIGSEGV, sigsegv_expect); - alarm(10); - rdtsc(); - fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n"); - exit(0); -} - - -static void sigsegv_fail(int sig) -{ - fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n"); - exit(0); -} - -static void rdtsctask(void) -{ - if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0) - { - perror("prctl"); - exit(0); - } - signal(SIGSEGV, sigsegv_fail); - alarm(10); - for(;;) rdtsc(); -} - - -int main(void) -{ - int n_tasks = 100, i; - - fprintf(stderr, "[No further output 
means we're allright]\n"); - - for (i=0; i
[PATCH 7/9] selftests: Update ptp Makefile to work under selftests
Update ptp Makefile to work under selftests. ptp will not be run as part of the selftests suite and will not be included in install targets. They can be built separately for now.

Signed-off-by: Shuah Khan
---
 tools/testing/selftests/ptp/Makefile | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/ptp/Makefile b/tools/testing/selftests/ptp/Makefile index 293d6c0..f4a7238 100644 --- a/tools/testing/selftests/ptp/Makefile +++ b/tools/testing/selftests/ptp/Makefile @@ -1,8 +1,8 @@ -# List of programs to build -hostprogs-y := testptp +TEST_PROGS := testptp +LDLIBS += -lrt +all: $(TEST_PROGS) -# Tell kbuild to always build the programs -always := $(hostprogs-y) +include ../lib.mk -HOSTCFLAGS_testptp.o += -I$(objtree)/usr/include -HOSTLOADLIBES_testptp := -lrt +clean: + rm -fr testptp -- 2.7.4
[PATCH 2/9] selftests: update filesystems Makefile to work under selftests
Update to work under selftests. dnotify_test will not be run as part of the selftests suite and will not be included in install targets. It can be built separately for now.

Signed-off-by: Shuah Khan
---
 tools/testing/selftests/filesystems/Makefile | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile index 883010c..f1dce5c 100644 --- a/tools/testing/selftests/filesystems/Makefile +++ b/tools/testing/selftests/filesystems/Makefile @@ -1,5 +1,7 @@ -# List of programs to build -hostprogs-y := dnotify_test +TEST_PROGS := dnotify_test +all: $(TEST_PROGS) -# Tell kbuild to always build the programs -always := $(hostprogs-y) +include ../lib.mk + +clean: + rm -fr dnotify_test -- 2.7.4
[PATCH 1/9] selftests: move dnotify_test from Documentation/filesystems
Move dnotify_test from Documentation/filesystems to selftests/filesystems Signed-off-by: Shuah Khan--- Documentation/filesystems/Makefile | 5 Documentation/filesystems/dnotify_test.c | 34 -- tools/testing/selftests/filesystems/Makefile | 5 tools/testing/selftests/filesystems/dnotify_test.c | 34 ++ 4 files changed, 39 insertions(+), 39 deletions(-) delete mode 100644 Documentation/filesystems/Makefile delete mode 100644 Documentation/filesystems/dnotify_test.c create mode 100644 tools/testing/selftests/filesystems/Makefile create mode 100644 tools/testing/selftests/filesystems/dnotify_test.c diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile deleted file mode 100644 index 883010c..000 --- a/Documentation/filesystems/Makefile +++ /dev/null @@ -1,5 +0,0 @@ -# List of programs to build -hostprogs-y := dnotify_test - -# Tell kbuild to always build the programs -always := $(hostprogs-y) diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c deleted file mode 100644 index 8b37b4a..000 --- a/Documentation/filesystems/dnotify_test.c +++ /dev/null @@ -1,34 +0,0 @@ -#define _GNU_SOURCE/* needed to get the defines */ -#include /* in glibc 2.2 this has the needed - values defined */ -#include -#include -#include - -static volatile int event_fd; - -static void handler(int sig, siginfo_t *si, void *data) -{ - event_fd = si->si_fd; -} - -int main(void) -{ - struct sigaction act; - int fd; - - act.sa_sigaction = handler; - sigemptyset(_mask); - act.sa_flags = SA_SIGINFO; - sigaction(SIGRTMIN + 1, , NULL); - - fd = open(".", O_RDONLY); - fcntl(fd, F_SETSIG, SIGRTMIN + 1); - fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); - /* we will now be notified if any of the files - in "." 
is modified or new files are created */ - while (1) { - pause(); - printf("Got event on fd=%d\n", event_fd); - } -} diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile new file mode 100644 index 0000000..883010c --- /dev/null +++ b/tools/testing/selftests/filesystems/Makefile @@ -0,0 +1,5 @@ +# List of programs to build +hostprogs-y := dnotify_test + +# Tell kbuild to always build the programs +always := $(hostprogs-y) diff --git a/tools/testing/selftests/filesystems/dnotify_test.c b/tools/testing/selftests/filesystems/dnotify_test.c new file mode 100644 index 0000000..8b37b4a --- /dev/null +++ b/tools/testing/selftests/filesystems/dnotify_test.c @@ -0,0 +1,34 @@ +#define _GNU_SOURCE /* needed to get the defines */ +#include <fcntl.h> /* in glibc 2.2 this has the needed + values defined */ +#include <signal.h> +#include <stdio.h> +#include <unistd.h> + +static volatile int event_fd; + +static void handler(int sig, siginfo_t *si, void *data) +{ + event_fd = si->si_fd; +} + +int main(void) +{ + struct sigaction act; + int fd; + + act.sa_sigaction = handler; + sigemptyset(&act.sa_mask); + act.sa_flags = SA_SIGINFO; + sigaction(SIGRTMIN + 1, &act, NULL); + + fd = open(".", O_RDONLY); + fcntl(fd, F_SETSIG, SIGRTMIN + 1); + fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); + /* we will now be notified if any of the files + in "." is modified or new files are created */ + while (1) { + pause(); + printf("Got event on fd=%d\n", event_fd); + } +} -- 2.7.4
[PATCH 9/9] selftests: Update vDSO Makefile to work under selftests
Update vDSO Makefile to work under selftests. The vDSO tests will not be run as part of the selftests suite and will not be included in install targets. They can be built separately for now. Signed-off-by: Shuah Khan --- tools/testing/selftests/vDSO/Makefile | 29 - 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile index b12e987..706b68b 100644 --- a/tools/testing/selftests/vDSO/Makefile +++ b/tools/testing/selftests/vDSO/Makefile @@ -1,17 +1,20 @@ ifndef CROSS_COMPILE -# vdso_test won't build for glibc < 2.16, so disable it -# hostprogs-y := vdso_test -hostprogs-$(CONFIG_X86) := vdso_standalone_test_x86 -vdso_standalone_test_x86-objs := vdso_standalone_test_x86.o parse_vdso.o -vdso_test-objs := parse_vdso.o vdso_test.o - -# Tell kbuild to always build the programs -always := $(hostprogs-y) - -HOSTCFLAGS := -I$(objtree)/usr/include -std=gnu99 -HOSTCFLAGS_vdso_standalone_test_x86.o := -fno-asynchronous-unwind-tables -fno-stack-protector -HOSTLOADLIBES_vdso_standalone_test_x86 := -nostdlib +CFLAGS := -std=gnu99 +CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector ifeq ($(CONFIG_X86_32),y) -HOSTLOADLIBES_vdso_standalone_test_x86 += -lgcc_s +LDLIBS += -lgcc_s endif + +TEST_PROGS := vdso_test vdso_standalone_test_x86 + +all: $(TEST_PROGS) +vdso_test: parse_vdso.c vdso_test.c +vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c + $(CC) $(CFLAGS) $(CFLAGS_vdso_standalone_test_x86) \ + vdso_standalone_test_x86.c parse_vdso.c \ + -o vdso_standalone_test_x86 + +include ../lib.mk +clean: + rm -fr $(TEST_PROGS) endif -- 2.7.4
[PATCH 6/9] selftests: move ptp tests from Documentation/ptp
Move ptp tests from Documentation/ptp to selftests/ptp. Signed-off-by: Shuah Khan--- Documentation/ptp/.gitignore | 1 - Documentation/ptp/Makefile | 8 - Documentation/ptp/testptp.c| 523 - Documentation/ptp/testptp.mk | 33 --- tools/testing/selftests/ptp/.gitignore | 1 + tools/testing/selftests/ptp/Makefile | 8 + tools/testing/selftests/ptp/testptp.c | 523 + tools/testing/selftests/ptp/testptp.mk | 33 +++ 8 files changed, 565 insertions(+), 565 deletions(-) delete mode 100644 Documentation/ptp/.gitignore delete mode 100644 Documentation/ptp/Makefile delete mode 100644 Documentation/ptp/testptp.c delete mode 100644 Documentation/ptp/testptp.mk create mode 100644 tools/testing/selftests/ptp/.gitignore create mode 100644 tools/testing/selftests/ptp/Makefile create mode 100644 tools/testing/selftests/ptp/testptp.c create mode 100644 tools/testing/selftests/ptp/testptp.mk diff --git a/Documentation/ptp/.gitignore b/Documentation/ptp/.gitignore deleted file mode 100644 index f562e49..000 --- a/Documentation/ptp/.gitignore +++ /dev/null @@ -1 +0,0 @@ -testptp diff --git a/Documentation/ptp/Makefile b/Documentation/ptp/Makefile deleted file mode 100644 index 293d6c0..000 --- a/Documentation/ptp/Makefile +++ /dev/null @@ -1,8 +0,0 @@ -# List of programs to build -hostprogs-y := testptp - -# Tell kbuild to always build the programs -always := $(hostprogs-y) - -HOSTCFLAGS_testptp.o += -I$(objtree)/usr/include -HOSTLOADLIBES_testptp := -lrt diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c deleted file mode 100644 index 5d2eae1..000 --- a/Documentation/ptp/testptp.c +++ /dev/null @@ -1,523 +0,0 @@ -/* - * PTP 1588 clock support - User space test program - * - * Copyright (C) 2010 OMICRON electronics GmbH - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later 
version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - */ -#define _GNU_SOURCE -#define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */ -#include <errno.h> -#include <fcntl.h> -#include <inttypes.h> -#include <math.h> -#include <signal.h> -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <sys/ioctl.h> -#include <sys/mman.h> -#include <sys/stat.h> -#include <sys/time.h> -#include <sys/timex.h> -#include <sys/types.h> -#include <time.h> -#include <unistd.h> - -#include <linux/ptp_clock.h> - -#define DEVICE "/dev/ptp0" - -#ifndef ADJ_SETOFFSET -#define ADJ_SETOFFSET 0x0100 -#endif - -#ifndef CLOCK_INVALID -#define CLOCK_INVALID -1 -#endif - -/* clock_adjtime is not available in GLIBC < 2.14 */ -#if !__GLIBC_PREREQ(2, 14) -#include <sys/syscall.h> -static int clock_adjtime(clockid_t id, struct timex *tx) -{ - return syscall(__NR_clock_adjtime, id, tx); -} -#endif - -static clockid_t get_clockid(int fd) -{ -#define CLOCKFD 3 -#define FD_TO_CLOCKID(fd) ((~(clockid_t) (fd) << 3) | CLOCKFD) - - return FD_TO_CLOCKID(fd); -} - -static void handle_alarm(int s) -{ - printf("received signal %d\n", s); -} - -static int install_handler(int signum, void (*handler)(int)) -{ - struct sigaction action; - sigset_t mask; - - /* Unblock the signal. */ - sigemptyset(&mask); - sigaddset(&mask, signum); - sigprocmask(SIG_UNBLOCK, &mask, NULL); - - /* Install the signal handler. */ - action.sa_handler = handler; - action.sa_flags = 0; - sigemptyset(&action.sa_mask); - sigaction(signum, &action, NULL); - - return 0; -} - -static long ppb_to_scaled_ppm(int ppb) -{ - /* -* The 'freq' field in the 'struct timex' is in parts per -* million, but with a 16 bit binary fractional field.
-* Instead of calculating either one of -* -*scaled_ppm = (ppb / 1000) << 16 [1] -*scaled_ppm = (ppb << 16) / 1000 [2] -* -* we simply use double precision math, in order to avoid the -* truncation in [1] and the possible overflow in [2]. -*/ - return (long) (ppb * 65.536); -} - -static int64_t pctns(struct ptp_clock_time *t) -{ - return t->sec * 1000000000LL + t->nsec; -} - -static void usage(char *progname) -{ - fprintf(stderr, - "usage: %s [options]\n" - " -a val request a one-shot alarm after 'val' seconds\n" - " -A val request a periodic alarm every 'val' seconds\n" - " -c query the ptp clock's capabilities\n" -
[PATCH 8/9] selftests: move vDSO tests from Documentation/vDSO
Move vDSO tests from Documentation/vDSO to selftests/vDSO. Signed-off-by: Shuah Khan--- Documentation/vDSO/.gitignore | 2 - Documentation/vDSO/Makefile| 17 -- Documentation/vDSO/parse_vdso.c| 269 - Documentation/vDSO/vdso_standalone_test_x86.c | 128 -- Documentation/vDSO/vdso_test.c | 52 tools/testing/selftests/vDSO/.gitignore| 2 + tools/testing/selftests/vDSO/Makefile | 17 ++ tools/testing/selftests/vDSO/parse_vdso.c | 269 + .../selftests/vDSO/vdso_standalone_test_x86.c | 128 ++ tools/testing/selftests/vDSO/vdso_test.c | 52 10 files changed, 468 insertions(+), 468 deletions(-) delete mode 100644 Documentation/vDSO/.gitignore delete mode 100644 Documentation/vDSO/Makefile delete mode 100644 Documentation/vDSO/parse_vdso.c delete mode 100644 Documentation/vDSO/vdso_standalone_test_x86.c delete mode 100644 Documentation/vDSO/vdso_test.c create mode 100644 tools/testing/selftests/vDSO/.gitignore create mode 100644 tools/testing/selftests/vDSO/Makefile create mode 100644 tools/testing/selftests/vDSO/parse_vdso.c create mode 100644 tools/testing/selftests/vDSO/vdso_standalone_test_x86.c create mode 100644 tools/testing/selftests/vDSO/vdso_test.c diff --git a/Documentation/vDSO/.gitignore b/Documentation/vDSO/.gitignore deleted file mode 100644 index 133bf9e..000 --- a/Documentation/vDSO/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -vdso_test -vdso_standalone_test_x86 diff --git a/Documentation/vDSO/Makefile b/Documentation/vDSO/Makefile deleted file mode 100644 index b12e987..000 --- a/Documentation/vDSO/Makefile +++ /dev/null @@ -1,17 +0,0 @@ -ifndef CROSS_COMPILE -# vdso_test won't build for glibc < 2.16, so disable it -# hostprogs-y := vdso_test -hostprogs-$(CONFIG_X86) := vdso_standalone_test_x86 -vdso_standalone_test_x86-objs := vdso_standalone_test_x86.o parse_vdso.o -vdso_test-objs := parse_vdso.o vdso_test.o - -# Tell kbuild to always build the programs -always := $(hostprogs-y) - -HOSTCFLAGS := -I$(objtree)/usr/include -std=gnu99 
-HOSTCFLAGS_vdso_standalone_test_x86.o := -fno-asynchronous-unwind-tables -fno-stack-protector -HOSTLOADLIBES_vdso_standalone_test_x86 := -nostdlib -ifeq ($(CONFIG_X86_32),y) -HOSTLOADLIBES_vdso_standalone_test_x86 += -lgcc_s -endif -endif diff --git a/Documentation/vDSO/parse_vdso.c b/Documentation/vDSO/parse_vdso.c deleted file mode 100644 index 1dbb4b8..0000000 --- a/Documentation/vDSO/parse_vdso.c +++ /dev/null @@ -1,269 +0,0 @@ -/* - * parse_vdso.c: Linux reference vDSO parser - * Written by Andrew Lutomirski, 2011-2014. - * - * This code is meant to be linked in to various programs that run on Linux. - * As such, it is available with as few restrictions as possible. This file - * is licensed under the Creative Commons Zero License, version 1.0, - * available at http://creativecommons.org/publicdomain/zero/1.0/legalcode - * - * The vDSO is a regular ELF DSO that the kernel maps into user space when - * it starts a program. It works equally well in statically and dynamically - * linked binaries. - * - * This code is tested on x86. In principle it should work on any - * architecture that has a vDSO. - */ - -#include <stdbool.h> -#include <stdint.h> -#include <string.h> -#include <limits.h> -#include <elf.h> - -/* - * To use this vDSO parser, first call one of the vdso_init_* functions. - * If you've already parsed auxv, then pass the value of AT_SYSINFO_EHDR - * to vdso_init_from_sysinfo_ehdr. Otherwise pass auxv to vdso_init_from_auxv. - * Then call vdso_sym for each symbol you want. For example, to look up - * gettimeofday on x86_64, use: - * - * <some pointer> = vdso_sym("LINUX_2.6", "gettimeofday"); - * or - * <some pointer> = vdso_sym("LINUX_2.6", "__vdso_gettimeofday"); - * - * vdso_sym will return 0 if the symbol doesn't exist or if the init function - * failed or was not called. vdso_sym is a little slow, so its return value - * should be cached. - * - * vdso_sym is threadsafe; the init functions are not.
- * - * These are the prototypes: - */ -extern void vdso_init_from_auxv(void *auxv); -extern void vdso_init_from_sysinfo_ehdr(uintptr_t base); -extern void *vdso_sym(const char *version, const char *name); - - -/* And here's the code. */ -#ifndef ELF_BITS -# if ULONG_MAX > 0xffffffffUL -# define ELF_BITS 64 -# else -# define ELF_BITS 32 -# endif -#endif - -#define ELF_BITS_XFORM2(bits, x) Elf##bits##_##x -#define ELF_BITS_XFORM(bits, x) ELF_BITS_XFORM2(bits, x) -#define ELF(x) ELF_BITS_XFORM(ELF_BITS, x) - -static struct vdso_info -{ - bool valid; - - /* Load information */ - uintptr_t load_addr; - uintptr_t load_offset; /* load_addr - recorded vaddr */ - - /* Symbol table */ - ELF(Sym) *symtab; - const char *symstrings; - ELF(Word) *bucket, *chain; - ELF(Word) nbucket, nchain; - - /* Version table
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On 9/9/16 3:53 PM, Mahesh Bandewar wrote: > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > index 0c5415b05ea9..95edd1737ab5 100644 > --- a/drivers/net/Kconfig > +++ b/drivers/net/Kconfig > @@ -149,6 +149,7 @@ config IPVLAN > tristate "IP-VLAN support" > depends on INET > depends on IPV6 > +select NET_L3_MASTER_DEV

depends on instead of select?
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On Fri, Sep 9, 2016 at 3:07 PM, Rick Joneswrote: > On 09/09/2016 02:53 PM, Mahesh Bandewar wrote: > >> @@ -48,6 +48,11 @@ master device for the L2 processing and routing from >> that instance will be >> used before packets are queued on the outbound device. In this mode the >> slaves >> will not receive nor can send multicast / broadcast traffic. >> >> +4.3 L3S mode: >> + This is very similar to the L3 mode except that iptables >> conn-tracking >> +works in this mode and that is why L3-symsetric (L3s) from iptables >> perspective. >> +This will have slightly less performance but that shouldn't matter since >> you >> +are choosing this mode over plain-L3 mode to make conn-tracking work. > > > What is that first sentence trying to say? It appears to be incomplete, and > is that supposed to be "L3-symmetric?" > Apologies! Seems like I picked up wrong text file (I'll correct this in next ver). BTW it should read - " This is very similar to L3 mode except that iptables (conn-tracking) works in this mode and hence it is L3-symmetric (L3s). This will have ..." > happy benchmarking, > > rick jones
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On 09/09/2016 02:53 PM, Mahesh Bandewar wrote: @@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves will not receive nor can send multicast / broadcast traffic. +4.3 L3S mode: + This is very similar to the L3 mode except that iptables conn-tracking +works in this mode and that is why L3-symsetric (L3s) from iptables perspective. +This will have slightly less performance but that shouldn't matter since you +are choosing this mode over plain-L3 mode to make conn-tracking work. What is that first sentence trying to say? It appears to be incomplete, and is that supposed to be "L3-symmetric?" happy benchmarking, rick jones
Re: [net-next PATCH v2 1/2] e1000: add initial XDP support
On Fri, 2016-09-09 at 14:29 -0700, John Fastabend wrote: > From: Alexei Starovoitov

So it looks like e1000_xmit_raw_frame() can return early, say if there is no available descriptor.

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, > + unsigned int len, > + struct net_device *netdev, > + struct e1000_adapter *adapter) > +{ > + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > + struct e1000_hw *hw = &adapter->hw; > + struct e1000_tx_ring *tx_ring; > + > + if (len > E1000_MAX_DATA_PER_TXD) > + return; > + > + /* e1000 only support a single txq at the moment so the queue is being > + * shared with stack. To support this requires locking to ensure the > + * stack and XDP are not running at the same time. Devices with > + * multiple queues should allocate a separate queue space. > + */ > + HARD_TX_LOCK(netdev, txq, smp_processor_id()); > + > + tx_ring = adapter->tx_ring; > + > + if (E1000_DESC_UNUSED(tx_ring) < 2) { > + HARD_TX_UNLOCK(netdev, txq); > + return; > + } > + > + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); > + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); > + > + writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt); > + mmiowb(); > + > + HARD_TX_UNLOCK(netdev, txq); > +} > + > #define NUM_REGS 38 /* 1 based count */ > static void e1000_regdump(struct e1000_adapter *adapter) > { > @@ -4142,6 +4247,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct > e1000_adapter *adapter, > return skb; > } > + act = e1000_call_bpf(prog, page_address(p), length); > + switch (act) { > + case XDP_PASS: > + break; > + case XDP_TX: > + dma_sync_single_for_device(&adapter->pdev->dev, > +dma, > +length, > +DMA_TO_DEVICE); > + e1000_xmit_raw_frame(buffer_info, length, > + netdev, adapter); > + buffer_info->rxbuf.page = NULL;

So I am trying to understand how pages are not leaked ?
[PATCH next 2/3] net: Add _nf_(un)register_hooks symbols
From: Mahesh Bandewar

Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow caller to hold RTNL mutex. Signed-off-by: Mahesh Bandewar --- include/linux/netfilter.h | 2 ++ net/netfilter/core.c | 51 ++- 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 9230f9aee896..e82b76781bf6 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg); void nf_unregister_hook(struct nf_hook_ops *reg); int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n); void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n); +int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n); +void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n); /* Functions to register get/setsockopt ranges (non-inclusive). You need to check permissions yourself! */ diff --git a/net/netfilter/core.c b/net/netfilter/core.c index f39276d1c2d7..2c5327e43a88 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks); static LIST_HEAD(nf_hook_list); -int nf_register_hook(struct nf_hook_ops *reg) +static int _nf_register_hook(struct nf_hook_ops *reg) { struct net *net, *last; int ret; - rtnl_lock(); for_each_net(net) { ret = nf_register_net_hook(net, reg); if (ret && ret != -ENOENT) goto rollback; } list_add_tail(&reg->list, &nf_hook_list); - rtnl_unlock(); return 0; rollback: @@ -210,19 +208,34 @@ rollback: break; nf_unregister_net_hook(net, reg); } + return ret; +} + +int nf_register_hook(struct nf_hook_ops *reg) +{ + int ret; + + rtnl_lock(); + ret = _nf_register_hook(reg); rtnl_unlock(); + return ret; } EXPORT_SYMBOL(nf_register_hook); -void nf_unregister_hook(struct nf_hook_ops *reg) +static void _nf_unregister_hook(struct nf_hook_ops *reg) { struct net *net; - rtnl_lock(); list_del(&reg->list); for_each_net(net) nf_unregister_net_hook(net, reg); +} + +void
nf_unregister_hook(struct nf_hook_ops *reg) +{ + rtnl_lock(); + _nf_unregister_hook(reg); rtnl_unlock(); } EXPORT_SYMBOL(nf_unregister_hook); @@ -246,6 +259,26 @@ err: } EXPORT_SYMBOL(nf_register_hooks); +/* Caller MUST take rtnl_lock() */ +int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n) +{ + unsigned int i; + int err = 0; + + for (i = 0; i < n; i++) { + err = _nf_register_hook(&reg[i]); + if (err) + goto err; + } + return err; + +err: + if (i > 0) + _nf_unregister_hooks(reg, i); + return err; +} +EXPORT_SYMBOL(_nf_register_hooks); + void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) { while (n-- > 0) @@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) } EXPORT_SYMBOL(nf_unregister_hooks); +/* Caller MUST take rtnl_lock */ +void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) +{ + while (n-- > 0) + _nf_unregister_hook(&reg[n]); +} +EXPORT_SYMBOL(_nf_unregister_hooks); + unsigned int nf_iterate(struct list_head *head, struct sk_buff *skb, struct nf_hook_state *state, -- 2.8.0.rc3.226.g39d4020
[PATCH next 1/3] ipv6: Export p6_route_input_lookup symbol
From: Mahesh Bandewar

Make ip6_route_input_lookup available outside of the ipv6 module, similar to ip_route_input_noref in the IPv4 world. Signed-off-by: Mahesh Bandewar --- include/net/ip6_route.h | 3 +++ net/ipv6/route.c| 7 --- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index d97305d0e71f..e0cd318d5103 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr) } void ip6_route_input(struct sk_buff *skb); +struct dst_entry *ip6_route_input_lookup(struct net *net, +struct net_device *dev, +struct flowi6 *fl6, int flags); struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, struct flowi6 *fl6, int flags); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 09d43ff11a8d..9563eedd4f97 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table * return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags); } -static struct dst_entry *ip6_route_input_lookup(struct net *net, - struct net_device *dev, - struct flowi6 *fl6, int flags) +struct dst_entry *ip6_route_input_lookup(struct net *net, +struct net_device *dev, +struct flowi6 *fl6, int flags) { if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG) flags |= RT6_LOOKUP_F_IFACE; return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input); } +EXPORT_SYMBOL_GPL(ip6_route_input_lookup); void ip6_route_input(struct sk_buff *skb) { -- 2.8.0.rc3.226.g39d4020
[PATCH next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar

In a typical IPvlan L3 setup the master is in the default-ns and each slave is in a different (slave) ns. In this setup egress packet processing for traffic originating from a slave-ns will hit all NF_HOOKs in the slave-ns as well as the default-ns. However the same is not true for ingress processing; all these NF_HOOKs are hit only in the slave-ns, skipping them in the default-ns. IPvlan in L3 mode is restrictive, and if admins want to deploy iptables rules in the default-ns, this asymmetric data path makes it impossible to do so. This patch makes use of the l3_rcv() (added as part of the l3mdev enhancements) to perform an input route lookup on RX packets without changing skb->dev and then uses an nf_hook at NF_INET_LOCAL_IN to change skb->dev just before handing the skb over to L4. Signed-off-by: Mahesh Bandewar --- Documentation/networking/ipvlan.txt | 7 ++- drivers/net/Kconfig | 1 + drivers/net/ipvlan/ipvlan.h | 7 +++ drivers/net/ipvlan/ipvlan_core.c| 94 + drivers/net/ipvlan/ipvlan_main.c| 60 --- include/uapi/linux/if_link.h| 1 + 6 files changed, 162 insertions(+), 8 deletions(-) diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 14422f8fcdc4..58d3a946f66c 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module There are no module parameters for this driver and it can be configured using IProute2/ip utility. - ip link add link type ipvlan mode { l2 | L3 } + ip link add link type ipvlan mode { l2 | l3 | l3s } e.g. ip link add link ipvl0 eth0 type ipvlan mode l2 @@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves will not receive nor can send multicast / broadcast traffic.
+4.3 L3S mode: + This is very similar to the L3 mode except that iptables conn-tracking +works in this mode and that is why L3-symsetric (L3s) from iptables perspective. +This will have slightly less performance but that shouldn't matter since you +are choosing this mode over plain-L3 mode to make conn-tracking work. 5. What to choose (macvlan vs. ipvlan)? These two devices are very similar in many regards and the specific use diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 0c5415b05ea9..95edd1737ab5 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -149,6 +149,7 @@ config IPVLAN tristate "IP-VLAN support" depends on INET depends on IPV6 +select NET_L3_MASTER_DEV ---help--- This allows one to create virtual devices off of a main interface and packets will be delivered based on the dest L3 (IPv6/IPv4 addr) diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h index 695a5dc9ace3..68b270b59ba9 100644 --- a/drivers/net/ipvlan/ipvlan.h +++ b/drivers/net/ipvlan/ipvlan.h @@ -23,11 +23,13 @@ #include #include #include +#include #include #include #include #include #include +#include #define IPVLAN_DRV "ipvlan" #define IPV_DRV_VER"0.1" @@ -96,6 +98,7 @@ struct ipvl_port { struct work_struct wq; struct sk_buff_head backlog; int count; + boolipt_hook_added; struct rcu_head rcu; }; @@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan, const void *iaddr, bool is_v6); bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6); void ipvlan_ht_addr_del(struct ipvl_addr *addr); +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb, + u16 proto); +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb, +const struct nf_hook_state *state); #endif /* __IPVLAN_H */ diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c index b5f9511d819e..b4e990743e1d 100644 --- a/drivers/net/ipvlan/ipvlan_core.c +++ b/drivers/net/ipvlan/ipvlan_core.c @@ -560,6 
+560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) case IPVLAN_MODE_L2: return ipvlan_xmit_mode_l2(skb, dev); case IPVLAN_MODE_L3: + case IPVLAN_MODE_L3S: return ipvlan_xmit_mode_l3(skb, dev); } @@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb) return ipvlan_handle_mode_l2(pskb, port); case IPVLAN_MODE_L3: return ipvlan_handle_mode_l3(pskb, port); + case IPVLAN_MODE_L3S: + return RX_HANDLER_PASS; } /* Should not reach here */ @@
[PATCH next 0/3] ipvlan introduce l3s mode
From: Mahesh Bandewar

Same old problem with a new approach, incorporating suggestions from the earlier patch series. First, this is introduced as a new mode rather than modifying the old (L3) mode, so the behavior of the existing modes is preserved as-is and the new L3s mode obeys iptables so that the intended conn-tracking can work. To do this, the code uses the newly added l3mdev_rcv() handler and an iptables hook: l3mdev_rcv() to perform an inbound route lookup with the correct (IPvlan slave) interface, and then an iptables hook at LOCAL_INPUT to change the input device from master to slave to complete the formality. The supporting stack changes are trivial: exporting a symbol so that IPv6 gets the equivalent of the existing IPv4 code, and allowing the netfilter hook registration code to be called with RTNL held. Please look into the individual patches for details. Mahesh Bandewar (3): ipv6: Export p6_route_input_lookup symbol net: Add _nf_(un)register_hooks symbols ipvlan: Introduce l3s mode Documentation/networking/ipvlan.txt | 7 ++- drivers/net/Kconfig | 1 + drivers/net/ipvlan/ipvlan.h | 7 +++ drivers/net/ipvlan/ipvlan_core.c| 94 + drivers/net/ipvlan/ipvlan_main.c| 60 --- include/linux/netfilter.h | 2 + include/net/ip6_route.h | 3 ++ include/uapi/linux/if_link.h| 1 + net/ipv6/route.c| 7 +-- net/netfilter/core.c| 51 ++-- 10 files changed, 217 insertions(+), 16 deletions(-) -- 2.8.0.rc3.226.g39d4020
[PULL] virtio: fixes for 4.8
The following changes since commit 3eab887a55424fc2c27553b7bfe32330df83f7b8: Linux 4.8-rc4 (2016-08-28 15:04:33 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to 5e59d9a1aed26abcc79abe78af5cfd34e53cbe7f: virtio_console: Stop doing DMA on the stack (2016-09-09 21:12:45 +0300) virtio: fixes for 4.8 This includes a couple of bug fixes for virtio. The virtio console patch is actually also in x86/tip targeting 4.9 because it helps vmap stacks, but it also fixes IOMMU_PLATFORM which was added in 4.8, and it seems important not to ship that in a broken configuration. Signed-off-by: Michael S. Tsirkin

Andy Lutomirski (1): virtio_console: Stop doing DMA on the stack Baoyou Xie (1): virtio: mark vring_dma_dev() static drivers/char/virtio_console.c | 23 +++ drivers/virtio/virtio_ring.c | 2 +- 2 files changed, 16 insertions(+), 9 deletions(-)
[net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
e1000 supports a single TX queue so it is being shared with the stack when XDP runs XDP_TX action. This requires taking the xmit lock to ensure we don't corrupt the tx ring. To avoid taking and dropping the lock per packet this patch adds a bundling implementation to submit a bundle of packets to the xmit routine. I tested this patch running e1000 in a VM using KVM over a tap device using pktgen to generate traffic along with 'ping -f -l 100'. Suggested-by: Jesper Dangaard Brouer Signed-off-by: John Fastabend --- drivers/net/ethernet/intel/e1000/e1000.h | 10 +++ drivers/net/ethernet/intel/e1000/e1000_main.c | 81 +++-- 2 files changed, 71 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h index 5cf8a0a..877b377 100644 --- a/drivers/net/ethernet/intel/e1000/e1000.h +++ b/drivers/net/ethernet/intel/e1000/e1000.h @@ -133,6 +133,8 @@ struct e1000_adapter; #define E1000_TX_QUEUE_WAKE 16 /* How many Rx Buffers do we bundle into one write to the hardware ?
*/ #define E1000_RX_BUFFER_WRITE 16 /* Must be power of 2 */ +/* How many XDP XMIT buffers to bundle into one xmit transaction */ +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE #define AUTO_ALL_MODES 0 #define E1000_EEPROM_82544_APM 0x0004 @@ -168,6 +170,11 @@ struct e1000_rx_buffer { dma_addr_t dma; }; +struct e1000_rx_buffer_bundle { + struct e1000_rx_buffer *buffer; + u32 length; +}; + struct e1000_tx_ring { /* pointer to the descriptor ring memory */ void *desc; @@ -206,6 +213,9 @@ struct e1000_rx_ring { struct e1000_rx_buffer *buffer_info; struct sk_buff *rx_skb_top; + /* array of XDP buffer information structs */ + struct e1000_rx_buffer_bundle *xdp_buffer; + /* cpu for rx queue */ int cpu; diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index 91d5c87..b985271 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -1738,10 +1738,18 @@ static int e1000_setup_rx_resources(struct e1000_adapter *adapter, struct pci_dev *pdev = adapter->pdev; int size, desc_len; + size = sizeof(struct e1000_rx_buffer_bundle) * + E1000_XDP_XMIT_BUNDLE_MAX; + rxdr->xdp_buffer = vzalloc(size); + if (!rxdr->xdp_buffer) + return -ENOMEM; + size = sizeof(struct e1000_rx_buffer) * rxdr->count; rxdr->buffer_info = vzalloc(size); - if (!rxdr->buffer_info) + if (!rxdr->buffer_info) { + vfree(rxdr->xdp_buffer); return -ENOMEM; + } desc_len = sizeof(struct e1000_rx_desc); @@ -1754,6 +1762,7 @@ static int e1000_setup_rx_resources(struct e1000_adapter *adapter, GFP_KERNEL); if (!rxdr->desc) { setup_rx_desc_die: + vfree(rxdr->xdp_buffer); vfree(rxdr->buffer_info); return -ENOMEM; } @@ -2087,6 +2096,9 @@ static void e1000_free_rx_resources(struct e1000_adapter *adapter, e1000_clean_rx_ring(adapter, rx_ring); + vfree(rx_ring->xdp_buffer); + rx_ring->xdp_buffer = NULL; + vfree(rx_ring->buffer_info); rx_ring->buffer_info = NULL; @@ -3369,33 +3381,52 @@ static void 
e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring, } static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, -unsigned int len, -struct net_device *netdev, -struct e1000_adapter *adapter) +__u32 len, +struct e1000_adapter *adapter, +struct e1000_tx_ring *tx_ring) { - struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); - struct e1000_hw *hw = &adapter->hw; - struct e1000_tx_ring *tx_ring; - if (len > E1000_MAX_DATA_PER_TXD) return; + if (E1000_DESC_UNUSED(tx_ring) < 2) + return; + + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); +} + +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info, + struct net_device *netdev, + struct e1000_adapter *adapter) +{ + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); + struct e1000_tx_ring *tx_ring = adapter->tx_ring; + struct e1000_hw *hw = &adapter->hw; + int i = 0; + /* e1000 only support a single txq at the moment so the queue is being * shared with stack. To support this requires locking to ensure the * stack and XDP are not running at the
[net-next PATCH v2 1/2] e1000: add initial XDP support
From: Alexei StarovoitovThis patch adds initial support for XDP on e1000 driver. Note e1000 driver does not support page recycling in general which could be added as a further improvement. However XDP_DROP case will recycle. XDP_TX and XDP_PASS do not support recycling yet. I tested this patch running e1000 in a VM using KVM over a tap device. CC: William Tu Signed-off-by: Alexei Starovoitov Signed-off-by: John Fastabend --- drivers/net/ethernet/intel/e1000/e1000.h |2 drivers/net/ethernet/intel/e1000/e1000_main.c | 171 + 2 files changed, 170 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h index d7bdea7..5cf8a0a 100644 --- a/drivers/net/ethernet/intel/e1000/e1000.h +++ b/drivers/net/ethernet/intel/e1000/e1000.h @@ -150,6 +150,7 @@ struct e1000_adapter; */ struct e1000_tx_buffer { struct sk_buff *skb; + struct page *page; dma_addr_t dma; unsigned long time_stamp; u16 length; @@ -279,6 +280,7 @@ struct e1000_adapter { struct e1000_rx_ring *rx_ring, int cleaned_count); struct e1000_rx_ring *rx_ring; /* One per active queue */ + struct bpf_prog *prog; struct napi_struct napi; int num_tx_queues; diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index f42129d..91d5c87 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -32,6 +32,7 @@ #include #include #include +#include char e1000_driver_name[] = "e1000"; static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver"; @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev, return 0; } +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog) +{ + struct e1000_adapter *adapter = netdev_priv(netdev); + struct bpf_prog *old_prog; + + old_prog = xchg(>prog, prog); + if (old_prog) { + synchronize_net(); + bpf_prog_put(old_prog); + } + + if (netif_running(netdev)) + 
e1000_reinit_locked(adapter); + else + e1000_reset(adapter); + return 0; +} + +static bool e1000_xdp_attached(struct net_device *dev) +{ + struct e1000_adapter *priv = netdev_priv(dev); + + return !!priv->prog; +} + +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp) +{ + switch (xdp->command) { + case XDP_SETUP_PROG: + return e1000_xdp_set(dev, xdp->prog); + case XDP_QUERY_PROG: + xdp->prog_attached = e1000_xdp_attached(dev); + return 0; + default: + return -EINVAL; + } +} + static const struct net_device_ops e1000_netdev_ops = { .ndo_open = e1000_open, .ndo_stop = e1000_close, @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = { #endif .ndo_fix_features = e1000_fix_features, .ndo_set_features = e1000_set_features, + .ndo_xdp= e1000_xdp, }; /** @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev) e1000_down_and_stop(adapter); e1000_release_manageability(adapter); + if (adapter->prog) + bpf_prog_put(adapter->prog); + unregister_netdev(netdev); e1000_phy_hw_reset(hw); @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter) struct e1000_hw *hw = >hw; u32 rdlen, rctl, rxcsum; - if (adapter->netdev->mtu > ETH_DATA_LEN) { + if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) { rdlen = adapter->rx_ring[0].count * sizeof(struct e1000_rx_desc); adapter->clean_rx = e1000_clean_jumbo_rx_irq; @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter, dev_kfree_skb_any(buffer_info->skb); buffer_info->skb = NULL; } + if (buffer_info->page) { + put_page(buffer_info->page); + buffer_info->page = NULL; + } + buffer_info->time_stamp = 0; /* buffer_info must be completely set up in the transmit path */ } @@ -3298,6 +3346,63 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb, return NETDEV_TX_OK; } +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring, + struct e1000_rx_buffer *rx_buffer_info, + unsigned int len) +{ + struct e1000_tx_buffer 
*buffer_info;
+	unsigned int i = tx_ring->next_to_use;
+
+	buffer_info = &tx_ring->buffer_info[i];
+
[PATCH net-next] tcp: better use ooo_last_skb in tcp_data_queue_ofo()
From: Eric Dumazet

Willem noticed that we could avoid an rbtree lookup if the attempt to coalesce incoming skb to the last skb failed for some reason. Since most ooo additions are at the tail, this is definitely worth adding a test and fast path.

Suggested-by: Willem de Bruijn
Signed-off-by: Eric Dumazet
Cc: Yaogong Wang
Cc: Yuchung Cheng
Cc: Neal Cardwell
Cc: Ilpo Järvinen
---
 net/ipv4/tcp_input.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a5934c4c8cd4..2e26f3eb0293 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4461,6 +4461,12 @@ coalesce_done:
 		skb = NULL;
 		goto add_sack;
 	}
+	/* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
+	if (!before(seq, TCP_SKB_CB(tp->ooo_last_skb)->end_seq)) {
+		parent = &tp->ooo_last_skb->rbnode;
+		p = &parent->rb_right;
+		goto insert;
+	}

 	/* Find place to insert this segment. Handle overlaps on the way. */
 	parent = NULL;
@@ -4503,7 +4509,7 @@ coalesce_done:
 		}
 		p = &parent->rb_right;
 	}
-
+insert:
 	/* Insert segment into RB tree. */
 	rb_link_node(&skb->rbnode, parent, p);
 	rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
[PATCH] openvswitch: use alias for genetlink family names
When userspace tries to create datapaths and the module is not loaded, it will simply fail. With this patch, the module will be automatically loaded.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
 net/openvswitch/datapath.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 524c0fd..0536ab3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2437,3 +2437,7 @@ module_exit(dp_cleanup);
 MODULE_DESCRIPTION("Open vSwitch switching datapath");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS_GENL_FAMILY(OVS_DATAPATH_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
--
2.7.4
[PATCH 2/2] openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use it per CPU. When using megaflows, the stats lock can be a bottleneck in scalability. On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps when forwarding between two 40GbE ports with a single flow configured on the datapath. This has been tested on a system with possible CPUs 0-7,16-23. After module removal, there were no corruption on the slab cache. Signed-off-by: Thadeu Lima de Souza Cascardo--- net/openvswitch/flow.c | 43 +++ net/openvswitch/flow.h | 4 ++-- net/openvswitch/flow_table.c | 23 --- 3 files changed, 37 insertions(+), 33 deletions(-) diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 3609f37..2970a9f 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, { struct flow_stats *stats; int node = numa_node_id(); + int cpu = get_cpu(); int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0); - stats = rcu_dereference(flow->stats[node]); + stats = rcu_dereference(flow->stats[cpu]); - /* Check if already have node-specific stats. */ + /* Check if already have CPU-specific stats. */ if (likely(stats)) { spin_lock(>lock); /* Mark if we write on the pre-allocated stats. */ - if (node == 0 && unlikely(flow->stats_last_writer != node)) - flow->stats_last_writer = node; + if (cpu == 0 && unlikely(flow->stats_last_writer != cpu)) + flow->stats_last_writer = cpu; } else { stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */ spin_lock(>lock); - /* If the current NUMA-node is the only writer on the + /* If the current CPU is the only writer on the * pre-allocated stats keep using them. */ - if (unlikely(flow->stats_last_writer != node)) { + if (unlikely(flow->stats_last_writer != cpu)) { /* A previous locker may have already allocated the -* stats, so we need to check again. 
If node-specific +* stats, so we need to check again. If CPU-specific * stats were already allocated, we update the pre- * allocated stats as we have already locked them. */ - if (likely(flow->stats_last_writer != NUMA_NO_NODE) - && likely(!rcu_access_pointer(flow->stats[node]))) { - /* Try to allocate node-specific stats. */ + if (likely(flow->stats_last_writer != -1) && + likely(!rcu_access_pointer(flow->stats[cpu]))) { + /* Try to allocate CPU-specific stats. */ struct flow_stats *new_stats; new_stats = @@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, new_stats->tcp_flags = tcp_flags; spin_lock_init(_stats->lock); - rcu_assign_pointer(flow->stats[node], + rcu_assign_pointer(flow->stats[cpu], new_stats); goto unlock; } } - flow->stats_last_writer = node; + flow->stats_last_writer = cpu; } } @@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, stats->tcp_flags |= tcp_flags; unlock: spin_unlock(>lock); + put_cpu(); } /* Must be called with rcu_read_lock or ovs_mutex. */ @@ -136,15 +139,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow, struct ovs_flow_stats *ovs_stats, unsigned long *used, __be16 *tcp_flags) { - int node; + int cpu; *used = 0; *tcp_flags = 0; memset(ovs_stats, 0, sizeof(*ovs_stats)); - /* We open code this to make sure node 0 is always considered */ - for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) { - struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]); + /* We open code this to make sure cpu 0 is always considered */ + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) { +
[PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible
On a system with only node 1 as possible, all statistics are going to be accounted on node 0 as it will have a single writer. However, when getting and clearing the statistics, node 0 is not going to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1 possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
I am providing this intermediate patch, that will be thrown out by the next one, in case there is any need to backport this fix.
---
 net/openvswitch/flow.c       | 6 ++++--
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..3609f37 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));

-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);

 		if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
 	int node;

-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = ovsl_dereference(flow->stats[node]);

 		if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
 	kfree(flow->id.unmasked_key);
 	if (flow->sf_acts)
 		ovs_nla_free_flow_actions((struct sw_flow_actions __force *)flow->sf_acts);
-	for_each_node(node)
-		if (flow->stats[node])
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map))
+		if (node != 0 && flow->stats[node])
 			kmem_cache_free(flow_stats_cache,
 					(struct flow_stats __force *)flow->stats[node]);
 	kmem_cache_free(flow_cache, flow);
--
2.7.4
[PATCH net-next 0/5] liquidio CN23XX VF support
Dave, Following is the initial patch series for adding support of VF functionality on CN23XX devices. Please apply patches in the following order as some of the patches depend on earlier patches. Raghu Vatsavayi (5): liquidio CN23XX: VF config support liquidio CN23XX: sriov enable liquidio CN23XX: Mailbox support liquidio CN23XX: mailbox interrupt processing liquidio CN23XX: VF related operations drivers/net/ethernet/cavium/liquidio/Makefile | 1 + .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 700 +++-- .../ethernet/cavium/liquidio/cn23xx_pf_device.h| 3 + .../net/ethernet/cavium/liquidio/cn66xx_device.c | 13 +- .../net/ethernet/cavium/liquidio/cn68xx_device.c | 13 +- drivers/net/ethernet/cavium/liquidio/lio_core.c| 32 + drivers/net/ethernet/cavium/liquidio/lio_main.c| 366 +-- .../net/ethernet/cavium/liquidio/liquidio_common.h | 11 +- .../net/ethernet/cavium/liquidio/octeon_config.h | 8 + .../net/ethernet/cavium/liquidio/octeon_console.c | 16 +- .../net/ethernet/cavium/liquidio/octeon_device.c | 11 +- .../net/ethernet/cavium/liquidio/octeon_device.h | 32 +- drivers/net/ethernet/cavium/liquidio/octeon_droq.c | 28 +- .../net/ethernet/cavium/liquidio/octeon_mailbox.c | 322 ++ .../net/ethernet/cavium/liquidio/octeon_mailbox.h | 116 drivers/net/ethernet/cavium/liquidio/octeon_main.h | 12 +- .../net/ethernet/cavium/liquidio/request_manager.c | 9 +- 17 files changed, 1409 insertions(+), 284 deletions(-) create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h -- 1.8.3.1
[PATCH net-next 1/5] liquidio CN23XX: VF config support
Adds support for VF configuration. It also limits the number of rings per VF based on total number of VFs configured. Signed-off-by: Derek ChicklesSigned-off-by: Satanand Burla Signed-off-by: Felix Manlunas Signed-off-by: Raghu Vatsavayi --- .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 260 - .../net/ethernet/cavium/liquidio/cn66xx_device.c | 13 +- .../net/ethernet/cavium/liquidio/octeon_config.h | 5 + .../net/ethernet/cavium/liquidio/octeon_device.c | 10 +- .../net/ethernet/cavium/liquidio/octeon_device.h | 9 +- 5 files changed, 228 insertions(+), 69 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c index bddb198..a2953d5 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -312,11 +312,12 @@ static void cn23xx_setup_global_mac_regs(struct octeon_device *oct) u64 reg_val; u16 mac_no = oct->pcie_port; u16 pf_num = oct->pf_num; + u64 temp; /* programming SRN and TRS for each MAC(0..3) */ - dev_dbg(>pci_dev->dev, "%s:Using pcie port %d\n", - __func__, mac_no); + pr_devel("%s:Using pcie port %d\n", +__func__, mac_no); /* By default, mapping all 64 IOQs to a single MACs */ reg_val = @@ -333,13 +334,21 @@ static void cn23xx_setup_global_mac_regs(struct octeon_device *oct) /* setting TRS <23:16> */ reg_val = reg_val | (oct->sriov_info.trs << CN23XX_PKT_MAC_CTL_RINFO_TRS_BIT_POS); + /* setting RPVF <39:32> */ + temp = oct->sriov_info.rings_per_vf & 0xff; + reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_RPVF_BIT_POS); + + /* setting NVFS <55:48> */ + temp = oct->sriov_info.num_vfs & 0xff; + reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_NVFS_BIT_POS); + /* write these settings to MAC register */ octeon_write_csr64(oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num), reg_val); - dev_dbg(>pci_dev->dev, "SLI_PKT_MAC(%d)_PF(%d)_RINFO : 0x%016llx\n", - mac_no, pf_num, (u64)octeon_read_csr64 - (oct, 
CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num))); + pr_devel("SLI_PKT_MAC(%d)_PF(%d)_RINFO : 0x%016llx\n", +mac_no, pf_num, (u64)octeon_read_csr64 +(oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num))); } static int cn23xx_reset_io_queues(struct octeon_device *oct) @@ -404,6 +413,7 @@ static int cn23xx_pf_setup_global_input_regs(struct octeon_device *oct) u64 intr_threshold, reg_val; struct octeon_instr_queue *iq; struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip; + u64 vf_num; pf_num = oct->pf_num; @@ -420,6 +430,16 @@ static int cn23xx_pf_setup_global_input_regs(struct octeon_device *oct) */ for (q_no = 0; q_no < ern; q_no++) { reg_val = oct->pcie_port << CN23XX_PKT_INPUT_CTL_MAC_NUM_POS; + + /* for VF assigned queues. */ + if (q_no < oct->sriov_info.pf_srn) { + vf_num = q_no / oct->sriov_info.rings_per_vf; + vf_num += 1; /* VF1, VF2, */ + } else { + vf_num = 0; + } + + reg_val |= vf_num << CN23XX_PKT_INPUT_CTL_VF_NUM_POS; reg_val |= pf_num << CN23XX_PKT_INPUT_CTL_PF_NUM_POS; octeon_write_csr64(oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no), @@ -590,8 +610,8 @@ static void cn23xx_setup_iq_regs(struct octeon_device *oct, u32 iq_no) (u8 *)oct->mmio[0].hw_addr + CN23XX_SLI_IQ_DOORBELL(iq_no); iq->inst_cnt_reg = (u8 *)oct->mmio[0].hw_addr + CN23XX_SLI_IQ_INSTR_COUNT64(iq_no); - dev_dbg(>pci_dev->dev, "InstQ[%d]:dbell reg @ 0x%p instcnt_reg @ 0x%p\n", - iq_no, iq->doorbell_reg, iq->inst_cnt_reg); + pr_devel("InstQ[%d]:dbell reg @ 0x%p instcnt_reg @ 0x%p\n", +iq_no, iq->doorbell_reg, iq->inst_cnt_reg); /* Store the current instruction counter (used in flush_iq * calculation) @@ -822,7 +842,7 @@ static u64 cn23xx_pf_msix_interrupt_handler(void *dev) u64 ret = 0; struct octeon_droq *droq = oct->droq[ioq_vector->droq_index]; - dev_dbg(>pci_dev->dev, "In %s octeon_dev @ %p\n", __func__, oct); + pr_devel("In %s octeon_dev @ %p\n", __func__, oct); if (!droq) { dev_err(>pci_dev->dev, "23XX bringup FIXME: oct pfnum:%d ioq_vector->ioq_num :%d droq is NULL\n", @@ -862,7 
+882,7 @@ static irqreturn_t cn23xx_interrupt_handler(void *dev) struct octeon_cn23xx_pf *cn23xx = (struct
Re: [PATCH] net: ip, diag -- Add diag interface for raw sockets
On Fri, Sep 09, 2016 at 12:55:13PM -0700, Eric Dumazet wrote:
> > +
> > +	rep = nlmsg_new(sizeof(struct inet_diag_msg) +
> > +			sizeof(struct inet_diag_meminfo) + 64,
> > +			GFP_KERNEL);
> > +	if (!rep)
>
> There is a missing sock_put(sk)
>
> > +		return -ENOMEM;
> > +
> > +	err = inet_sk_diag_fill(sk, NULL, rep, r,
> > +				sk_user_ns(NETLINK_CB(in_skb).sk),
> > +				NETLINK_CB(in_skb).portid,
> > +				nlh->nlmsg_seq, 0, nlh);
>
> sock_put(sk);

Oh, missed. Thanks a lot, Eric, will update!
[PATCH net-next 4/5] liquidio CN23XX: mailbox interrupt processing
Adds support for mailbox interrupt processing of various commands. Signed-off-by: Derek ChicklesSigned-off-by: Satanand Burla Signed-off-by: Felix Manlunas Signed-off-by: Raghu Vatsavayi --- .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 157 + drivers/net/ethernet/cavium/liquidio/lio_main.c| 8 +- .../net/ethernet/cavium/liquidio/octeon_device.c | 1 + .../net/ethernet/cavium/liquidio/octeon_device.h | 6 + drivers/net/ethernet/cavium/liquidio/octeon_droq.c | 28 ++-- 5 files changed, 184 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c index b3c61302..4d975d8 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -30,6 +30,7 @@ #include "octeon_device.h" #include "cn23xx_pf_device.h" #include "octeon_main.h" +#include "octeon_mailbox.h" #define RESET_NOTDONE 0 #define RESET_DONE 1 @@ -682,6 +683,118 @@ static void cn23xx_setup_oq_regs(struct octeon_device *oct, u32 oq_no) } } +static void cn23xx_pf_mbox_thread(struct work_struct *work) +{ + struct cavium_wk *wk = (struct cavium_wk *)work; + struct octeon_mbox *mbox = (struct octeon_mbox *)wk->ctxptr; + struct octeon_device *oct = mbox->oct_dev; + u64 mbox_int_val, val64; + u32 q_no, i; + + if (oct->rev_id < OCTEON_CN23XX_REV_1_1) { + /*read and clear by writing 1*/ + mbox_int_val = readq(mbox->mbox_int_reg); + writeq(mbox_int_val, mbox->mbox_int_reg); + + for (i = 0; i < oct->sriov_info.num_vfs; i++) { + q_no = i * oct->sriov_info.rings_per_vf; + + val64 = readq(oct->mbox[q_no]->mbox_write_reg); + + if (val64 && (val64 != OCTEON_PFVFACK)) { + if (octeon_mbox_read(oct->mbox[q_no])) + octeon_mbox_process_message( + oct->mbox[q_no]); + } + } + + schedule_delayed_work(>work, msecs_to_jiffies(10)); + } else { + octeon_mbox_process_message(mbox); + } +} + +static int cn23xx_setup_pf_mbox(struct octeon_device *oct) +{ + u32 q_no, i; + u16 
mac_no = oct->pcie_port; + u16 pf_num = oct->pf_num; + struct octeon_mbox *mbox = NULL; + + if (!oct->sriov_info.num_vfs) + return 0; + + for (i = 0; i < oct->sriov_info.num_vfs; i++) { + q_no = i * oct->sriov_info.rings_per_vf; + + mbox = vmalloc(sizeof(*mbox)); + if (!mbox) + goto free_mbox; + + memset(mbox, 0, sizeof(struct octeon_mbox)); + + spin_lock_init(>lock); + + mbox->oct_dev = oct; + + mbox->q_no = q_no; + + mbox->state = OCTEON_MBOX_STATE_IDLE; + + /* PF mbox interrupt reg */ + mbox->mbox_int_reg = (u8 *)oct->mmio[0].hw_addr + +CN23XX_SLI_MAC_PF_MBOX_INT(mac_no, pf_num); + + /* PF writes into SIG0 reg */ + mbox->mbox_write_reg = (u8 *)oct->mmio[0].hw_addr + + CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 0); + + /* PF reads from SIG1 reg */ + mbox->mbox_read_reg = (u8 *)oct->mmio[0].hw_addr + + CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 1); + + /*Mail Box Thread creation*/ + INIT_DELAYED_WORK(>mbox_poll_wk.work, + cn23xx_pf_mbox_thread); + mbox->mbox_poll_wk.ctxptr = (void *)mbox; + + oct->mbox[q_no] = mbox; + + writeq(OCTEON_PFVFSIG, mbox->mbox_read_reg); + } + + if (oct->rev_id < OCTEON_CN23XX_REV_1_1) + schedule_delayed_work(>mbox[0]->mbox_poll_wk.work, + msecs_to_jiffies(0)); + + return 0; + +free_mbox: + while (i) { + i--; + vfree(oct->mbox[i]); + } + + return 1; +} + +static int cn23xx_free_pf_mbox(struct octeon_device *oct) +{ + u32 q_no, i; + + if (!oct->sriov_info.num_vfs) + return 0; + + for (i = 0; i < oct->sriov_info.num_vfs; i++) { + q_no = i * oct->sriov_info.rings_per_vf; + cancel_delayed_work_sync( + >mbox[q_no]->mbox_poll_wk.work); + vfree(oct->mbox[q_no]); + } + + return 0; +} + static int cn23xx_enable_io_queues(struct octeon_device *oct) { u64 reg_val; @@ -876,6 +989,29 @@ static u64
[PATCH net-next 5/5] liquidio CN23XX: VF related operations
Adds support for VF related operations like mac address vlan and link changes. Signed-off-by: Derek ChicklesSigned-off-by: Satanand Burla Signed-off-by: Felix Manlunas Signed-off-by: Raghu Vatsavayi --- .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 22 +++ .../ethernet/cavium/liquidio/cn23xx_pf_device.h| 3 + drivers/net/ethernet/cavium/liquidio/lio_main.c| 211 + .../net/ethernet/cavium/liquidio/liquidio_common.h | 5 + .../net/ethernet/cavium/liquidio/octeon_device.h | 8 + 5 files changed, 249 insertions(+) diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c index 4d975d8..49efce1 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "liquidio_common.h" #include "octeon_droq.h" #include "octeon_iq.h" @@ -1541,3 +1542,24 @@ int cn23xx_fw_loaded(struct octeon_device *oct) val = octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1); return (val >> 1) & 1ULL; } + +void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx, + u8 *mac) +{ + if (oct->sriov_info.vf_drv_loaded_mask & BIT_ULL(vfidx)) { + struct octeon_mbox_cmd mbox_cmd; + + mbox_cmd.msg.u64 = 0; + mbox_cmd.msg.s.type = OCTEON_MBOX_REQUEST; + mbox_cmd.msg.s.resp_needed = 0; + mbox_cmd.msg.s.cmd = OCTEON_PF_CHANGED_VF_MACADDR; + mbox_cmd.msg.s.len = 1; + mbox_cmd.recv_len = 0; + mbox_cmd.recv_status = 0; + mbox_cmd.fn = NULL; + mbox_cmd.fn_arg = 0; + ether_addr_copy(mbox_cmd.msg.s.params, mac); + mbox_cmd.q_no = vfidx * oct->sriov_info.rings_per_vf; + octeon_mbox_write(oct, _cmd); + } +} diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h index 21b5c90..20a9dc5 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h @@ -56,4 +56,7 @@ u32 
cn23xx_pf_get_oq_ticks(struct octeon_device *oct, u32 time_intr_in_us); void cn23xx_dump_pf_initialized_regs(struct octeon_device *oct); int cn23xx_fw_loaded(struct octeon_device *oct); + +void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx, + u8 *mac); #endif diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c index e480c23..3b92036 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c @@ -3592,6 +3592,148 @@ static void liquidio_del_vxlan_port(struct net_device *netdev, OCTNET_CMD_VXLAN_PORT_DEL); } +static int __liquidio_set_vf_mac(struct net_device *netdev, int vfidx, +u8 *mac, bool is_admin_assigned) +{ + struct lio *lio = GET_LIO(netdev); + struct octeon_device *oct = lio->oct_dev; + struct octnic_ctrl_pkt nctrl; + + if (!is_valid_ether_addr(mac)) + return -EINVAL; + + if (vfidx < 0 || vfidx >= oct->sriov_info.num_vfs) + return -EINVAL; + + memset(, 0, sizeof(struct octnic_ctrl_pkt)); + + nctrl.ncmd.u64 = 0; + nctrl.ncmd.s.cmd = OCTNET_CMD_CHANGE_MACADDR; + /* vfidx is 0 based, but vf_num (param1) is 1 based */ + nctrl.ncmd.s.param1 = vfidx + 1; + nctrl.ncmd.s.param2 = (is_admin_assigned ? 1 : 0); + nctrl.ncmd.s.more = 1; + nctrl.iq_no = lio->linfo.txpciq[0].s.q_no; + nctrl.cb_fn = 0; + nctrl.wait_time = 100; + + nctrl.udd[0] = 0; + /* The MAC Address is presented in network byte order. 
*/
+	ether_addr_copy((u8 *)&nctrl.udd[0] + 2, mac);
+
+	oct->sriov_info.vf_macaddr[vfidx] = nctrl.udd[0];
+
+	octnet_send_nic_ctrl_pkt(oct, &nctrl);
+
+	return 0;
+}
+
+static int liquidio_set_vf_mac(struct net_device *netdev, int vfidx, u8 *mac)
+{
+	struct lio *lio = GET_LIO(netdev);
+	struct octeon_device *oct = lio->oct_dev;
+	int retval;
+
+	retval = __liquidio_set_vf_mac(netdev, vfidx, mac, true);
+	if (!retval)
+		cn23xx_tell_vf_its_macaddr_changed(oct, vfidx, mac);
+
+	return retval;
+}
+
+static int liquidio_set_vf_vlan(struct net_device *netdev, int vfidx,
+				u16 vlan, u8 qos)
+{
+	struct lio *lio = GET_LIO(netdev);
+	struct octeon_device *oct = lio->oct_dev;
+	struct octnic_ctrl_pkt nctrl;
+	u16 vlantci;
+
[PATCH net-next 3/5] liquidio CN23XX: Mailbox support
Adds support for mailbox communication between PF and VF. Signed-off-by: Derek ChicklesSigned-off-by: Satanand Burla Signed-off-by: Felix Manlunas Signed-off-by: Raghu Vatsavayi --- drivers/net/ethernet/cavium/liquidio/Makefile | 1 + .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 4 +- .../net/ethernet/cavium/liquidio/cn68xx_device.c | 13 +- drivers/net/ethernet/cavium/liquidio/lio_core.c| 32 ++ .../net/ethernet/cavium/liquidio/liquidio_common.h | 6 +- .../net/ethernet/cavium/liquidio/octeon_console.c | 16 +- .../net/ethernet/cavium/liquidio/octeon_device.h | 4 + .../net/ethernet/cavium/liquidio/octeon_mailbox.c | 322 + .../net/ethernet/cavium/liquidio/octeon_mailbox.h | 116 drivers/net/ethernet/cavium/liquidio/octeon_main.h | 12 +- .../net/ethernet/cavium/liquidio/request_manager.c | 9 +- 11 files changed, 507 insertions(+), 28 deletions(-) create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h diff --git a/drivers/net/ethernet/cavium/liquidio/Makefile b/drivers/net/ethernet/cavium/liquidio/Makefile index 5a27b2a..14958de 100644 --- a/drivers/net/ethernet/cavium/liquidio/Makefile +++ b/drivers/net/ethernet/cavium/liquidio/Makefile @@ -11,6 +11,7 @@ liquidio-$(CONFIG_LIQUIDIO) += lio_ethtool.o \ cn66xx_device.o\ cn68xx_device.o\ cn23xx_pf_device.o \ + octeon_mailbox.o \ octeon_mem_ops.o \ octeon_droq.o \ octeon_nic.o diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c index deec869..b3c61302 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -270,8 +270,8 @@ static void cn23xx_enable_error_reporting(struct octeon_device *oct) regval |= 0xf; /* Enable Link error reporting */ - dev_dbg(>pci_dev->dev, "OCTEON[%d]: Enabling PCI-E error reporting..\n", - oct->octeon_id); + pr_devel("OCTEON[%d]: Enabling PCI-E error 
reporting..\n", +oct->octeon_id); pci_write_config_dword(oct->pci_dev, CN23XX_CONFIG_PCIE_DEVCTL, regval); } diff --git a/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c b/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c index dbf3566..424125e 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c @@ -19,6 +19,7 @@ * This file may also be available under a different license from Cavium. * Contact Cavium, Inc. for more information **/ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include #include "liquidio_common.h" @@ -37,8 +38,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct) u32 fifo_sizes[6] = { 3, 3, 1, 1, 1, 8 }; lio_pci_writeq(oct, CN6XXX_DPI_DMA_CTL_MASK, CN6XXX_DPI_DMA_CONTROL); - dev_dbg(>pci_dev->dev, "DPI_DMA_CONTROL: 0x%016llx\n", - lio_pci_readq(oct, CN6XXX_DPI_DMA_CONTROL)); + pr_devel("DPI_DMA_CONTROL: 0x%016llx\n", +lio_pci_readq(oct, CN6XXX_DPI_DMA_CONTROL)); for (i = 0; i < 6; i++) { /* Prevent service of instruction queue for all DMA engines @@ -47,8 +48,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct) */ lio_pci_writeq(oct, 0, CN6XXX_DPI_DMA_ENG_ENB(i)); lio_pci_writeq(oct, fifo_sizes[i], CN6XXX_DPI_DMA_ENG_BUF(i)); - dev_dbg(>pci_dev->dev, "DPI_ENG_BUF%d: 0x%016llx\n", i, - lio_pci_readq(oct, CN6XXX_DPI_DMA_ENG_BUF(i))); + pr_devel("DPI_ENG_BUF%d: 0x%016llx\n", i, +lio_pci_readq(oct, CN6XXX_DPI_DMA_ENG_BUF(i))); } /* DPI_SLI_PRT_CFG has MPS and MRRS settings that will be set @@ -56,8 +57,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct) */ lio_pci_writeq(oct, 1, CN6XXX_DPI_CTL); - dev_dbg(>pci_dev->dev, "DPI_CTL: 0x%016llx\n", - lio_pci_readq(oct, CN6XXX_DPI_CTL)); + pr_devel("DPI_CTL: 0x%016llx\n", +lio_pci_readq(oct, CN6XXX_DPI_CTL)); } static int lio_cn68xx_soft_reset(struct octeon_device *oct) diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c index 201eddb..4626b1f 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_core.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c @@ -264,3 +264,35 @@
[PATCH net-next 2/5] liquidio CN23XX: sriov enable
Adds support for enabling sriov on CN23XX cards. Signed-off-by: Derek Chickles Signed-off-by: Satanand Burla Signed-off-by: Felix Manlunas Signed-off-by: Raghu Vatsavayi --- .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 257 +++-- drivers/net/ethernet/cavium/liquidio/lio_main.c| 147 .../net/ethernet/cavium/liquidio/octeon_config.h | 3 + .../net/ethernet/cavium/liquidio/octeon_device.h | 5 + 4 files changed, 241 insertions(+), 171 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c index a2953d5..deec869 100644 --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -19,7 +19,7 @@ * This file may also be available under a different license from Cavium. * Contact Cavium, Inc. for more information **/ - +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include #include @@ -52,174 +52,174 @@ void cn23xx_dump_pf_initialized_regs(struct octeon_device *oct) struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip; /*In cn23xx_soft_reset*/ - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%llx\n", - "CN23XX_WIN_WR_MASK_REG", CVM_CAST64(CN23XX_WIN_WR_MASK_REG), - CVM_CAST64(octeon_read_csr64(oct, CN23XX_WIN_WR_MASK_REG))); - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", - "CN23XX_SLI_SCRATCH1", CVM_CAST64(CN23XX_SLI_SCRATCH1), - CVM_CAST64(octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1))); - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", - "CN23XX_RST_SOFT_RST", CN23XX_RST_SOFT_RST, - lio_pci_readq(oct, CN23XX_RST_SOFT_RST)); + pr_devel("%s[%llx] : 0x%llx\n", +"CN23XX_WIN_WR_MASK_REG", CVM_CAST64(CN23XX_WIN_WR_MASK_REG), +CVM_CAST64(octeon_read_csr64(oct, CN23XX_WIN_WR_MASK_REG))); + pr_devel("%s[%llx] : 0x%016llx\n", +"CN23XX_SLI_SCRATCH1", CVM_CAST64(CN23XX_SLI_SCRATCH1), +CVM_CAST64(octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1))); + pr_devel("%s[%llx] : 0x%016llx\n", +"CN23XX_RST_SOFT_RST", CN23XX_RST_SOFT_RST, +lio_pci_readq(oct, CN23XX_RST_SOFT_RST)); /*In cn23xx_set_dpi_regs*/ - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", - "CN23XX_DPI_DMA_CONTROL", CN23XX_DPI_DMA_CONTROL, - lio_pci_readq(oct, CN23XX_DPI_DMA_CONTROL)); + pr_devel("%s[%llx] : 0x%016llx\n", +"CN23XX_DPI_DMA_CONTROL", CN23XX_DPI_DMA_CONTROL, +lio_pci_readq(oct, CN23XX_DPI_DMA_CONTROL)); for (i = 0; i < 6; i++) { - dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n", - "CN23XX_DPI_DMA_ENG_ENB", i, - CN23XX_DPI_DMA_ENG_ENB(i), - lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_ENB(i))); - dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n", - "CN23XX_DPI_DMA_ENG_BUF", i, - CN23XX_DPI_DMA_ENG_BUF(i), - lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_BUF(i))); + pr_devel("%s(%d)[%llx] : 0x%016llx\n", +"CN23XX_DPI_DMA_ENG_ENB", i, +CN23XX_DPI_DMA_ENG_ENB(i), +lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_ENB(i))); + pr_devel("%s(%d)[%llx] : 0x%016llx\n", +"CN23XX_DPI_DMA_ENG_BUF", i, +CN23XX_DPI_DMA_ENG_BUF(i), +lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_BUF(i))); } - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", "CN23XX_DPI_CTL", - CN23XX_DPI_CTL, lio_pci_readq(oct, CN23XX_DPI_CTL)); + pr_devel("%s[%llx] : 0x%016llx\n", "CN23XX_DPI_CTL", +CN23XX_DPI_CTL, lio_pci_readq(oct, CN23XX_DPI_CTL)); /*In cn23xx_setup_pcie_mps and cn23xx_setup_pcie_mrrs */ pci_read_config_dword(oct->pci_dev, CN23XX_CONFIG_PCIE_DEVCTL, &regval); - dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", - "CN23XX_CONFIG_PCIE_DEVCTL", - CVM_CAST64(CN23XX_CONFIG_PCIE_DEVCTL), CVM_CAST64(regval)); + pr_devel("%s[%llx] : 0x%016llx\n", +"CN23XX_CONFIG_PCIE_DEVCTL", +CVM_CAST64(CN23XX_CONFIG_PCIE_DEVCTL), CVM_CAST64(regval)); - dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n", - "CN23XX_DPI_SLI_PRTX_CFG", oct->pcie_port, - CN23XX_DPI_SLI_PRTX_CFG(oct->pcie_port), - lio_pci_readq(oct, CN23XX_DPI_SLI_PRTX_CFG(oct->pcie_port))); +
Re: [PATCH] net: ip, diag -- Add diag interface for raw sockets
On Fri, 2016-09-09 at 21:26 +0300, Cyrill Gorcunov wrote: ... > +static int raw_diag_dump_one(struct sk_buff *in_skb, > + const struct nlmsghdr *nlh, > + const struct inet_diag_req_v2 *r) > +{ > + struct raw_hashinfo *hashinfo = raw_get_hashinfo(r); > + struct net *net = sock_net(in_skb->sk); > + struct sock *sk = NULL, *s; > + int err = -ENOENT, slot; > + struct sk_buff *rep; > + > + if (IS_ERR(hashinfo)) > + return PTR_ERR(hashinfo); > + > + read_lock(&hashinfo->lock); > + for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) { > + sk_for_each(s, &hashinfo->ht[slot]) { > + sk = raw_lookup(net, s, r); > + if (sk) > + break; > + } > + } > + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt)) > + sk = NULL; > + read_unlock(&hashinfo->lock); > + if (!sk) > + return -ENOENT; > + > + rep = nlmsg_new(sizeof(struct inet_diag_msg) + > + sizeof(struct inet_diag_meminfo) + 64, > + GFP_KERNEL); > + if (!rep) There is a missing sock_put(sk) > + return -ENOMEM; > + > + err = inet_sk_diag_fill(sk, NULL, rep, r, > + sk_user_ns(NETLINK_CB(in_skb).sk), > + NETLINK_CB(in_skb).portid, > + nlh->nlmsg_seq, 0, nlh); sock_put(sk); > + if (err < 0) { > + kfree_skb(rep); > + return err; > + } > + > + err = netlink_unicast(net->diag_nlsk, rep, > + NETLINK_CB(in_skb).portid, > + MSG_DONTWAIT); > + if (err > 0) > + err = 0; > + return err; > +} > +
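The fix Eric is pointing at is a general pattern: once a reference is taken with atomic_inc_not_zero(), every exit path, including the allocation-failure path, must drop it. A minimal userspace sketch of the pattern (the names here are illustrative stand-ins, not the kernel API):

```c
#include <assert.h>

/* Stand-ins for sock refcounting; illustrative only. */
struct ref { int cnt; };

int ref_get(struct ref *r)		/* like atomic_inc_not_zero() */
{
	if (r->cnt == 0)
		return 0;
	r->cnt++;
	return 1;
}

void ref_put(struct ref *r)		/* like sock_put() */
{
	r->cnt--;
}

/* Models raw_diag_dump_one(): the ref taken by ref_get() is dropped
 * on every path, including the -ENOMEM path the review flagged. */
int dump_one(struct ref *r, int alloc_fails)
{
	if (!ref_get(r))
		return -2;		/* -ENOENT stand-in */
	if (alloc_fails) {
		ref_put(r);		/* the sock_put() that was missing */
		return -12;		/* -ENOMEM stand-in */
	}
	ref_put(r);			/* done with the socket */
	return 0;
}
```

Whatever the return value, the refcount ends where it started; leaking it on the error path would pin the socket forever.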
Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer wrote: > On Thu, 8 Sep 2016 20:22:04 -0700 > Alexei Starovoitov wrote: > >> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: >> > >> > I'm sorry but I have a problem with this patch! >> >> is it because the variable is called 'xdp_doorbell'? >> Frankly I see nothing scary in this patch. >> It extends existing code by adding a flag to ring doorbell or not. >> The end of rx napi is used as an obvious heuristic to flush the pipe. >> Looks pretty generic to me. >> The same code can be used for non-xdp as well once we figure out >> good algorithm for xmit_more in the stack. > > What I'm proposing can also be used by the normal stack. > >> > Looking at this patch, I want to bring up a fundamental architectural >> > concern with the development direction of XDP transmit. >> > >> > >> > What you are trying to implement, with delaying the doorbell, is >> > basically TX bulking for TX_XDP. >> > >> > Why not implement a TX bulking interface directly instead?!? >> > >> > Yes, the tailptr/doorbell is the most costly operation, but why not >> > also take advantage of the benefits of bulking for other parts of the >> > code? (benefit is smaller, but every cycle counts in this area) >> > >> > This whole XDP exercise is about avoiding having a transaction cost per >> > packet, that reads "bulking" or "bundling" of packets, where possible. >> > >> > Lets do bundling/bulking from the start! >> >> mlx4 already does bulking and this proposed mlx5 set of patches >> does bulking as well. >> See nothing wrong about it. RX side processes the packets and >> when it's done it tells TX to xmit whatever it collected. > > This is doing "hidden" bulking and not really taking advantage of using > the icache more efficiently. > > Let me explain the problem I see a little more clearly then, so you > hopefully see where I'm going. > > Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, then you flush your icache. When > returning to the driver code, you will have to reload all the icache > associated with the XDP_TX, this is a costly operation. > > >> > The reason behind the xmit_more API is that we could not change the >> > API of all the drivers. And we found that calling an explicit NDO >> > flush came at a cost (only approx 7 ns IIRC), but it is still a cost that >> > would hit the common single packet use-case. >> > >> > It should be really easy to build a bundle of packets that need XDP_TX >> > action, especially given you only have a single destination "port". >> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. >> >> not sure what are you proposing here? >> Sounds like you want to extend it to multi port in the future? >> Sure. The proposed code is easily extendable. >> >> Or you want to see something like a link list of packets >> or an array of packets that RX side is preparing and then >> send the whole array/list to TX port? >> I don't think that would be efficient, since it would mean >> unnecessary copy of pointers. > > I just explain it will be more efficient due to better use of icache. > > >> > In the future, XDP need to support XDP_FWD forwarding of packets/pages >> > out other interfaces. I also want bulk transmit from day-1 here. It >> > is slightly more tricky to sort packets for multiple outgoing >> > interfaces efficiently in the poll loop. >> >> I don't think so. Multi port is natural extension to this set of patches. >> With multi port the end of RX will tell multiple ports (that were >> used to tx) to ring the bell. Pretty trivial and doesn't involve any >> extra arrays or link lists. > > So, have you solved the problem of exclusive access to a TX ring of a > remote/different net_device when sending? > > In your design you assume there exist many TX rings available for other > devices to access.
In my design I also want to support devices that > don't have this HW capability, and e.g. only have one TX queue. > Right, but segregating TX queues used by the stack from those used by XDP is pretty fundamental to the design. If we start mixing them, then we need to pull in several features (such as BQL, which seems like what you're proposing) into the XDP path. If this starts to slow things down, or we need to reinvent a bunch of existing features to not use skbuffs, that seems to run contrary to the "simple as possible" model for XDP -- may as well use the regular stack at that point maybe... Tom > >> > But the mSwitch[1] article actually already solved this destination >> > sorting. Please read [1] section 3.3 "Switch Fabric Algorithm" for >> > understanding the next steps, for a smarter data structure, when >> > starting to have more TX "ports". And perhaps align your single >> > XDP_TX destination data structure to this future development. >> > >> > [1]
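The BQL behavior under discussion (netdev_sent_queue()/netdev_completed_queue() plus the XOFF check, as in the e1000 thread above) boils down to byte accounting on the queue. A simplified fixed-limit userspace model, not the kernel's dynamic-limit implementation:

```c
#include <assert.h>

/* Simplified fixed-limit model of BQL: charge bytes when descriptors
 * are posted, credit them at TX completion, and stop the queue while
 * too many bytes are in flight. Real BQL adapts the limit dynamically. */
struct byte_queue {
	unsigned int inflight;	/* bytes posted but not yet completed */
	unsigned int limit;	/* stop threshold */
	int stopped;		/* __QUEUE_STATE_XOFF analogue */
};

void bq_sent(struct byte_queue *q, unsigned int bytes)
{
	/* netdev_sent_queue() analogue, called on enqueue */
	q->inflight += bytes;
	if (q->inflight >= q->limit)
		q->stopped = 1;		/* netif_stop_queue() analogue */
}

void bq_completed(struct byte_queue *q, unsigned int bytes)
{
	/* netdev_completed_queue() analogue, called at TX completion */
	q->inflight -= bytes;
	if (q->stopped && q->inflight < q->limit)
		q->stopped = 0;		/* netif_wake_queue() analogue */
}
```

The open question in the thread is exactly what an XDP_TX sender should do when `stopped` is set, since there is no qdisc behind it to absorb the back pressure.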
[PATCH] ATM-iphase: Use kmalloc_array() in tx_init()
From: Markus ElfringDate: Fri, 9 Sep 2016 20:40:16 +0200 * Multiplications for the size determination of memory allocations indicated that array data structures should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. * Replace the specification of data types by pointer dereferences to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring --- drivers/atm/iphase.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/atm/iphase.c b/drivers/atm/iphase.c index 809dd1e..9d8807e 100644 --- a/drivers/atm/iphase.c +++ b/drivers/atm/iphase.c @@ -1975,7 +1975,9 @@ static int tx_init(struct atm_dev *dev) buf_desc_ptr++; tx_pkt_start += iadev->tx_buf_sz; } -iadev->tx_buf = kmalloc(iadev->num_tx_desc*sizeof(struct cpcs_trailer_desc), GFP_KERNEL); + iadev->tx_buf = kmalloc_array(iadev->num_tx_desc, + sizeof(*iadev->tx_buf), + GFP_KERNEL); if (!iadev->tx_buf) { printk(KERN_ERR DEV_LABEL " couldn't get mem\n"); goto err_free_dle; @@ -1995,8 +1997,9 @@ static int tx_init(struct atm_dev *dev) sizeof(*cpcs), DMA_TO_DEVICE); } -iadev->desc_tbl = kmalloc(iadev->num_tx_desc * - sizeof(struct desc_tbl_t), GFP_KERNEL); + iadev->desc_tbl = kmalloc_array(iadev->num_tx_desc, + sizeof(*iadev->desc_tbl), + GFP_KERNEL); if (!iadev->desc_tbl) { printk(KERN_ERR DEV_LABEL " couldn't get mem\n"); goto err_free_all_tx_bufs; @@ -2124,7 +2127,9 @@ static int tx_init(struct atm_dev *dev) memset((caddr_t)(iadev->seg_ram+i), 0, iadev->num_vc*4); vc = (struct main_vc *)iadev->MAIN_VC_TABLE_ADDR; evc = (struct ext_vc *)iadev->EXT_VC_TABLE_ADDR; -iadev->testTable = kmalloc(sizeof(long)*iadev->num_vc, GFP_KERNEL); + iadev->testTable = kmalloc_array(iadev->num_vc, +sizeof(*iadev->testTable), +GFP_KERNEL); if (!iadev->testTable) { printk("Get freepage failed\n"); goto err_free_desc_tbl; -- 2.10.0
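The point of kmalloc_array() over an open-coded kmalloc(n * size, ...) is that the element-count multiplication is overflow-checked before allocating. A userspace sketch of that check (not the kernel implementation):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate an array of n elements of the given size, refusing any
 * request whose byte count would overflow size_t. An unchecked
 * n * size that wraps around silently allocates a too-small buffer. */
void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would wrap around */
	return malloc(n * size);
}
```

Using sizeof(*ptr) instead of a spelled-out type, as the patch does, keeps the size correct even if the pointee type later changes.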
[PATCH] net: ip, diag -- Add diag interface for raw sockets
In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. CC: David S. MillerCC: Eric Dumazet CC: Alexey Kuznetsov CC: James Morris CC: Hideaki YOSHIFUJI CC: Patrick McHardy CC: Andrey Vagin CC: Stephen Hemminger Signed-off-by: Cyrill Gorcunov --- Take a look please, once time permit. Hopefully I didn't miss something obvious, tested as "ss -n -A raw" for modified iproute2 instance and c/r for trivial application which has raw sockets opened. A patch for ss tool is at https://goo.gl/VFQ93L for the reference, will send it out then. include/net/raw.h |5 + include/net/rawv6.h |5 + net/ipv4/Kconfig|8 ++ net/ipv4/Makefile |1 net/ipv4/raw.c |6 + net/ipv4/raw_diag.c | 192 net/ipv6/raw.c |6 + 7 files changed, 219 insertions(+), 4 deletions(-) Index: linux-ml.git/include/net/raw.h === --- linux-ml.git.orig/include/net/raw.h +++ linux-ml.git/include/net/raw.h @@ -23,6 +23,11 @@ extern struct proto raw_prot; +extern struct raw_hashinfo raw_v4_hashinfo; +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk, +unsigned short num, __be32 raddr, +__be32 laddr, int dif); + void raw_icmp_error(struct sk_buff *, int, u32); int raw_local_deliver(struct sk_buff *, int); Index: linux-ml.git/include/net/rawv6.h === --- linux-ml.git.orig/include/net/rawv6.h +++ linux-ml.git/include/net/rawv6.h @@ -3,6 +3,11 @@ #include +extern struct raw_hashinfo raw_v6_hashinfo; +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk, +unsigned short num, const struct in6_addr *loc_addr, +const struct in6_addr *rmt_addr, int dif); + void raw6_icmp_error(struct sk_buff *, int nexthdr, u8 type, u8 code, int inner_offset, __be32); bool raw6_local_deliver(struct sk_buff *, int); Index: linux-ml.git/net/ipv4/Kconfig === --- linux-ml.git.orig/net/ipv4/Kconfig +++ linux-ml.git/net/ipv4/Kconfig @@ -430,6 +430,14 @@ config 
INET_UDP_DIAG Support for UDP socket monitoring interface used by the ss tool. If unsure, say Y. +config INET_RAW_DIAG + tristate "RAW: socket monitoring interface" + depends on INET_DIAG && (IPV6 || IPV6=n) + default n + ---help--- + Support for RAW socket monitoring interface used by the ss tool. + If unsure, say Y. + config INET_DIAG_DESTROY bool "INET: allow privileged process to administratively close sockets" depends on INET_DIAG Index: linux-ml.git/net/ipv4/Makefile === --- linux-ml.git.orig/net/ipv4/Makefile +++ linux-ml.git/net/ipv4/Makefile @@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER) += netfilter.o n obj-$(CONFIG_INET_DIAG) += inet_diag.o obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o +obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o Index: linux-ml.git/net/ipv4/raw.c === --- linux-ml.git.orig/net/ipv4/raw.c +++ linux-ml.git/net/ipv4/raw.c @@ -89,9 +89,10 @@ struct raw_frag_vec { int hlen; }; -static struct raw_hashinfo raw_v4_hashinfo = { +struct raw_hashinfo raw_v4_hashinfo = { .lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock), }; +EXPORT_SYMBOL_GPL(raw_v4_hashinfo); int raw_hash_sk(struct sock *sk) { @@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk) } EXPORT_SYMBOL_GPL(raw_unhash_sk); -static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk, +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk, unsigned short num, __be32 raddr, __be32 laddr, int dif) { sk_for_each_from(sk) { @@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru found: return sk; } +EXPORT_SYMBOL_GPL(__raw_v4_lookup); /* * 0 - deliver Index: linux-ml.git/net/ipv4/raw_diag.c === --- /dev/null +++ linux-ml.git/net/ipv4/raw_diag.c @@ -0,0 +1,192 @@ +#include + +#include +#include + +#include +#include + +#ifdef pr_fmt +# undef pr_fmt +#endif + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +static
RE: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz
On 9/9/2016, Sergei Shtylyov wrote: > > sh_eth_private *mdp) { > > if (sh_eth_is_rz_fast_ether(mdp)) { > > sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */ > > + sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL, > > +TSU_FWSLC);/* Enable POST registers */ > > return; > > } > > Wait, don't you also need to write 0s to the POST registers like done > at the end of this function? Nope. The sh_eth_chip_reset() function will write to register ARSTR which will do a HW reset on the block and clear all the registers, including all the POST registers. static struct sh_eth_cpu_data r7s72100_data = { .chip_reset = sh_eth_chip_reset, So, before sh_eth_tsu_init() is ever called, the hardware will always be reset. /* initialize first or needed device */ if (!devno || pd->needs_init) { if (mdp->cd->chip_reset) mdp->cd->chip_reset(ndev); if (mdp->cd->tsu) { /* TSU init (Init only)*/ sh_eth_tsu_init(mdp); } } Therefore there is no reason to set the POST registers back to 0 because they are already at 0 from the reset. Chris
Re: Minimum MTU Mess
On Thu, Sep 08, 2016 at 03:24:13AM +0200, Andrew Lunn wrote: > > This is definitely going to require a few passes... (Working my way > > through every driver with an ndo_change_mtu wired up right now to > > see just how crazy this might get). > > It might be something Coccinelle can help you with. Try describing the > transformation you want to do, to their mailing list, and they might > come up with a script for you. From looking everything over, I'd be very surprised if they could. The places where things need changing vary quite wildly by driver, but I've actually got a full set of compiling changes with a cumulative diffstat of: 153 files changed, 599 insertions(+), 1002 deletions(-) Actually breaking this up into easily digestible/mergeable chunks is going to be kind of entertaining... Suggestions welcomed on that. First up is obviously the core change, which touches just net/ethernet/eth.c, net/core/dev.c, include/linux/netdevice.h and include/uapi/linux/if_ether.h, and should let existing code continue to Just Work(tm), though devices using ether_setup() that had no MTU range checking (or one or the other missing) will wind up with new bounds. For the most part, after the initial patch, very few of the others would have any direct interaction with any others, so they could all be singletons, or small batches per-vendor, or whatever.
Full diffstat for the aid of discussion on how to break it up: drivers/char/pcmcia/synclink_cs.c | 1 - drivers/firewire/net.c | 14 ++--- drivers/infiniband/hw/nes/nes.c| 1 - drivers/infiniband/hw/nes/nes.h| 4 +- drivers/infiniband/hw/nes/nes_nic.c| 7 +-- drivers/misc/sgi-xp/xpnet.c| 21 ++-- drivers/net/ethernet/agere/et131x.c| 7 +-- drivers/net/ethernet/altera/altera_tse.h | 1 - drivers/net/ethernet/altera/altera_tse_main.c | 12 ++--- drivers/net/ethernet/amd/amd8111e.c| 5 +- drivers/net/ethernet/atheros/alx/hw.h | 1 - drivers/net/ethernet/atheros/alx/main.c| 9 +--- drivers/net/ethernet/atheros/atl1c/atl1c_main.c| 41 +- drivers/net/ethernet/atheros/atl1e/atl1e_main.c| 11 ++-- drivers/net/ethernet/atheros/atlx/atl1.c | 15 +++--- drivers/net/ethernet/atheros/atlx/atl2.c | 14 +++-- drivers/net/ethernet/broadcom/b44.c| 5 +- drivers/net/ethernet/broadcom/bcm63xx_enet.c | 30 +++ drivers/net/ethernet/broadcom/bnx2.c | 8 ++- drivers/net/ethernet/broadcom/bnx2.h | 6 +-- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h| 2 +- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c| 8 +-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 22 +++- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 4 ++ drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 +-- drivers/net/ethernet/broadcom/tg3.c| 7 +-- drivers/net/ethernet/brocade/bna/bnad.c| 7 +-- drivers/net/ethernet/cadence/macb.c| 17 +++--- drivers/net/ethernet/calxeda/xgmac.c | 18 ++- drivers/net/ethernet/cavium/liquidio/lio_main.c| 15 ++ .../net/ethernet/cavium/liquidio/octeon_network.h | 2 +- drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 5 +- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 10 ++-- drivers/net/ethernet/chelsio/cxgb/cxgb2.c | 2 - drivers/net/ethernet/cisco/enic/enic_main.c| 7 +-- drivers/net/ethernet/cisco/enic/enic_res.h | 2 +- drivers/net/ethernet/dlink/dl2k.c | 22 ++-- drivers/net/ethernet/dlink/sundance.c | 6 ++- drivers/net/ethernet/freescale/gianfar.c | 9 ++-- drivers/net/ethernet/hisilicon/hns/hns_enet.c | 4 -- 
drivers/net/ethernet/ibm/ehea/ehea_main.c | 13 ++--- drivers/net/ethernet/ibm/emac/core.c | 7 +-- drivers/net/ethernet/intel/e100.c | 9 drivers/net/ethernet/intel/e1000/e1000_main.c | 12 ++--- drivers/net/ethernet/intel/e1000e/netdev.c | 14 +++-- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c| 15 ++ drivers/net/ethernet/intel/i40e/i40e_main.c| 10 ++-- drivers/net/ethernet/intel/i40evf/i40evf_main.c| 8 +-- drivers/net/ethernet/intel/igb/e1000_defines.h | 3 +- drivers/net/ethernet/intel/igb/igb_main.c | 16 ++ drivers/net/ethernet/intel/igbvf/defines.h | 3 +- drivers/net/ethernet/intel/igbvf/netdev.c | 14 ++--- drivers/net/ethernet/intel/ixgb/ixgb_main.c| 16 ++ drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 ++-- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 33 ++-- drivers/net/ethernet/marvell/mvneta.c | 36 - drivers/net/ethernet/marvell/mvpp2.c | 36 -
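The core change described above amounts to a generic bounds check in the MTU setter, so each driver only declares its range instead of open-coding the check in ndo_change_mtu(). A sketch under the assumption of per-device min/max fields (the field names here are illustrative):

```c
#include <assert.h>

/* Generic MTU range check: drivers fill in min_mtu/max_mtu once and
 * drop their per-driver range checks from ndo_change_mtu(). */
struct net_dev {
	unsigned int min_mtu;	/* e.g. 68 for IPv4 */
	unsigned int max_mtu;	/* e.g. 1500, or larger for jumbo-capable HW */
	unsigned int mtu;
};

int dev_set_mtu(struct net_dev *dev, unsigned int new_mtu)
{
	if (new_mtu < dev->min_mtu || new_mtu > dev->max_mtu)
		return -22;	/* -EINVAL */
	dev->mtu = new_mtu;	/* range OK; driver callback would run here */
	return 0;
}
```

With the check centralized, the per-driver diffs above mostly reduce to deleting validation code and initializing the two bounds.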
[PATCH net-next] Revert "hv_netvsc: make inline functions static"
From: Stephen Hemminger These functions are used by other code in the misc-next tree. This reverts commit 30d1de08c87ddde6f73936c3350e7e153988fe02. Signed-off-by: Stephen Hemminger --- drivers/net/hyperv/netvsc.c | 85 + include/linux/hyperv.h | 84 2 files changed, 85 insertions(+), 84 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 2a9ccc4..ff05b9b 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -34,89 +34,6 @@ #include "hyperv_net.h" /* - * An API to support in-place processing of incoming VMBUS packets. - */ -#define VMBUS_PKT_TRAILER 8 - -static struct vmpacket_descriptor * -get_next_pkt_raw(struct vmbus_channel *channel) -{ - struct hv_ring_buffer_info *ring_info = &channel->inbound; - u32 read_loc = ring_info->priv_read_index; - void *ring_buffer = hv_get_ring_buffer(ring_info); - struct vmpacket_descriptor *cur_desc; - u32 packetlen; - u32 dsize = ring_info->ring_datasize; - u32 delta = read_loc - ring_info->ring_buffer->read_index; - u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta); - - if (bytes_avail_toread < sizeof(struct vmpacket_descriptor)) - return NULL; - - if ((read_loc + sizeof(*cur_desc)) > dsize) - return NULL; - - cur_desc = ring_buffer + read_loc; - packetlen = cur_desc->len8 << 3; - - /* -* If the packet under consideration is wrapping around, -* return failure. -*/ - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1)) - return NULL; - - return cur_desc; -} - -/* - * A helper function to step through packets "in-place" - * This API is to be called after each successful call - * get_next_pkt_raw().
- */ -static void put_pkt_raw(struct vmbus_channel *channel, - struct vmpacket_descriptor *desc) -{ - struct hv_ring_buffer_info *ring_info = &channel->inbound; - u32 read_loc = ring_info->priv_read_index; - u32 packetlen = desc->len8 << 3; - u32 dsize = ring_info->ring_datasize; - - BUG_ON((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize); - - /* -* Include the packet trailer. -*/ - ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER; -} - -/* - * This call commits the read index and potentially signals the host. - * Here is the pattern for using the "in-place" consumption APIs: - * - * while (get_next_pkt_raw() { - * process the packet "in-place"; - * put_pkt_raw(); - * } - * if (packets processed in place) - * commit_rd_index(); - */ -static void commit_rd_index(struct vmbus_channel *channel) -{ - struct hv_ring_buffer_info *ring_info = &channel->inbound; - /* -* Make sure all reads are done before we update the read index since -* the writer may start writing to the read area once the read index -* is updated. -*/ - virt_rmb(); - ring_info->ring_buffer->read_index = ring_info->priv_read_index; - - if (hv_need_to_signal_on_read(ring_info)) - vmbus_set_event(channel); -} - -/* * Switch the data path from the synthetic interface to the VF * interface. */ @@ -840,7 +757,7 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, return msg_size; } -static int netvsc_send_pkt( +static inline int netvsc_send_pkt( struct hv_device *device, struct hv_netvsc_packet *packet, struct netvsc_device *net_device, diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index b01c8c3..5df444b 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1429,4 +1429,88 @@ static inline bool hv_need_to_signal_on_read(struct hv_ring_buffer_info *rbi) return false; } +/* + * An API to support in-place processing of incoming VMBUS packets. + */ +#define VMBUS_PKT_TRAILER 8 + +static inline struct vmpacket_descriptor * +get_next_pkt_raw(struct vmbus_channel *channel) +{ + struct hv_ring_buffer_info *ring_info = &channel->inbound; + u32 read_loc = ring_info->priv_read_index; + void *ring_buffer = hv_get_ring_buffer(ring_info); + struct vmpacket_descriptor *cur_desc; + u32 packetlen; + u32 dsize = ring_info->ring_datasize; + u32 delta = read_loc - ring_info->ring_buffer->read_index; + u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta); + + if (bytes_avail_toread < sizeof(struct vmpacket_descriptor)) + return NULL; + + if ((read_loc + sizeof(*cur_desc)) > dsize) + return NULL; + + cur_desc = ring_buffer + read_loc; + packetlen = cur_desc->len8 << 3; + + /* +* If the packet under consideration is wrapping around, +* return failure. +*/ +
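The checks being moved back into hyperv.h guard an in-place read against the ring buffer wrapping. The geometry can be modeled in isolation (a sketch of the same conditions, not the VMBus code; the header size is a stand-in):

```c
#include <assert.h>

/* A descriptor at read_loc is only readable in place if its header,
 * its payload, and the trailing bookkeeping bytes all fit before the
 * end of the ring; otherwise the caller must fall back or wait. */
#define PKT_TRAILER	8
#define DESC_HDR_SIZE	16	/* sizeof(struct vmpacket_descriptor) stand-in */

int pkt_readable_in_place(unsigned int read_loc, unsigned int packetlen,
			  unsigned int ring_size)
{
	if (read_loc + DESC_HDR_SIZE > ring_size)
		return 0;	/* the descriptor header itself would wrap */
	if (read_loc + packetlen + PKT_TRAILER > ring_size - 1)
		return 0;	/* payload plus trailer would wrap */
	return 1;
}
```

The same reasoning explains the revert: these helpers have to live in the shared header once callers outside netvsc.c need the in-place consumption pattern.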
Re: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz
On 09/07/2016 09:57 PM, Chris Brandt wrote: Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers were not documented and left out of the driver for RZ/A making the CAM feature non-operational. Additionally, when the offset values for POST1-4 are left blank, the driver attempts to set them using an offset of 0x which can cause a memory corruption or panic. This patch fixes the panic and properly enables CAM. Reported-by: Daniel Palmer Signed-off-by: Chris Brandt --- v2: * POST registers really do exist, so just add them --- drivers/net/ethernet/renesas/sh_eth.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 1f8240a..440ae27 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c [...] @@ -2781,6 +2786,8 @@ static void sh_eth_tsu_init(struct sh_eth_private *mdp) { if (sh_eth_is_rz_fast_ether(mdp)) { sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */ + sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL, +TSU_FWSLC);/* Enable POST registers */ return; } Wait, don't you also need to write 0s to the POST registers like done at the end of this function? MBR, Sergei
Re: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz
Hello. On 09/07/2016 09:57 PM, Chris Brandt wrote: Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers were not documented and left out of the driver for RZ/A making the CAM feature non-operational. Additionally, when the offset values for POST1-4 are left blank, the driver attempts to set them using an offset of 0x which can cause a memory corruption or panic. You didn't really fix the root cause here... This patch fixes the panic and properly enables CAM. Reported-by: Daniel Palmer Signed-off-by: Chris Brandt --- v2: * POST registers really do exist, so just add them Acked-by: Sergei Shtylyov MBR, Sergei
Re: [PATCH 7/8] sctp: use IS_ENABLED() instead of checking for built-in or module
On Fri, Sep 09, 2016 at 08:43:19AM -0400, Javier Martinez Canillas wrote: > The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either > built-in or as a module, use that macro instead of open coding the same. > > Using the macro makes the code more readable by helping abstract away some > of the Kconfig built-in and module enable details. > > Signed-off-by: Javier Martinez Canillas> --- > > net/sctp/auth.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/auth.c b/net/sctp/auth.c > index 912eb1685a5d..f99d4855d3de 100644 > --- a/net/sctp/auth.c > +++ b/net/sctp/auth.c > @@ -48,7 +48,7 @@ static struct sctp_hmac sctp_hmac_list[SCTP_AUTH_NUM_HMACS] > = { > /* id 2 is reserved as well */ > .hmac_id = SCTP_AUTH_HMAC_ID_RESERVED_2, > }, > -#if defined (CONFIG_CRYPTO_SHA256) || defined (CONFIG_CRYPTO_SHA256_MODULE) > +#if IS_ENABLED(CONFIG_CRYPTO_SHA256) > { > .hmac_id = SCTP_AUTH_HMAC_ID_SHA256, > .hmac_name = "hmac(sha256)", > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Acked-by: Neil Horman
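For reference, IS_ENABLED() works by token-pasting against the config symbol, so it is true for both =y and =m builds. This is essentially the kernel's kconfig.h trick, reproduced here standalone with a pretend config symbol for illustration:

```c
#include <assert.h>

/* IS_ENABLED(option) expands to 1 if the option is defined to 1
 * (built-in, =y) or if <option>_MODULE is defined to 1 (=m), else 0.
 * The placeholder trick turns "defined to 1" into an extra macro
 * argument so the second-argument selector picks 1 vs 0. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x)			___is_defined(x)
#define ___is_defined(val)		____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk)	__take_second_arg(arg1_or_junk 1, 0)
#define IS_BUILTIN(option)	__is_defined(option)
#define IS_MODULE(option)	__is_defined(option##_MODULE)
#define IS_ENABLED(option)	(IS_BUILTIN(option) || IS_MODULE(option))

#define CONFIG_CRYPTO_SHA256_MODULE 1	/* pretend sha256 is built as =m */
```

This is exactly why the patch is safe: the old `defined(CONFIG_CRYPTO_SHA256) || defined(CONFIG_CRYPTO_SHA256_MODULE)` and `IS_ENABLED(CONFIG_CRYPTO_SHA256)` evaluate identically in both configurations.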
Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail
> I don't know, I still don't feel safe about it. I agree the socket lock keeps > the state from changing during a single transmission, which makes the use case > you are focused on correct. ok, :-) > > That said, have you considered the retransmit case? That is to say, if you > queue and flush the outq, and some packets fail delivery, and in the time > between the intial send and the expiration of the RTX timer (during which the > socket lock will have been released), an event may occur which changes the > transport state, which will then be ignored with your patch. Sorry, I'm not sure if I got it. You mean "during which changes q->asoc->state", right ? This patch removes the check of q->asoc->state in sctp_outq_tail(). sctp_outq_tail() is called for data only in: sctp_primitive_SEND -> sctp_do_sm -> sctp_cmd_send_msg -> sctp_cmd_interpreter -> sctp_cmd_send_msg() -> sctp_outq_tail() before calling sctp_primitive_SEND, hold sock lock first. then sctp_primitive_SEND choose FUNC according: #define TYPE_SCTP_PRIMITIVE_SEND { if asoc->state is unavailable, FUNC can't be sctp_cmd_send_msg, but sctp_sf_error_closed/sctp_sf_error_shutdown, sctp_outq_tail can't be called, either. I mean sctp_primitive_SEND do the same check for asoc->state already actually. so the code in sctp_outq_tail is redundant actually. > > Neil >
Re: [PATCH net-next] macsec: set network devtype
2016-09-08, 17:24:07 -0700, David Miller wrote: > From: Stephen Hemminger > Date: Wed, 7 Sep 2016 14:07:32 -0700 > > > The netdevice type structure for macsec was being defined but never used. > > To set the network device type the macro SET_NETDEV_DEVTYPE must be called. > > Compile tested only, I don't use macsec. > > > > Signed-off-by: Stephen Hemminger > > Sabrina, please review. > > Thanks. Sorry for the delay. LGTM: Acked-by: Sabrina Dubroca -- Sabrina
Re: [iproute PATCH] macsec: fix input range of 'icvlen' parameter
2016-09-09, 16:02:22 +0200, Davide Caratti wrote: > the maximum possible ICV length in a MACsec frame is 16 octets, not 32: > fix get_icvlen() accordingly, so that a proper error message is displayed > in case input 'icvlen' is greater than 16. > > Signed-off-by: Davide Caratti Acked-by: Sabrina Dubroca -- Sabrina
Re: [PATCH net] net_sched: act_mirred: full rcu conversion
On 16-09-08 10:26 PM, Cong Wang wrote: > On Thu, Sep 8, 2016 at 8:51 AM, Eric Dumazet wrote: >> On Thu, 2016-09-08 at 08:47 -0700, John Fastabend wrote: >> >>> Works for me. FWIW I find this plenty straightforward and don't really >>> see the need to make the hash table itself rcu friendly. >>> >>> Acked-by: John Fastabend >>> >> >> Yes, it seems this hash table is used in control path, with RTNL held >> anyway. > > Seriously? You never read hashtable in fast path?? I think you need > to wake up. > But the actions use refcnt'ing and should never be decremented to zero as long as they can still be referenced by an active filter. If each action handles its parameters like mirred/gact then I don't see why it's necessary. I believe though that the refcnt needs to be fixed a bit, most likely by making it atomic. I originally assumed it was protected by RTNL lock, but because it's getting decremented from an rcu callback this is not true. .John
Re: [PATCH net-next V7 4/4] net/sched: Introduce act_tunnel_key
On 16-09-09 06:19 AM, Eric Dumazet wrote: > On Thu, 2016-09-08 at 22:30 -0700, Cong Wang wrote: >> On Thu, Sep 8, 2016 at 9:15 AM, John Fastabend>> wrote: >>> >>> This should be rtnl_derefence(t->params) and drop the read_lock/unlock >>> pair. This is always called with RTNL lock unless you have a path I'm >>> not seeing. >> >> You missed the previous discussion on V6, John. >> >> BTW, you really should follow the whole discussion instead of >> jumping in the middle, like what you did for my patchset. >> I understand you are eager to comment, but please don't waste >> others' time in this way Please. > > But John is right, and he definitely is welcome to give his feedback > even at V13 if he wants to. > > tunnel_key_dump() is called with RTNL being held. > > Take a deep breath, vacations, and come back when you are relaxed. > > Thanks. > > Also v6 discussion was around cleanup() call back I see nothing about the dump() callbacks. And if there was it wasn't fixed so it should be resolved. Anyways Dave/Hadar feel free to submit a follow up patch or v8 it doesn't much matter to me as noted in the original post. .John
Re: [RFC Patch net-next 5/6] net_sched: use rcu in fast path
On 16-09-08 10:54 PM, Cong Wang wrote: > On Thu, Sep 8, 2016 at 8:49 AM, John Fastabend> wrote: >> Agreed not sure why you would ever want to do a late binding and >> replace on a tc_mirred actions. But it is supported... > > I will let Jamal teach you on this, /me is really tired of explaining > things to you John. > This was a meta-comment on the use case for doing this with mirred action. Not necessarily about the patch itself. I was actually curious where this happens in practice. The only thing I can think of is your external logging box moved so you need to send out another port. Is there any open source software that manages 'tc' like this. If so I would like to read it. So do you know of any? .John
Re: [PATCH net-next v5] gso: Support partial splitting at the frag_list pointer
On Fri, Sep 9, 2016 at 12:25 AM, Steffen Klassert wrote: > Since commit 8a29111c7 ("net: gro: allow to build full sized skb") > gro may build buffers with a frag_list. This can hurt forwarding > because most NICs can't offload such packets, they need to be > segmented in software. This patch splits buffers with a frag_list > at the frag_list pointer into buffers that can be TSO offloaded. > > Signed-off-by: Steffen Klassert > --- > > Changes since v1: > > - Use the assumption that all buffers in the chain excluding the last > contain the same amount of data. > > - Simplify some checks against gso partial. > > - Fix the generation of IP IDs. > > Changes since v2: > > - Merge common code of gso partial and frag_list pointer splitting. > > Changes since v3: > > - Fix the checks for doing frag_list pointer splitting. > > Changes since v4: > > - Whitespace fix. > - Fix size calculations of the tail packet. > > net/core/skbuff.c | 51 +++--- > net/ipv4/af_inet.c | 14 ++ > net/ipv4/gre_offload.c | 6 -- > net/ipv4/tcp_offload.c | 13 +++-- > net/ipv4/udp_offload.c | 6 -- > net/ipv6/ip6_offload.c | 5 - > 6 files changed, 69 insertions(+), 26 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 3864b4b6..51e761a 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -3078,11 +3078,31 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, <...> > @@ -3090,6 +3110,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > partial_segs = 0; > } > > +normal: > headroom = skb_headroom(head_skb); > pos = skb_headlen(head_skb); > > @@ -3281,21 +3302,29 @@ perform_csum_check: > */ > segs->prev = tail; > > - /* Update GSO info on first skb in partial sequence. 
*/ > if (partial_segs) { > + struct sk_buff *iter; > int type = skb_shinfo(head_skb)->gso_type; > + unsigned short gso_size = skb_shinfo(head_skb)->gso_size; > > /* Update type to add partial and then remove dodgy if set */ > - type |= SKB_GSO_PARTIAL; > + type |= (features & NETIF_F_GSO_PARTIAL) / NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL; > type &= ~SKB_GSO_DODGY; > > /* Update GSO info and prepare to start updating headers on > * our way back down the stack of protocols. > */ > - skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size; > - skb_shinfo(segs)->gso_segs = partial_segs; > - skb_shinfo(segs)->gso_type = type; > - SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset; > + for (iter = segs; iter; iter = iter->next) { > + skb_shinfo(iter)->gso_size = gso_size; > + skb_shinfo(iter)->gso_segs = partial_segs; > + skb_shinfo(iter)->gso_type = type; > + SKB_GSO_CB(iter)->data_offset = skb_headroom(iter) + doffset; > + } > + > + if (tail->len <= gso_size) > + skb_shinfo(tail)->gso_size = 0; Actually we need to do tail->len - doffset up here as well. The gso_size value reflects the size of the data segment, and tail->len is the size of the entire frame, so we have to subtract the size of the headers to make the comparison accurate. > + else if (tail != segs) > + skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size); > } > > /* Following permits correct backpressure, for protocols
Re: [PATCH RFC 00/11] mlx5 RX refactoring and XDP support
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: > Hi All, > > This patch set introduces some important data path RX refactoring > addressing mlx5e memory allocation/management improvements and XDP support. > > Submitting as RFC since we would like to get early feedback, while we > continue reviewing, testing and completing the performance analysis in house. > Hi, I am going to be out of office for the whole next week with only sporadic mail access. I will do my best to be as active as possible, but in the meantime, Tariq and Or will handle any questions regarding this series or mlx5 in general while I am away. Thanks, Saeed.
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev wrote: > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: >> >> I'm sorry but I have a problem with this patch! > > is it because the variable is called 'xdp_doorbell'? > Frankly I see nothing scary in this patch. > It extends existing code by adding a flag to ring doorbell or not. > The end of rx napi is used as an obvious heuristic to flush the pipe. > Looks pretty generic to me. > The same code can be used for non-xdp as well once we figure out > good algorithm for xmit_more in the stack. > >> Looking at this patch, I want to bring up a fundamental architectural >> concern with the development direction of XDP transmit. >> >> >> What you are trying to implement, with delaying the doorbell, is >> basically TX bulking for TX_XDP. >> >> Why not implement a TX bulking interface directly instead?!? >> >> Yes, the tailptr/doorbell is the most costly operation, but why not >> also take advantage of the benefits of bulking for other parts of the >> code? (benefit is smaller, but every cycle counts in this area) >> >> This whole XDP exercise is about avoiding a transaction cost per >> packet; that reads "bulking" or "bundling" of packets, where possible. >> >> Lets do bundling/bulking from the start! Jesper, what we did here is also bulking: instead of bulking into a temporary list in the driver, we list the packets in the HW and, once done, we transmit all at once via the xdp_doorbell indication. I agree with you that we can take advantage and improve the icache by bulking first in software and then queueing all at once in the hw, then ringing one doorbell. But I also agree with Alexei that this will introduce extra pointer/list handling in the driver, and we need to compare both approaches before we decide which is better. This should be marked as future work rather than required from the start. 
> > mlx4 already does bulking and this proposed mlx5 set of patches > does bulking as well. > See nothing wrong about it. RX side processes the packets and > when it's done it tells TX to xmit whatever it collected. > >> The reason behind the xmit_more API is that we could not change the >> API of all the drivers. And we found that calling an explicit NDO >> flush came at a cost (only approx 7 ns IIRC), but it is still a cost that >> would hit the common single packet use-case. >> >> It should be really easy to build a bundle of packets that need XDP_TX >> action, especially given you only have a single destination "port". >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. > > not sure what are you proposing here? > Sounds like you want to extend it to multi port in the future? > Sure. The proposed code is easily extendable. > > Or you want to see something like a link list of packets > or an array of packets that RX side is preparing and then > send the whole array/list to TX port? > I don't think that would be efficient, since it would mean > unnecessary copy of pointers. > >> In the future, XDP needs to support XDP_FWD forwarding of packets/pages >> out other interfaces. I also want bulk transmit from day-1 here. It >> is slightly more tricky to sort packets for multiple outgoing >> interfaces efficiently in the poll loop. > > I don't think so. Multi port is natural extension to this set of patches. > With multi port the end of RX will tell multiple ports (that were > used to tx) to ring the bell. Pretty trivial and doesn't involve any > extra arrays or link lists. > >> But the mSwitch[1] article actually already solved this destination >> sorting. Please read[1] section 3.3 "Switch Fabric Algorithm" for >> understanding the next steps, for a smarter data structure, when >> starting to have more TX "ports". And perhaps align your single >> XDP_TX destination data structure to this future development. 
>> >> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf > > I don't see how this particular paper applies to the existing kernel code. > It's great to take ideas from research papers, but real code is different. > >> --Jesper >> (top post) > > since when it's ok to top post? > >> On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed >> wrote: >> >> > Previously we rang XDP SQ doorbell on every forwarded XDP packet. >> > >> > Here we introduce a xmit more like mechanism that will queue up more >> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware. >> > >> > Once RX napi budget is consumed and we exit napi RX loop, we will >> > flush (doorbell) all XDP looped packets in case there are such. >> > >> > XDP forward packet rate: >> > >> > Comparing XDP with and w/o xmit more (bulk transmit): >> > >> > Streams XDP TX XDP TX (xmit more) >> > --- >> > 1 4.90Mpps 7.50Mpps >> > 2 9.50Mpps 14.8Mpps >> > 4 16.5Mpps