Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Tom Herbert
On Fri, Sep 9, 2016 at 8:26 PM, John Fastabend  wrote:
> On 16-09-09 08:12 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov
>>  wrote:
>>> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
 On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend  
 wrote:
> On 16-09-09 06:04 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend 
>>  wrote:
>>> On 16-09-09 04:44 PM, Tom Herbert wrote:
 On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend 
  wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
 Hi John,

 How does this interact with BQL on e1000?

 Tom

>>>
>>> Let me check if I have the API correct. When we enqueue a packet to
>>> be sent we must issue a netdev_sent_queue() call and then on actual
>>> transmission issue a netdev_completed_queue().
>>>
>>> The patch attached here missed a few things though.
>>>
>>> But it looks like I just need to call netdev_sent_queue() from the
>>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>>> kick in which will call netdev_completed_queue() correctly.
>>>
>>> I'll need to add a check for the queue state as well. So if I do these
>>> three things,
>>>
>>> check __QUEUE_STATE_XOFF before sending
>>> netdev_sent_queue() -> on XDP_TX
>>> netdev_completed_queue()
>>>
>>> It should work, agree? Now should we do this even when XDP owns the
>>> queue? Or is this purely an issue with sharing the queue between
>>> XDP and stack.
>>>
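[Editor's note: the three-step BQL pattern John lists above can be sketched as a small self-contained userspace model. The `bql_queue` struct, the fixed limit, and all function names here are hypothetical stand-ins for illustration; the real kernel API is netdev_sent_queue()/netdev_completed_queue() on a struct netdev_queue, with an adaptive DQL limit rather than a fixed one.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state behind netdev_sent_queue()/
 * netdev_completed_queue(); the kernel tracks an adaptive limit,
 * a fixed one is used here for illustration. */
struct bql_queue {
    unsigned int inflight;  /* bytes queued but not yet completed */
    unsigned int limit;     /* bytes allowed in flight */
    bool stopped;           /* mirrors __QUEUE_STATE_XOFF */
};

/* On the xmit path, as netdev_sent_queue() would be called. */
static void bql_sent(struct bql_queue *q, unsigned int bytes)
{
    q->inflight += bytes;
    if (q->inflight >= q->limit)
        q->stopped = true;          /* stack stops the queue */
}

/* From TX completion, as netdev_completed_queue() would be called. */
static void bql_completed(struct bql_queue *q, unsigned int bytes)
{
    q->inflight -= bytes;
    if (q->inflight < q->limit)
        q->stopped = false;         /* queue is woken again */
}

/* The three steps from the mail: check XOFF, account on send,
 * rely on completion to unblock. XDP_TX has no qdisc to push
 * back on, so a stopped queue means drop. */
static bool xdp_tx_one(struct bql_queue *q, unsigned int bytes)
{
    if (q->stopped)
        return false;               /* dropped */
    bql_sent(q, bytes);
    return true;
}
```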
>> But what is the action for XDP_TX if the queue is stopped? There is no
>> qdisc to back pressure in the XDP path. Would we just start dropping
>> packets then?
>
> Yep, that is what the patch does: if there is any sort of error, packets
> get dropped on the floor. I don't think there is anything else that
> can be done.
>
 That probably means that the stack will always win out under load.
 Trying to use the same queue where half of the packets are well
 managed by a qdisc and half aren't is going to leave someone unhappy.
 Maybe in this case where we have to share the qdisc we can
 allocate the skb on returning XDP_TX and send through the normal
 qdisc for the device.
>>>
>>> I wouldn't go to such extremes for e1k.
>>> The only reason to have xdp in e1k is to use it for testing
>>> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
>>
>> I imagine someone may want this for the non-forwarding use cases like
>> early drop for DOS mitigation. Regardless of the use case, I don't
>> think we can break the fundamental assumptions made for qdiscs or the
>> rest of the transmit path. If XDP must transmit on a queue shared with
>> the stack we need to abide by the stack's rules for transmitting on
>> the queue-- which would mean alloc skbuff and go through qdisc (which
>
> If we require XDP_TX to go up to qdisc layer it's best not to implement
> it at all and just handle it in normal ingress path. That said I think
> users have to expect that XDP will interfere with qdisc schemes. Even
> with its own tx queue it's going to interfere at the hardware level with
> bandwidth as the hardware round robins through the queues or uses
> whatever hardware strategy it is configured to use. Additionally it
> will bypass things like BQL, etc.
>
Right, but not all use cases involve XDP_TX (like DOS mitigation as I
pointed out). Since you've already done 95% of the work, can you take
a look at creating the skbuff and injecting into the stack for XDP_TX
so we can evaluate the performance and impact of that :-)

With separate TX queues it's explicit which queues are managed by the
stack. This is no different than what kernel bypass gives us; we are
relying on HW to do something reasonable in scheduling MQ.

>> really shouldn't be difficult to implement). Emulating various
>> functions of the stack in the XDP TX path, like this patch seems to be
>> doing for XMIT_MORE, potentially gets us into a whack-a-mole situation
>> trying to keep things coherent.
>
> I think bundling tx xmits is fair game as an internal optimization and
> doesn't need to be exposed at the XDP layer. Drivers already do this
>> type of optimizations for allocating buffers.

Re: [PATCH] ATM-iphase: Use kmalloc_array() in tx_init()

2016-09-09 Thread David Miller
From: SF Markus Elfring 
Date: Fri, 9 Sep 2016 20:42:16 +0200

> From: Markus Elfring 
> Date: Fri, 9 Sep 2016 20:40:16 +0200
> 
> * Multiplications for the size determination of memory allocations
>   indicated that array data structures should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of data types by pointer dereferences
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 

Applied.
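[Editor's note: the transformation the commit message above describes guards the size multiplication against overflow. A userspace model of the difference (`kmalloc_array_m` is a hypothetical stand-in for the kernel's kmalloc_array(); the real one also takes gfp flags):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Why kmalloc_array() is preferred over kmalloc(n * size):
 * it refuses multiplications that would overflow size_t
 * instead of silently wrapping to a too-small allocation. */
static void *kmalloc_array_m(size_t n, size_t size)
{
    if (size != 0 && n > (size_t)-1 / size)
        return NULL;            /* n * size would overflow */
    return malloc(n * size);
}
```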


Re: [PATCH v3 0/9] net-next: ethernet: add sun8i-emac driver

2016-09-09 Thread David Miller
From: Corentin Labbe 
Date: Fri,  9 Sep 2016 14:45:08 +0200

> This patch series add the driver for sun8i-emac which handle the
> Ethernet MAC present on Allwinner H3/A83T/A64 SoCs.

Please don't post a patch series with some subset of the series
marked as "RFC".  I will just simply toss the entire series when
you do this.

Thank you.


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Alexei Starovoitov
On Fri, Sep 09, 2016 at 08:12:52PM -0700, Tom Herbert wrote:
> >> That probably means that the stack will always win out under load.
> >> Trying to use the same queue where half of the packets are well
> >> managed by a qdisc and half aren't is going to leave someone unhappy.
> >> Maybe in this case where we have to share the qdisc we can
> >> allocate the skb on returning XDP_TX and send through the normal
> >> qdisc for the device.
> >
> > I wouldn't go to such extremes for e1k.
> > The only reason to have xdp in e1k is to use it for testing
> > of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
> 
> I imagine someone may want this for the non-forwarding use cases like
> early drop for DOS mitigation.

Sure, and they will be doing it on the NICs that they have in their servers.
e1k is not that nic. xdp e1k is for debugging xdp programs in KVM only.
Performance of such xdp programs on e1k is irrelevant.
There is absolutely no need to complicate the driver and the patches.
All other drivers are a different story.



Re: [PATCH net-next 0/4] alx: add msi-x support

2016-09-09 Thread David Miller
From: Tobias Regnery 
Date: Fri,  9 Sep 2016 12:19:51 +0200

> This patchset adds msi-x support to the alx driver. It is a preparatory
> series for multi queue support, which I am currently working on. As there
> is no advantage over msi interrupts without multi queue support, msi-x
> interrupts are disabled by default. In order to test for regressions, a
> new module parameter is added to enable msi-x interrupts.
> 
> Based on information of the downstream driver at github.com/qca/alx

Series applied, thanks.


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread John Fastabend
On 16-09-09 08:12 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov
>  wrote:
>> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
>>> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend  
>>> wrote:
 On 16-09-09 06:04 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  
> wrote:
>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend 
>>>  wrote:
 e1000 supports a single TX queue so it is being shared with the stack
 when XDP runs XDP_TX action. This requires taking the xmit lock to
 ensure we don't corrupt the tx ring. To avoid taking and dropping the
 lock per packet this patch adds a bundling implementation to submit
 a bundle of packets to the xmit routine.

 I tested this patch running e1000 in a VM using KVM over a tap
 device using pktgen to generate traffic along with 'ping -f -l 100'.

>>> Hi John,
>>>
>>> How does this interact with BQL on e1000?
>>>
>>> Tom
>>>
>>
>> Let me check if I have the API correct. When we enqueue a packet to
>> be sent we must issue a netdev_sent_queue() call and then on actual
>> transmission issue a netdev_completed_queue().
>>
>> The patch attached here missed a few things though.
>>
>> But it looks like I just need to call netdev_sent_queue() from the
>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>> kick in which will call netdev_completed_queue() correctly.
>>
>> I'll need to add a check for the queue state as well. So if I do these
>> three things,
>>
>> check __QUEUE_STATE_XOFF before sending
>> netdev_sent_queue() -> on XDP_TX
>> netdev_completed_queue()
>>
>> It should work, agree? Now should we do this even when XDP owns the
>> queue? Or is this purely an issue with sharing the queue between
>> XDP and stack.
>>
> But what is the action for XDP_TX if the queue is stopped? There is no
> qdisc to back pressure in the XDP path. Would we just start dropping
> packets then?

 Yep, that is what the patch does: if there is any sort of error, packets
 get dropped on the floor. I don't think there is anything else that
 can be done.

>>> That probably means that the stack will always win out under load.
>>> Trying to use the same queue where half of the packets are well
>>> managed by a qdisc and half aren't is going to leave someone unhappy.
>>> Maybe in this case where we have to share the qdisc we can
>>> allocate the skb on returning XDP_TX and send through the normal
>>> qdisc for the device.
>>
>> I wouldn't go to such extremes for e1k.
>> The only reason to have xdp in e1k is to use it for testing
>> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
> 
> I imagine someone may want this for the non-forwarding use cases like
> early drop for DOS mitigation. Regardless of the use case, I don't
> think we can break the fundamental assumptions made for qdiscs or the
> rest of the transmit path. If XDP must transmit on a queue shared with
> the stack we need to abide by the stack's rules for transmitting on
> the queue-- which would mean alloc skbuff and go through qdisc (which

If we require XDP_TX to go up to qdisc layer it's best not to implement
it at all and just handle it in normal ingress path. That said I think
users have to expect that XDP will interfere with qdisc schemes. Even
with its own tx queue it's going to interfere at the hardware level with
bandwidth as the hardware round robins through the queues or uses
whatever hardware strategy it is configured to use. Additionally it
will bypass things like BQL, etc.

> really shouldn't be difficult to implement). Emulating various
> functions of the stack in the XDP TX path, like this patch seems to be
> doing for XMIT_MORE, potentially gets us into a whack-a-mole situation
> trying to keep things coherent.

I think bundling tx xmits is fair game as an internal optimization and
doesn't need to be exposed at the XDP layer. Drivers already do this
type of optimization for allocating buffers. It likely doesn't matter
much at the e1k level, but my gut feeling is that doing a tail update on
every packet will be noticeable with the 40Gbps drivers.
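[Editor's note: the effect John describes can be illustrated with a toy model counting doorbell writes. The `tx_ring` struct and function names here are hypothetical, not the e1000 code; what matters is how many tail (MMIO) writes the device sees per burst.]

```c
#include <assert.h>

/* Toy model of a TX ring: each tail write is an expensive
 * MMIO doorbell that also lets the NIC start fetching. */
struct tx_ring {
    unsigned int next_to_use;   /* descriptors posted */
    unsigned int tail_writes;   /* doorbell writes issued */
};

static void post_desc(struct tx_ring *r)  { r->next_to_use++; }
static void write_tail(struct tx_ring *r) { r->tail_writes++; }

/* Per-packet doorbell: one MMIO write per frame. */
static void xmit_per_packet(struct tx_ring *r, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        post_desc(r);
        write_tail(r);
    }
}

/* Bundled: post a burst of descriptors, bump the tail once. */
static void xmit_bundled(struct tx_ring *r, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        post_desc(r);
    write_tail(r);
}
```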


> 
>> Existing stack with skb is perfectly fine as it is.
>> No need to do recycling, batching or any other complex things.
>> xdp for e1k cannot be used as an example for other drivers either,
>> since there is only one tx ring and any high performance adapter
>> has more which makes the driver support quite different.
>>



Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Tom Herbert
On Fri, Sep 9, 2016 at 6:40 PM, Alexei Starovoitov
 wrote:
> On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend  
>> wrote:
>> > On 16-09-09 06:04 PM, Tom Herbert wrote:
>> >> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  
>> >> wrote:
>> >>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>>  On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend 
>>   wrote:
>> > e1000 supports a single TX queue so it is being shared with the stack
>> > when XDP runs XDP_TX action. This requires taking the xmit lock to
>> > ensure we don't corrupt the tx ring. To avoid taking and dropping the
>> > lock per packet this patch adds a bundling implementation to submit
>> > a bundle of packets to the xmit routine.
>> >
>> > I tested this patch running e1000 in a VM using KVM over a tap
>> > device using pktgen to generate traffic along with 'ping -f -l 100'.
>> >
>>  Hi John,
>> 
>>  How does this interact with BQL on e1000?
>> 
>>  Tom
>> 
>> >>>
>> >>> Let me check if I have the API correct. When we enqueue a packet to
>> >>> be sent we must issue a netdev_sent_queue() call and then on actual
>> >>> transmission issue a netdev_completed_queue().
>> >>>
>> >>> The patch attached here missed a few things though.
>> >>>
>> >>> But it looks like I just need to call netdev_sent_queue() from the
>> >>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>> >>> kick in which will call netdev_completed_queue() correctly.
>> >>>
>> >>> I'll need to add a check for the queue state as well. So if I do these
>> >>> three things,
>> >>>
>> >>> check __QUEUE_STATE_XOFF before sending
>> >>> netdev_sent_queue() -> on XDP_TX
>> >>> netdev_completed_queue()
>> >>>
>> >>> It should work, agree? Now should we do this even when XDP owns the
>> >>> queue? Or is this purely an issue with sharing the queue between
>> >>> XDP and stack.
>> >>>
>> >> But what is the action for XDP_TX if the queue is stopped? There is no
>> >> qdisc to back pressure in the XDP path. Would we just start dropping
>> >> packets then?
>> >
>> > Yep, that is what the patch does: if there is any sort of error, packets
>> > get dropped on the floor. I don't think there is anything else that
>> > can be done.
>> >
>> That probably means that the stack will always win out under load.
>> Trying to use the same queue where half of the packets are well
>> managed by a qdisc and half aren't is going to leave someone unhappy.
>> Maybe in this case where we have to share the qdisc we can
>> allocate the skb on returning XDP_TX and send through the normal
>> qdisc for the device.
>
> I wouldn't go to such extremes for e1k.
> The only reason to have xdp in e1k is to use it for testing
> of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.

I imagine someone may want this for the non-forwarding use cases like
early drop for DOS mitigation. Regardless of the use case, I don't
think we can break the fundamental assumptions made for qdiscs or the
rest of the transmit path. If XDP must transmit on a queue shared with
the stack we need to abide by the stack's rules for transmitting on
the queue-- which would mean alloc skbuff and go through qdisc (which
really shouldn't be difficult to implement). Emulating various
functions of the stack in the XDP TX path, like this patch seems to be
doing for XMIT_MORE, potentially gets us into a whack-a-mole situation
trying to keep things coherent.

> Existing stack with skb is perfectly fine as it is.
> No need to do recycling, batching or any other complex things.
> xdp for e1k cannot be used as an example for other drivers either,
> since there is only one tx ring and any high performance adapter
> has more which makes the driver support quite different.
>


Re: [PATCH net-next 0/4] Some BPF helper cleanups

2016-09-09 Thread David Miller
From: Daniel Borkmann 
Date: Fri,  9 Sep 2016 02:45:27 +0200

> This series contains a couple of misc cleanups and improvements
> for BPF helpers. For details please see individual patches. We
> let this also sit for a few days with Fengguang's kbuild test
> robot, and there were no issues seen (besides one false positive,
> see last one for details).

Series applied, thanks Daniel.


Re: [PATCH net-next] ip_tunnel: do not clear l4 hashes

2016-09-09 Thread David Miller
From: Eric Dumazet 
Date: Thu, 08 Sep 2016 15:40:48 -0700

> From: Eric Dumazet 
> 
> If skb has a valid l4 hash, there is no point clearing hash and force
> a further flow dissection when a tunnel encapsulation is added.
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH] drivers: net: phy: mdio-xgene: Add hardware dependency

2016-09-09 Thread David Miller
From: Jean Delvare 
Date: Thu, 8 Sep 2016 16:25:15 +0200

> The mdio-xgene driver is only useful on X-Gene SoC.
> 
> Signed-off-by: Jean Delvare 

Applied.


Re: [PATCH] ATM-ForeRunnerHE: Use kmalloc_array() in he_init_group()

2016-09-09 Thread David Miller
From: SF Markus Elfring 
Date: Thu, 8 Sep 2016 15:50:05 +0200

> From: Markus Elfring 
> Date: Thu, 8 Sep 2016 15:43:37 +0200
> 
> * Multiplications for the size determination of memory allocations
>   indicated that array data structures should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of data types by pointer dereferences
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 

Applied.


Re: [PATCH] ATM-ENI: Use kmalloc_array() in eni_start()

2016-09-09 Thread David Miller
From: SF Markus Elfring 
Date: Thu, 8 Sep 2016 14:40:06 +0200

> From: Markus Elfring 
> Date: Thu, 8 Sep 2016 14:20:17 +0200
> 
> * A multiplication for the size determination of a memory allocation
>   indicated that an array data structure should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of a data structure by a pointer dereference
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 

Applied to net-next.


Re: [PATCH net-next 0/7] rxrpc: Rewrite data and ack handling

2016-09-09 Thread David Miller
From: David Howells 
Date: Thu, 08 Sep 2016 12:43:28 +0100

> This patch set constitutes the main portion of the AF_RXRPC rewrite.  It
> consists of five fix/helper patches:
 ...
> And then there are two patches that form the main part:
 ...
> With this, the majority of the AF_RXRPC rewrite is complete.
 ...
> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160908

Pulled. However, I personally would have tried to split patch #7 up a bit;
it was really huge and hard to audit/review in any meaningful way.


Re: pull-request: wireless-drivers 2016-09-08

2016-09-09 Thread David Miller
From: Kalle Valo 
Date: Thu, 08 Sep 2016 14:31:56 +0300

> The following changes since commit bb87f02b7e4ccdb614a83cbf840524de81e9b321:
> 
>   Merge ath-current from ath.git (2016-08-29 21:39:04 +0300)
> 
> are available in the git repository at:
> 
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git 
> tags/wireless-drivers-for-davem-2016-09-08

Pulled, thanks Kalle.


Re: [PATCH net] dwc_eth_qos: do not register semi-initialized device

2016-09-09 Thread David Miller
From: Lars Persson 
Date: Thu,  8 Sep 2016 13:24:21 +0200

> We move register_netdev() to the end of dwceqos_probe() to close any
> races where the netdev callbacks are called before the initialization
> has finished.
> 
> Reported-by: Pavel Andrianov 
> Signed-off-by: Lars Persson 

Applied.


Re: [PATCH net] sctp: identify chunks that need to be fragmented at IP level

2016-09-09 Thread David Miller
From: Xin Long 
Date: Thu,  8 Sep 2016 17:54:11 +0800

> From: Marcelo Ricardo Leitner 
> 
> Previously, without GSO, it was easy to identify it: if the chunk didn't
> fit and there was no data chunk in the packet yet, we could fragment at
> IP level. So if there was an auth chunk and we were bundling a big data
> chunk, it would fragment regardless of the size of the auth chunk. This
> also works for the context of PMTU reductions.
> 
> But with GSO, we cannot distinguish such PMTU events anymore, as the
> packet is allowed to exceed PMTU.
> 
> So we need another check: to ensure that the chunk that we are adding,
> actually fits the current PMTU. If it doesn't, trigger a flush and let
> it be fragmented at IP level in the next round.
> 
> Signed-off-by: Marcelo Ricardo Leitner 

Applied.
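[Editor's note: the flush decision described in Marcelo's commit message above can be sketched as a toy model — this is not the actual SCTP code, and the `packet` struct here is hypothetical.]

```c
#include <assert.h>

/* Toy model of the bundling decision: a chunk that would push the
 * packet past the PMTU triggers a flush first, so IP-level
 * fragmentation only ever sees the oversized chunk on its own. */
struct packet {
    unsigned int len;     /* bytes bundled so far */
    unsigned int pmtu;
    int flushes;          /* packets sent out early */
};

static void flush(struct packet *p) { p->flushes++; p->len = 0; }

static void append_chunk(struct packet *p, unsigned int chunk_len)
{
    if (p->len + chunk_len > p->pmtu && p->len > 0)
        flush(p);             /* send what we have first */
    p->len += chunk_len;      /* may exceed PMTU when alone:
                                 IP fragments it in the next round */
}
```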


Re: [PATCH net] sctp: hold the transport before using it in sctp_hash_cmp

2016-09-09 Thread David Miller
From: Xin Long 
Date: Thu,  8 Sep 2016 17:49:04 +0800

> Now sctp uses the transport without holding it in sctp_hash_cmp,
> which can cause a use-after-free panic: after it gets the transport
> from the hashtable, another CPU may free it, so the members it
> accesses may be freed memory.
> 
> This patch uses sctp_transport_hold(), which checks the refcnt
> first and takes a hold only if it is not 0.
> 
> Signed-off-by: Xin Long 

Please add more detail to the commit message and add a proper
"Fixes: " tag right before your signoff.

Thanks.
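[Editor's note: the hold-only-if-nonzero pattern the commit message describes can be sketched as follows. This is a single-threaded toy; the `transport` struct and `transport_hold()` are hypothetical stand-ins, and the kernel does the check-and-increment atomically.]

```c
#include <assert.h>

/* A refcnt of 0 means the object is already being freed, so a
 * concurrent lookup must fail instead of taking a reference. */
struct transport { int refcnt; };

static int transport_hold(struct transport *t)
{
    if (t->refcnt == 0)
        return 0;       /* being freed: lookup must fail */
    t->refcnt++;
    return 1;
}
```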



Re: [PATCH] softirq: fix tasklet_kill() and its users

2016-09-09 Thread Santosh Shilimkar

Ping !!

On 8/24/2016 6:52 PM, Santosh Shilimkar wrote:

Semantically the expectation from the tasklet init/kill API
should be as below.

tasklet_init() == Init and Enable scheduling
tasklet_kill() == Disable scheduling and Destroy

The tasklet_init() API exhibits the above behavior, but
tasklet_kill() does not. The tasklet handler can still get scheduled
and run even after tasklet_kill().

There are two or three places where drivers are working around
this issue by calling tasklet_disable(), which adds a
usecount and thereby avoids the handler being called.

tasklet_enable()/tasklet_disable() is a paired API and expected
to be used together. Using tasklet_disable() *just* to
work around tasklet scheduling after kill is probably not the
correct and intended use of the API.
We also happened to see a similar issue where in the shutdown path
the tasklet handler was getting called even after
tasklet_kill().

We fix this by making sure tasklet_kill() does the right
thing, thereby ensuring the tasklet handler won't run after
tasklet_kill(), with a very simple change. The patch fixes the
tasklet code and also a few driver workarounds.
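[Editor's note: the intended semantics can be sketched with a toy userspace model. The `_m`-suffixed names are hypothetical stand-ins for the kernel API; this is not the actual softirq.c implementation, which uses atomic state bits.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the semantics described above: after
 * tasklet_kill_m(), a stray tasklet_schedule_m() must not
 * cause the handler to run again. */
struct tasklet {
    bool scheduled;
    bool killed;
    int runs;       /* how many times the handler ran */
};

static void tasklet_init_m(struct tasklet *t)
{
    t->scheduled = t->killed = false;
    t->runs = 0;
}

static void tasklet_schedule_m(struct tasklet *t) { t->scheduled = true; }

static void tasklet_kill_m(struct tasklet *t)
{
    t->killed = true;       /* disable scheduling and destroy */
    t->scheduled = false;
}

/* What the softirq action should do: skip killed tasklets. */
static void tasklet_action_m(struct tasklet *t)
{
    if (t->scheduled && !t->killed) {
        t->scheduled = false;
        t->runs++;
    }
}
```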

Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Tadeusz Struk 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Paul Bolle 
Cc: Giovanni Cabiddu 
Cc: Salvatore Benedetto 
Cc: Karsten Keil 
Cc: "Peter Zijlstra (Intel)" 

Signed-off-by: Santosh Shilimkar 
---
Removed RFC tag from last post and dropped atmel serial
driver which seems to have been fixed in 4.8

https://lkml.org/lkml/2016/8/7/7

 drivers/crypto/qat/qat_common/adf_isr.c| 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c   | 1 -
 kernel/softirq.c   | 7 ---
 5 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c 
b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
int i;

for (i = 0; i < hw_data->num_banks; i++) {
-   tasklet_disable(&priv_data->banks[i].resp_handler);
tasklet_kill(&priv_data->banks[i].resp_handler);
}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c 
b/drivers/crypto/qat/qat_common/adf_sriov.c
index 9320ae1..bc7c2fa 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
}

for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-   tasklet_disable(&vf->vf2pf_bh_tasklet);
tasklet_kill(&vf->vf2pf_bh_tasklet);
mutex_destroy(&vf->pf2vf_lock);
}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c 
b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index bf99e11..6e38bff 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev 
*accel_dev)

 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-   tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
struct adf_etr_data *priv_data = accel_dev->transport;

-   tasklet_disable(&priv_data->banks[0].resp_handler);
tasklet_kill(&priv_data->banks[0].resp_handler);
 }

diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
if (!drv->have_tty)
return;

-   tasklet_disable(&cs->if_wake_tasklet);
tasklet_kill(&cs->if_wake_tasklet);
cs->tty_dev = NULL;
tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
list = list->next;

if (tasklet_trylock(t)) {
-   if (!atomic_read(&t->count)) {
+   if (atomic_read(&t->count) == 1) {
if (!test_and_clear_bit(TASKLET_STATE_SCHED,
&t->state))
  

Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Alexei Starovoitov
On Fri, Sep 09, 2016 at 06:19:56PM -0700, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend  
> wrote:
> > On 16-09-09 06:04 PM, Tom Herbert wrote:
> >> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  
> >> wrote:
> >>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>  On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend 
>   wrote:
> > e1000 supports a single TX queue so it is being shared with the stack
> > when XDP runs XDP_TX action. This requires taking the xmit lock to
> > ensure we don't corrupt the tx ring. To avoid taking and dropping the
> > lock per packet this patch adds a bundling implementation to submit
> > a bundle of packets to the xmit routine.
> >
> > I tested this patch running e1000 in a VM using KVM over a tap
> > device using pktgen to generate traffic along with 'ping -f -l 100'.
> >
>  Hi John,
> 
>  How does this interact with BQL on e1000?
> 
>  Tom
> 
> >>>
> >>> Let me check if I have the API correct. When we enqueue a packet to
> >>> be sent we must issue a netdev_sent_queue() call and then on actual
> >>> transmission issue a netdev_completed_queue().
> >>>
> >>> The patch attached here missed a few things though.
> >>>
> >>> But it looks like I just need to call netdev_sent_queue() from the
> >>> e1000_xmit_raw_frame() routine and then let the tx completion logic
> >>> kick in which will call netdev_completed_queue() correctly.
> >>>
> >>> I'll need to add a check for the queue state as well. So if I do these
> >>> three things,
> >>>
> >>> check __QUEUE_STATE_XOFF before sending
> >>> netdev_sent_queue() -> on XDP_TX
> >>> netdev_completed_queue()
> >>>
> >>> It should work, agree? Now should we do this even when XDP owns the
> >>> queue? Or is this purely an issue with sharing the queue between
> >>> XDP and stack.
> >>>
> >> But what is the action for XDP_TX if the queue is stopped? There is no
> >> qdisc to back pressure in the XDP path. Would we just start dropping
> >> packets then?
> >
> > Yep, that is what the patch does: if there is any sort of error, packets
> > get dropped on the floor. I don't think there is anything else that
> > can be done.
> >
> That probably means that the stack will always win out under load.
> Trying to use the same queue where half of the packets are well
> managed by a qdisc and half aren't is going to leave someone unhappy.
> Maybe in this case where we have to share the qdisc we can
> allocate the skb on returning XDP_TX and send through the normal
> qdisc for the device.

I wouldn't go to such extremes for e1k.
The only reason to have xdp in e1k is to use it for testing
of xdp programs. Nothing else. e1k is, best case, 1Gbps adapter.
Existing stack with skb is perfectly fine as it is.
No need to do recycling, batching or any other complex things.
xdp for e1k cannot be used as an example for other drivers either,
since there is only one tx ring and any high performance adapter
has more which makes the driver support quite different.



Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Tom Herbert
On Fri, Sep 9, 2016 at 6:12 PM, John Fastabend  wrote:
> On 16-09-09 06:04 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  
>> wrote:
>>> On 16-09-09 04:44 PM, Tom Herbert wrote:
 On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend  
 wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
 Hi John,

 How does this interact with BQL on e1000?

 Tom

>>>
>>> Let me check if I have the API correct. When we enqueue a packet to
>>> be sent we must issue a netdev_sent_queue() call and then on actual
>>> transmission issue a netdev_completed_queue().
>>>
>>> The patch attached here missed a few things though.
>>>
>>> But it looks like I just need to call netdev_sent_queue() from the
>>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>>> kick in which will call netdev_completed_queue() correctly.
>>>
>>> I'll need to add a check for the queue state as well. So if I do these
>>> three things,
>>>
>>> check __QUEUE_STATE_XOFF before sending
>>> netdev_sent_queue() -> on XDP_TX
>>> netdev_completed_queue()
>>>
>>> It should work, agree? Now should we do this even when XDP owns the
>>> queue? Or is this purely an issue with sharing the queue between
>>> XDP and stack.
>>>
>> But what is the action for XDP_TX if the queue is stopped? There is no
>> qdisc to back pressure in the XDP path. Would we just start dropping
>> packets then?
>
> Yep, that is what the patch does: if there is any sort of error, packets
> get dropped on the floor. I don't think there is anything else that
> can be done.
>
That probably means that the stack will always win out under load.
Trying to use the same queue where half of the packets are well
managed by a qdisc and half aren't is going to leave someone unhappy.
Maybe in this case where we have to share the qdisc we can
allocate the skb on returning XDP_TX and send through the normal
qdisc for the device.

Tom

>>
>> Tom
>>
>>> .John
>>>
>


Re: [PATCH] via-velocity: remove null pointer check on array tdinfo->skb_dma

2016-09-09 Thread David Miller
From: Colin King 
Date: Thu,  8 Sep 2016 10:04:24 +0100

> From: Colin Ian King 
> 
> tdinfo->skb_dma is a 7 element array of dma_addr_t and hence cannot be
> null, so the null pointer check on tdinfo->skb_dma is redundant.
> Remove it.
> 
> Signed-off-by: Colin Ian King 

Applied, thanks Colin.


Re: [PATCH] qede: mark qede_set_features() static

2016-09-09 Thread David Miller
From: Baoyou Xie 
Date: Thu,  8 Sep 2016 16:43:23 +0800

> We get 1 warning when building kernel with W=1:
> drivers/net/ethernet/qlogic/qede/qede_main.c:2113:5: warning: no previous 
> prototype for 'qede_set_features' [-Wmissing-prototypes]
> 
> In fact, this function is only used in the file in which it is
> declared and doesn't need a declaration, so it can be made static.
> This patch marks the function 'static'.
> 
> Signed-off-by: Baoyou Xie 

Applied.



Re: [PATCH net-next 1/1] net: phy: Fixed checkpatch errors for Microsemi PHYs.

2016-09-09 Thread David Miller
From: Raju Lakkaraju 
Date: Thu, 8 Sep 2016 14:09:31 +0530

> From: Raju Lakkaraju 
> 
> The existing VSC85xx PHY driver did not follow the coding style and caused 
> "checkpatch" to complain. This commit fixes this.
> 
> Signed-off-by: Raju Lakkaraju 

Applied.


Re: [PATCH] net: x25: remove null checks on arrays calling_ae and called_ae

2016-09-09 Thread David Miller
From: Colin King 
Date: Thu,  8 Sep 2016 08:42:06 +0100

> From: Colin Ian King 
> 
> dtefacs.calling_ae and called_ae are both 20 element __u8 arrays and
> cannot be null, hence the null checks are redundant. Remove them.
> 
> Signed-off-by: Colin Ian King 

Indeed, and if they were pointers they would be in userspace and would
need proper uaccess handling.

Applied to net-next, thanks.


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread John Fastabend
On 16-09-09 06:04 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  
> wrote:
>> On 16-09-09 04:44 PM, Tom Herbert wrote:
>>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend  
>>> wrote:
 e1000 supports a single TX queue so it is being shared with the stack
 when XDP runs XDP_TX action. This requires taking the xmit lock to
 ensure we don't corrupt the tx ring. To avoid taking and dropping the
 lock per packet this patch adds a bundling implementation to submit
 a bundle of packets to the xmit routine.

 I tested this patch running e1000 in a VM using KVM over a tap
 device using pktgen to generate traffic along with 'ping -f -l 100'.

>>> Hi John,
>>>
>>> How does this interact with BQL on e1000?
>>>
>>> Tom
>>>
>>
>> Let me check if I have the API correct. When we enqueue a packet to
>> be sent we must issue a netdev_sent_queue() call and then on actual
>> transmission issue a netdev_completed_queue().
>>
>> The patch attached here missed a few things though.
>>
>> But it looks like I just need to call netdev_sent_queue() from the
>> e1000_xmit_raw_frame() routine and then let the tx completion logic
>> kick in which will call netdev_completed_queue() correctly.
>>
>> I'll need to add a check for the queue state as well. So if I do these
>> three things,
>>
>> check __QUEUE_STATE_XOFF before sending
>> netdev_sent_queue() -> on XDP_TX
>> netdev_completed_queue()
>>
>> It should work, agree? Now should we do this even when XDP owns the
>> queue? Or is this purely an issue with sharing the queue between
>> XDP and stack.
>>
> But what is the action for XDP_TX if the queue is stopped? There is no
> qdisc to back pressure in the XDP path. Would we just start dropping
> packets then?

Yep, that's what the patch does: if there is any sort of error, packets
get dropped on the floor. I don't think there is anything else that
can be done.

> 
> Tom
> 
>> .John
>>
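
The optimization the patch is after — take the shared xmit lock once per
bundle instead of once per packet — can be sketched in plain userspace C.
All names here are hypothetical; the real patch uses HARD_TX_LOCK and the
e1000 tx ring rather than a pthread mutex.

```c
#include <pthread.h>
#include <stddef.h>

#define BUNDLE_MAX 16

struct pkt { const void *data; size_t len; };

static pthread_mutex_t xmit_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_acquisitions;
static unsigned long frames_sent;

/* Queue one frame on the (already locked) ring. */
static void xmit_one(const struct pkt *p)
{
	(void)p;
	frames_sent++;
}

/* Take the shared xmit lock once for the whole bundle rather than
 * acquiring and dropping it for every packet. */
static void xmit_bundle(const struct pkt *bundle, size_t n)
{
	size_t i;

	pthread_mutex_lock(&xmit_lock);
	lock_acquisitions++;
	for (i = 0; i < n && i < BUNDLE_MAX; i++)
		xmit_one(&bundle[i]);
	pthread_mutex_unlock(&xmit_lock);
}
```

The lock is still needed because the single TX queue is shared with the
stack; bundling only amortizes its cost across up to BUNDLE_MAX frames.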



Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Tom Herbert
On Fri, Sep 9, 2016 at 5:01 PM, John Fastabend  wrote:
> On 16-09-09 04:44 PM, Tom Herbert wrote:
>> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend  
>> wrote:
>>> e1000 supports a single TX queue so it is being shared with the stack
>>> when XDP runs XDP_TX action. This requires taking the xmit lock to
>>> ensure we don't corrupt the tx ring. To avoid taking and dropping the
>>> lock per packet this patch adds a bundling implementation to submit
>>> a bundle of packets to the xmit routine.
>>>
>>> I tested this patch running e1000 in a VM using KVM over a tap
>>> device using pktgen to generate traffic along with 'ping -f -l 100'.
>>>
>> Hi John,
>>
>> How does this interact with BQL on e1000?
>>
>> Tom
>>
>
> Let me check if I have the API correct. When we enqueue a packet to
> be sent we must issue a netdev_sent_queue() call and then on actual
> transmission issue a netdev_completed_queue().
>
> The patch attached here missed a few things though.
>
> But it looks like I just need to call netdev_sent_queue() from the
> e1000_xmit_raw_frame() routine and then let the tx completion logic
> kick in which will call netdev_completed_queue() correctly.
>
> I'll need to add a check for the queue state as well. So if I do these
> three things,
>
> check __QUEUE_STATE_XOFF before sending
> netdev_sent_queue() -> on XDP_TX
> netdev_completed_queue()
>
> It should work, agree? Now should we do this even when XDP owns the
> queue? Or is this purely an issue with sharing the queue between
> XDP and stack.
>
But what is the action for XDP_TX if the queue is stopped? There is no
qdisc to back pressure in the XDP path. Would we just start dropping
packets then?

Tom

> .John
>


[PATCH -next] tipc: fix possible memory leak in tipc_udp_enable()

2016-09-09 Thread Wei Yongjun
From: Wei Yongjun 

'ub' is malloced in tipc_udp_enable() and should be freed before
returning from the error handling cases; otherwise it will cause a
memory leak.

Fixes: ba5aa84a2d22 ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun 
---
 net/tipc/udp_media.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index dd27468..d80cd3f 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -665,7 +665,8 @@ static int tipc_udp_enable(struct net *net, struct 
tipc_bearer *b,
 
if (!opts[TIPC_NLA_UDP_LOCAL] || !opts[TIPC_NLA_UDP_REMOTE]) {
pr_err("Invalid UDP bearer configuration");
-   return -EINVAL;
+   err = -EINVAL;
+   goto err;
}
 
err = tipc_parse_udp_addr(opts[TIPC_NLA_UDP_LOCAL], ,

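The fix follows the usual kernel goto-unwind idiom: once 'ub' is
allocated, every failure path jumps to a label that releases it instead
of returning directly. A standalone userspace sketch of the idiom, with
hypothetical names (not the actual tipc code):

```c
#include <errno.h>
#include <stdlib.h>

struct bearer { int cfg; };

/* Goto-unwind idiom: after the allocation succeeds, error paths must
 * jump to the cleanup label that frees 'ub'. A direct "return -EINVAL"
 * here is exactly the kind of leak the tipc patch fixes. */
static int enable_bearer(int valid_cfg, struct bearer **out)
{
	struct bearer *ub;
	int err;

	ub = calloc(1, sizeof(*ub));
	if (!ub)
		return -ENOMEM;

	if (!valid_cfg) {
		err = -EINVAL;
		goto err;	/* frees ub instead of leaking it */
	}

	*out = ub;
	return 0;
err:
	free(ub);
	return err;
}
```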


Re: [PATCH 8/9] selftests: move vDSO tests from Documentation/vDSO

2016-09-09 Thread kbuild test robot
Hi Shuah,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160909]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:44: Documentation/vDSO/Makefile: No such file or 
>> directory
>> make[3]: *** No rule to make target 'Documentation/vDSO/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/vDSO/Makefile'.

vim +44 scripts/Makefile.build

f77bf0142 Sam Ravnborg 2007-10-15  28  ldflags-y  :=
d72e5edbf Sam Ravnborg 2007-05-28  29  
720097d89 Sam Ravnborg 2009-04-19  30  subdir-asflags-y :=
720097d89 Sam Ravnborg 2009-04-19  31  subdir-ccflags-y :=
720097d89 Sam Ravnborg 2009-04-19  32  
3156fd052 Robert P. J. Day 2008-02-18  33  # Read auto.conf if it exists, 
otherwise ignore
c955ccafc Roman Zippel 2006-06-08  34  -include include/config/auto.conf
^1da177e4 Linus Torvalds   2005-04-16  35  
20a468b51 Sam Ravnborg 2006-01-22  36  include scripts/Kbuild.include
20a468b51 Sam Ravnborg 2006-01-22  37  
3156fd052 Robert P. J. Day 2008-02-18  38  # For backward compatibility check 
that these variables do not change
0c53c8e6e Sam Ravnborg 2007-10-14  39  save-cflags := $(CFLAGS)
0c53c8e6e Sam Ravnborg 2007-10-14  40  
2a6914703 Sam Ravnborg 2005-07-25  41  # The filename Kbuild has precedence 
over Makefile
db8c1a7b2 Sam Ravnborg 2005-07-27  42  kbuild-dir := $(if $(filter 
/%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg 2007-10-14  43  kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg 2007-10-14 @44  include $(kbuild-file)
^1da177e4 Linus Torvalds   2005-04-16  45  
0c53c8e6e Sam Ravnborg 2007-10-14  46  # If the save-* variables changed 
error out
0c53c8e6e Sam Ravnborg 2007-10-14  47  ifeq ($(KBUILD_NOPEDANTIC),)
0c53c8e6e Sam Ravnborg 2007-10-14  48  ifneq 
("$(save-cflags)","$(CFLAGS)")
49c57d254 Arnaud Lacombe   2011-08-15  49  $(error CFLAGS was 
changed in "$(kbuild-file)". Fix it to use ccflags-y)
0c53c8e6e Sam Ravnborg 2007-10-14  50  endif
0c53c8e6e Sam Ravnborg 2007-10-14  51  endif
4a5838ad9 Borislav Petkov  2011-03-01  52  

:: The code at line 44 was first introduced by commit
:: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of 
CFLAGS

:: TO: Sam Ravnborg <sam@neptun.(none)>
:: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH 6/9] selftests: move ptp tests from Documentation/ptp

2016-09-09 Thread kbuild test robot
Hi Shuah,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160909]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:44: Documentation/ptp/Makefile: No such file or 
>> directory
>> make[3]: *** No rule to make target 'Documentation/ptp/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/ptp/Makefile'.

vim +44 scripts/Makefile.build

3156fd052 Robert P. J. Day 2008-02-18  38  # For backward compatibility check 
that these variables do not change
0c53c8e6e Sam Ravnborg 2007-10-14  39  save-cflags := $(CFLAGS)
0c53c8e6e Sam Ravnborg 2007-10-14  40  
2a6914703 Sam Ravnborg 2005-07-25  41  # The filename Kbuild has precedence 
over Makefile
db8c1a7b2 Sam Ravnborg 2005-07-27  42  kbuild-dir := $(if $(filter 
/%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg 2007-10-14  43  kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg 2007-10-14 @44  include $(kbuild-file)
^1da177e4 Linus Torvalds   2005-04-16  45  
0c53c8e6e Sam Ravnborg 2007-10-14  46  # If the save-* variables changed 
error out
0c53c8e6e Sam Ravnborg 2007-10-14  47  ifeq ($(KBUILD_NOPEDANTIC),)

:: The code at line 44 was first introduced by commit
:: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of 
CFLAGS

:: TO: Sam Ravnborg <sam@neptun.(none)>
:: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH 4/9] selftests: move prctl tests from Documentation/prctl

2016-09-09 Thread kbuild test robot
Hi Shuah,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160909]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:44: Documentation/prctl/Makefile: No such file or 
>> directory
   make[3]: *** No rule to make target 'Documentation/prctl/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/prctl/Makefile'.

vim +44 scripts/Makefile.build

3156fd052 Robert P. J. Day 2008-02-18  38  # For backward compatibility check 
that these variables do not change
0c53c8e6e Sam Ravnborg 2007-10-14  39  save-cflags := $(CFLAGS)
0c53c8e6e Sam Ravnborg 2007-10-14  40  
2a6914703 Sam Ravnborg 2005-07-25  41  # The filename Kbuild has precedence 
over Makefile
db8c1a7b2 Sam Ravnborg 2005-07-27  42  kbuild-dir := $(if $(filter 
/%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg 2007-10-14  43  kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg 2007-10-14 @44  include $(kbuild-file)
^1da177e4 Linus Torvalds   2005-04-16  45  
0c53c8e6e Sam Ravnborg 2007-10-14  46  # If the save-* variables changed 
error out
0c53c8e6e Sam Ravnborg 2007-10-14  47  ifeq ($(KBUILD_NOPEDANTIC),)

:: The code at line 44 was first introduced by commit
:: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of 
CFLAGS

:: TO: Sam Ravnborg <sam@neptun.(none)>
:: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH 1/9] selftests: move dnotify_test from Documentation/filesystems

2016-09-09 Thread kbuild test robot
Hi Shuah,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160909]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:44: Documentation/filesystems/Makefile: No such file 
>> or directory
>> make[3]: *** No rule to make target 'Documentation/filesystems/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/filesystems/Makefile'.

vim +44 scripts/Makefile.build

f77bf0142 Sam Ravnborg 2007-10-15  28  ldflags-y  :=
d72e5edbf Sam Ravnborg 2007-05-28  29  
720097d89 Sam Ravnborg 2009-04-19  30  subdir-asflags-y :=
720097d89 Sam Ravnborg 2009-04-19  31  subdir-ccflags-y :=
720097d89 Sam Ravnborg 2009-04-19  32  
3156fd052 Robert P. J. Day 2008-02-18  33  # Read auto.conf if it exists, 
otherwise ignore
c955ccafc Roman Zippel 2006-06-08  34  -include include/config/auto.conf
^1da177e4 Linus Torvalds   2005-04-16  35  
20a468b51 Sam Ravnborg 2006-01-22  36  include scripts/Kbuild.include
20a468b51 Sam Ravnborg 2006-01-22  37  
3156fd052 Robert P. J. Day 2008-02-18  38  # For backward compatibility check 
that these variables do not change
0c53c8e6e Sam Ravnborg 2007-10-14  39  save-cflags := $(CFLAGS)
0c53c8e6e Sam Ravnborg 2007-10-14  40  
2a6914703 Sam Ravnborg 2005-07-25  41  # The filename Kbuild has precedence 
over Makefile
db8c1a7b2 Sam Ravnborg 2005-07-27  42  kbuild-dir := $(if $(filter 
/%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg 2007-10-14  43  kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg 2007-10-14 @44  include $(kbuild-file)
^1da177e4 Linus Torvalds   2005-04-16  45  
0c53c8e6e Sam Ravnborg 2007-10-14  46  # If the save-* variables changed 
error out
0c53c8e6e Sam Ravnborg 2007-10-14  47  ifeq ($(KBUILD_NOPEDANTIC),)
0c53c8e6e Sam Ravnborg 2007-10-14  48  ifneq 
("$(save-cflags)","$(CFLAGS)")
49c57d254 Arnaud Lacombe   2011-08-15  49  $(error CFLAGS was 
changed in "$(kbuild-file)". Fix it to use ccflags-y)
0c53c8e6e Sam Ravnborg 2007-10-14  50  endif
0c53c8e6e Sam Ravnborg 2007-10-14  51  endif
4a5838ad9 Borislav Petkov  2011-03-01  52  

:: The code at line 44 was first introduced by commit
:: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of 
CFLAGS

:: TO: Sam Ravnborg <sam@neptun.(none)>
:: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread John Fastabend
On 16-09-09 04:44 PM, Tom Herbert wrote:
> On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend  
> wrote:
>> e1000 supports a single TX queue so it is being shared with the stack
>> when XDP runs XDP_TX action. This requires taking the xmit lock to
>> ensure we don't corrupt the tx ring. To avoid taking and dropping the
>> lock per packet this patch adds a bundling implementation to submit
>> a bundle of packets to the xmit routine.
>>
>> I tested this patch running e1000 in a VM using KVM over a tap
>> device using pktgen to generate traffic along with 'ping -f -l 100'.
>>
> Hi John,
> 
> How does this interact with BQL on e1000?
> 
> Tom
> 

Let me check if I have the API correct. When we enqueue a packet to
be sent we must issue a netdev_sent_queue() call and then on actual
transmission issue a netdev_completed_queue().

The patch attached here missed a few things though.

But it looks like I just need to call netdev_sent_queue() from the
e1000_xmit_raw_frame() routine and then let the tx completion logic
kick in which will call netdev_completed_queue() correctly.

I'll need to add a check for the queue state as well. So if I do these
three things,

check __QUEUE_STATE_XOFF before sending
netdev_sent_queue() -> on XDP_TX
netdev_completed_queue()

It should work, agree? Now should we do this even when XDP owns the
queue? Or is this purely an issue with sharing the queue between
XDP and stack.

.John
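
The three steps above map onto BQL bookkeeping roughly as follows. This
is a toy userspace model with hypothetical names; the real driver calls
are netdev_sent_queue()/netdev_completed_queue() plus the queue-state
check, and the limit is adjusted dynamically by BQL rather than fixed:

```c
#include <stdbool.h>

/* Toy model of the discussed BQL interaction: track in-flight bytes,
 * stop the queue at a limit, and have the XDP_TX path drop when the
 * queue is stopped (there is no qdisc in the XDP path to backpressure). */
struct txq {
	unsigned int inflight;
	unsigned int limit;
	bool stopped;	/* models __QUEUE_STATE_XOFF */
};

/* netdev_sent_queue() analogue: account bytes at enqueue time. */
static void sent_queue(struct txq *q, unsigned int bytes)
{
	q->inflight += bytes;
	if (q->inflight >= q->limit)
		q->stopped = true;
}

/* netdev_completed_queue() analogue: run from tx completion. */
static void completed_queue(struct txq *q, unsigned int bytes)
{
	q->inflight -= bytes;
	if (q->inflight < q->limit)
		q->stopped = false;
}

/* Returns true if sent, false if dropped. */
static bool xdp_tx(struct txq *q, unsigned int bytes)
{
	if (q->stopped)
		return false;	/* drop: nothing to push back on */
	sent_queue(q, bytes);
	return true;
}
```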



Re: [PATCH 4/9] selftests: move prctl tests from Documentation/prctl

2016-09-09 Thread kbuild test robot
Hi Shuah,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160909]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
convenience) to record what (public, well-known) commit your patch series was 
built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Shuah-Khan/Move-runnable-code-tests-from-Documentation-to-selftests/20160910-063538
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   scripts/Makefile.clean:14: Documentation/prctl/Makefile: No such file or 
directory
>> make[3]: *** No rule to make target 'Documentation/prctl/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/prctl/Makefile'.
   make[2]: *** [Documentation/prctl] Error 2
   scripts/Makefile.clean:14: Documentation/filesystems/Makefile: No such file 
or directory
   make[3]: *** No rule to make target 'Documentation/filesystems/Makefile'.
   make[3]: Failed to remake makefile 'Documentation/filesystems/Makefile'.
   make[2]: *** [Documentation/filesystems] Error 2
   make[2]: Target '__clean' not remade because of errors.
   make[1]: *** [_clean_Documentation] Error 2
   make[1]: Target 'distclean' not remade because of errors.
   make: *** [sub-make] Error 2

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [patch net 0/2] mlxsw: couple of fixes

2016-09-09 Thread David Miller
From: Jiri Pirko 
Date: Thu,  8 Sep 2016 08:16:00 +0200

> Couple of fixes from Ido and myself.

Series applied, thanks Jiri.


Re: [RFC] bridge: MAC learning uevents

2016-09-09 Thread Florian Fainelli
On 09/09/2016 01:51 AM, D. Herrendoerfer wrote:
>> just like neighbor table modifications, it should be possible to listen for
>> events with netlink. Doing it through uevent is the wrong model.
> 
> I agree partially - but consider:
> we plug hardware - we get an event
> we remove hardware - we get an event
> we add a virtual interface - we get an event
> we add a bridge - event
> we add an interface to that bridge - event 
> a kvm guest starts using the interface on that bridge - we need to monitor 
> netlink, poll brforward, capture traffic

Yes, because now there is network activity going on, so why not ask the
networking stack to get these events?

> 
> It seems inconsistent, bridge is already emitting events.

It does not seem particularly inconsistent, all networking events are
already emitted using rt netlink, why should bridge be different here?
(Yes, uevent is netlink too, just a special family).

-- 
Florian


Re: [PATCH net-next] macsec: set network devtype

2016-09-09 Thread David Miller
From: Stephen Hemminger 
Date: Wed, 7 Sep 2016 14:07:32 -0700

> The netdevice type structure for macsec was being defined but never used.
> To set the network device type the macro SET_NETDEV_DEVTYPE must be called.
> Compile tested only, I don't use macsec.
> 
> Signed-off-by: Stephen Hemminger 

Applied.


Re: [PATCH net-next] rtnetlink: remove unused ifla_stats_policy

2016-09-09 Thread David Miller
From: Stephen Hemminger 
Date: Wed, 7 Sep 2016 13:57:36 -0700

> This structure is defined but never used. Flagged with W=1
> 
> Signed-off-by: Stephen Hemminger 

Applied.


Re: [PATCH net 0/2] ip: fix creation flags reported in RTM_NEWROUTE events

2016-09-09 Thread David Miller
From: Guillaume Nault 
Date: Wed, 7 Sep 2016 17:18:50 +0200

> Netlink messages sent to user-space upon RTM_NEWROUTE events have their
> nlmsg_flags field inconsistently set. While the NLM_F_REPLACE and
> NLM_F_APPEND bits are correctly handled, NLM_F_CREATE and NLM_F_EXCL
> are always 0.
> 
> This series sets the NLM_F_CREATE and NLM_F_EXCL bits when applicable,
> for IPv4 and IPv6.
> 
> Since IPv6 ignores the NLM_F_APPEND flag in requests, this flag isn't
> reported in RTM_NEWROUTE IPv6 events. This keeps IPv6 internal
> consistency (same flag semantic for user requests and kernel events) at
> the cost of bringing different flag interpretation for IPv4 and IPv6.

I'm applying this series to net-next so that it has time to cook and
expose anything in userland that might break due to these changes.

I briefly considered applying this to net but I think that is
premature at least for the time being.

Thanks.


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread Tom Herbert
On Fri, Sep 9, 2016 at 2:29 PM, John Fastabend  wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
Hi John,

How does this interact with BQL on e1000?

Tom

> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h  |   10 +++
>  drivers/net/ethernet/intel/e1000/e1000_main.c |   81 
> +++--
>  2 files changed, 71 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h 
> b/drivers/net/ethernet/intel/e1000/e1000.h
> index 5cf8a0a..877b377 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -133,6 +133,8 @@ struct e1000_adapter;
>  #define E1000_TX_QUEUE_WAKE 16
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
> +/* How many XDP XMIT buffers to bundle into one xmit transaction */
> +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
>
>  #define AUTO_ALL_MODES 0
>  #define E1000_EEPROM_82544_APM 0x0004
> @@ -168,6 +170,11 @@ struct e1000_rx_buffer {
> dma_addr_t dma;
>  };
>
> +struct e1000_rx_buffer_bundle {
> +   struct e1000_rx_buffer *buffer;
> +   u32 length;
> +};
> +
>  struct e1000_tx_ring {
> /* pointer to the descriptor ring memory */
> void *desc;
> @@ -206,6 +213,9 @@ struct e1000_rx_ring {
> struct e1000_rx_buffer *buffer_info;
> struct sk_buff *rx_skb_top;
>
> +   /* array of XDP buffer information structs */
> +   struct e1000_rx_buffer_bundle *xdp_buffer;
> +
> /* cpu for rx queue */
> int cpu;
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 91d5c87..b985271 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -1738,10 +1738,18 @@ static int e1000_setup_rx_resources(struct 
> e1000_adapter *adapter,
> struct pci_dev *pdev = adapter->pdev;
> int size, desc_len;
>
> +   size = sizeof(struct e1000_rx_buffer_bundle) *
> +   E1000_XDP_XMIT_BUNDLE_MAX;
> +   rxdr->xdp_buffer = vzalloc(size);
> +   if (!rxdr->xdp_buffer)
> +   return -ENOMEM;
> +
> size = sizeof(struct e1000_rx_buffer) * rxdr->count;
> rxdr->buffer_info = vzalloc(size);
> -   if (!rxdr->buffer_info)
> +   if (!rxdr->buffer_info) {
> +   vfree(rxdr->xdp_buffer);
> return -ENOMEM;
> +   }
>
> desc_len = sizeof(struct e1000_rx_desc);
>
> @@ -1754,6 +1762,7 @@ static int e1000_setup_rx_resources(struct 
> e1000_adapter *adapter,
> GFP_KERNEL);
> if (!rxdr->desc) {
>  setup_rx_desc_die:
> +   vfree(rxdr->xdp_buffer);
> vfree(rxdr->buffer_info);
> return -ENOMEM;
> }
> @@ -2087,6 +2096,9 @@ static void e1000_free_rx_resources(struct 
> e1000_adapter *adapter,
>
> e1000_clean_rx_ring(adapter, rx_ring);
>
> +   vfree(rx_ring->xdp_buffer);
> +   rx_ring->xdp_buffer = NULL;
> +
> vfree(rx_ring->buffer_info);
> rx_ring->buffer_info = NULL;
>
> @@ -3369,33 +3381,52 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring 
> *tx_ring,
>  }
>
>  static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> -unsigned int len,
> -struct net_device *netdev,
> -struct e1000_adapter *adapter)
> +__u32 len,
> +struct e1000_adapter *adapter,
> +struct e1000_tx_ring *tx_ring)
>  {
> -   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> -   struct e1000_hw *hw = &adapter->hw;
> -   struct e1000_tx_ring *tx_ring;
> -
> if (len > E1000_MAX_DATA_PER_TXD)
> return;
>
> +   if (E1000_DESC_UNUSED(tx_ring) < 2)
> +   return;
> +
> +   e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +   e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +}
> +
> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
> + struct net_device *netdev,
> + struct e1000_adapter *adapter)
> +{
> +   

Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread John Fastabend
On 16-09-09 02:29 PM, John Fastabend wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
> 
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
> 
> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: John Fastabend 
> ---

This patch is a bit bogus in a few spots as well...


> -
> - if (E1000_DESC_UNUSED(tx_ring) < 2) {
> - HARD_TX_UNLOCK(netdev, txq);
> - return;
> + for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
> + e1000_xmit_raw_frame(buffer_info[i].buffer,
> +  buffer_info[i].length,
> +  adapter, tx_ring);
> + buffer_info[i].buffer->rxbuf.page = NULL;
> + buffer_info[i].buffer = NULL;
> + buffer_info[i].length = 0;
> + i++;
  

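The quoted hunk increments `i` twice per iteration, once in the `for`
header and once again in the body, so every other bundle slot is skipped.
A reduced demonstration of that bug (hypothetical, not the driver code):

```c
/* The quoted loop bumps the index both in the for-header and in the
 * body, so only every other slot is visited -- reduced here to
 * counting visited entries out of n. */
static int visited_with_double_increment(int n)
{
	int i, visited = 0;

	for (i = 0; i < n; i++) {
		visited++;
		i++;	/* the stray extra increment */
	}
	return visited;
}
```

Dropping the `i++` in the body makes the loop visit all n entries.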


Re: [net-next PATCH v2 1/2] e1000: add initial XDP support

2016-09-09 Thread John Fastabend
On 16-09-09 03:04 PM, Eric Dumazet wrote:
> On Fri, 2016-09-09 at 14:29 -0700, John Fastabend wrote:
>> From: Alexei Starovoitov 
>>
> 
> 
> So it looks like e1000_xmit_raw_frame() can return early,
> say if there is no available descriptor.
> 
>> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>> + unsigned int len,
>> + struct net_device *netdev,
>> + struct e1000_adapter *adapter)
>> +{
>> +struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>> +struct e1000_hw *hw = &adapter->hw;
>> +struct e1000_tx_ring *tx_ring;
>> +
>> +if (len > E1000_MAX_DATA_PER_TXD)
>> +return;
>> +
>> +/* e1000 only support a single txq at the moment so the queue is being
>> + * shared with stack. To support this requires locking to ensure the
>> + * stack and XDP are not running at the same time. Devices with
>> + * multiple queues should allocate a separate queue space.
>> + */
>> +HARD_TX_LOCK(netdev, txq, smp_processor_id());
>> +
>> +tx_ring = adapter->tx_ring;
>> +
>> +if (E1000_DESC_UNUSED(tx_ring) < 2) {
>> +HARD_TX_UNLOCK(netdev, txq);
>> +return;
>> +}
>> +
>> +e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
>> +e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
>> +
>> +writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
>> +mmiowb();
>> +
>> +HARD_TX_UNLOCK(netdev, txq);
>> +}
>> +
>>  #define NUM_REGS 38 /* 1 based count */
>>  static void e1000_regdump(struct e1000_adapter *adapter)
>>  {
>> @@ -4142,6 +4247,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct 
>> e1000_adapter *adapter,
>>  return skb;
>>  }
>> +act = e1000_call_bpf(prog, page_address(p), length);
>> +switch (act) {
>> +case XDP_PASS:
>> +break;
>> +case XDP_TX:
>> +dma_sync_single_for_device(>dev,
>> +   dma,
>> +   length,
>> +   DMA_TO_DEVICE);
>> +e1000_xmit_raw_frame(buffer_info, length,
>> + netdev, adapter);
>> +buffer_info->rxbuf.page = NULL;
> 
> 
> So I am trying to understand how pages are not leaked ?
> 
> 

Pages are being leaked thanks! v3 coming soon.



[PATCH RFC 3/6] ila: Call library function alloc_bucket_locks

2016-09-09 Thread Tom Herbert
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_xlat.c | 36 +---
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index e604013..7d1c34b 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -30,34 +30,6 @@ struct ila_net {
bool hooks_registered;
 };
 
-#define LOCKS_PER_CPU 10
-
-static int alloc_ila_locks(struct ila_net *ilan)
-{
-   unsigned int i, size;
-   unsigned int nr_pcpus = num_possible_cpus();
-
-   nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
-   size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
-
-   if (sizeof(spinlock_t) != 0) {
-#ifdef CONFIG_NUMA
-   if (size * sizeof(spinlock_t) > PAGE_SIZE)
-   ilan->locks = vmalloc(size * sizeof(spinlock_t));
-   else
-#endif
-   ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
-   GFP_KERNEL);
-   if (!ilan->locks)
-   return -ENOMEM;
-   for (i = 0; i < size; i++)
-   spin_lock_init(&ilan->locks[i]);
-   }
-   ilan->locks_mask = size - 1;
-
-   return 0;
-}
-
 static u32 hashrnd __read_mostly;
 static __always_inline void __ila_hash_secret_init(void)
 {
@@ -561,14 +533,16 @@ static const struct genl_ops ila_nl_ops[] = {
},
 };
 
-#define ILA_HASH_TABLE_SIZE 1024
+#define LOCKS_PER_CPU 10
+#define MAX_LOCKS 1024
 
 static __net_init int ila_init_net(struct net *net)
 {
int err;
struct ila_net *ilan = net_generic(net, ila_net_id);
 
-   err = alloc_ila_locks(ilan);
+   err = alloc_bucket_spinlocks(&ilan->locks, &ilan->locks_mask,
+MAX_LOCKS, LOCKS_PER_CPU, GFP_KERNEL);
if (err)
return err;
 
@@ -583,7 +557,7 @@ static __net_exit void ila_exit_net(struct net *net)
 
rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL);
 
-   kvfree(ilan->locks);
+   free_bucket_spinlocks(ilan->locks);
 
if (ilan->hooks_registered)
nf_unregister_net_hooks(net, ila_nf_hook_ops,
-- 
2.8.0.rc2



[PATCH RFC 6/6] ila: Resolver mechanism

2016-09-09 Thread Tom Herbert
Implement an ILA resolver. This uses LWT to implement the hook to a
userspace resolver and tracks pending unresolved addresses using the
backend net resolver.

The idea is that the kernel sets an ILA resolver route to the
SIR prefix, something like:

ip route add ::/64 encap ila-resolve \
 via 2401:db00:20:911a::27:0 dev eth0

When a packet hits the route, the address is looked up in a resolver
table. If an entry is created (i.e. no entry with the address already
exists) then an rtnl message is generated with group
RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace
daemon can listen for such messages and perform an ILA resolution
protocol to determine the ILA mapping. If the mapping is resolved
then a /128 ila encap route is set so that the host can perform
ILA translation and send directly to the destination.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   5 ++
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  16 
 net/ipv6/ila/ila_common.c  |   7 ++
 net/ipv6/ila/ila_lwt.c |   9 ++
 net/ipv6/ila/ila_resolver.c| 192 +
 net/ipv6/ila/ila_xlat.c|  15 ++--
 9 files changed, 239 insertions(+), 9 deletions(-)
 create mode 100644 net/ipv6/ila/ila_resolver.c

diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..d880e49 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_IP,
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
+   LWTUNNEL_ENCAP_ILA_NOTIFY,
__LWTUNNEL_ENCAP_MAX,
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 262f037..271215f 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -144,6 +144,9 @@ enum {
RTM_GETSTATS = 94,
 #define RTM_GETSTATS RTM_GETSTATS
 
+   RTM_ADDR_RESOLVE = 95,
+#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE
+
__RTM_MAX,
#define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
@@ -656,6 +659,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_MPLS_ROUTE RTNLGRP_MPLS_ROUTE
RTNLGRP_NSID,
 #define RTNLGRP_NSID   RTNLGRP_NSID
+   RTNLGRP_ILA_NOTIFY,
+#define RTNLGRP_ILA_NOTIFY RTNLGRP_ILA_NOTIFY
__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX (__RTNLGRP_MAX - 1)
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 2343e4f..cf3ea8e 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -97,6 +97,7 @@ config IPV6_ILA
tristate "IPv6: Identifier Locator Addressing (ILA)"
depends on NETFILTER
select LWTUNNEL
+   select NET_EXT_RESOLVER
---help---
  Support for IPv6 Identifier Locator Addressing (ILA).
 
diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile
index 4b32e59..f2aadc3 100644
--- a/net/ipv6/ila/Makefile
+++ b/net/ipv6/ila/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_IPV6_ILA) += ila.o
 
-ila-objs := ila_common.o ila_lwt.o ila_xlat.o
+ila-objs := ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index e0170f6..e369611 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +24,16 @@
 #include 
 #include 
 
+extern unsigned int ila_net_id;
+
+struct ila_net {
+   struct rhashtable rhash_table;
+   spinlock_t *locks; /* Bucket locks for entry manipulation */
+   unsigned int locks_mask;
+   bool hooks_registered;
+   struct net_rslv *nrslv;
+};
+
 struct ila_locator {
union {
__u8v8[8];
@@ -114,9 +125,14 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct 
ila_params *p,
 
 void ila_init_saved_csum(struct ila_params *p);
 
+void ila_rslv_resolved(struct ila_net *ilan, struct ila_addr *iaddr);
 int ila_lwt_init(void);
 void ila_lwt_fini(void);
 int ila_xlat_init(void);
 void ila_xlat_fini(void);
+int ila_rslv_init(void);
+void ila_rslv_fini(void);
+int ila_init_resolver_net(struct ila_net *ilan);
+void ila_exit_resolver_net(struct ila_net *ilan);
 
 #endif /* __ILA_H */
diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index aba0998..83c7d4a 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -157,7 +157,13 @@ static int __init ila_init(void)
if (ret)
goto fail_xlat;
 
+   ret = ila_rslv_init();
+   if (ret)
+   goto fail_rslv;
+
return 0;
+fail_rslv:
+   ila_xlat_fini();
 fail_xlat:
ila_lwt_fini();
 fail_lwt:
@@ -168,6 +174,7 @@ static void __exit ila_fini(void)
 {
ila_xlat_fini();
ila_lwt_fini();
+   ila_rslv_fini();
 }
 
 module_init(ila_init);
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index 

[PATCH RFC 5/6] net: Generic resolver backend

2016-09-09 Thread Tom Herbert
This patch implements the backend of a resolver, specifically it
provides a means to track unresolved addresses and to time them out.

The resolver is mostly a frontend to an rhashtable where the key
of the table is whatever address type or object is tracked. A resolver
instance is created by net_rslv_create. A resolver is destroyed by
net_rslv_destroy.

There are two functions that are used to manipulate entries in the
table: net_rslv_lookup_and_create and net_rslv_resolved.

net_rslv_lookup_and_create is called with an unresolved address as
the argument. It returns a structure of type net_rslv_ent. When
called, a lookup is performed to see if an entry for the address
is already in the table; if it is, the entry is returned and
false is returned in the new bool pointer argument to indicate that
the entry was preexisting. If an entry is not found, one is created
and true is returned in the new pointer argument. It is expected
that when an entry is new the address resolution protocol is
initiated (for instance an RTM_ADDR_RESOLVE message may be sent to a
userspace daemon as we will do in ILA). If net_rslv_lookup_and_create
returns NULL then presumably the hash table has reached its limit on
the number of outstanding unresolved addresses, and the caller should
take appropriate action to avoid spamming the resolution protocol.

net_rslv_resolved is called when resolution is complete (e.g. an
ILA locator mapping was instantiated for a locator). The entry is
removed from the hash table.

An argument to net_rslv_create indicates a timeout for a pending
resolution in milliseconds. If the timer fires before resolution
then the entry is removed from the table. Subsequently, another
attempt to resolve the same address will result in a new entry in
the table.

net_rslv_lookup_and_create allocates a net_rslv_ent struct and
includes allocating related user data. This is the object[] field
in the structure. The key (unresolved address) is always the first
field in the object. Following that, the caller may add its
own private data. The key length and the size of the user object
(including the key) are specified in net_rslv_create.

There are three callback functions that can be set as arguments
to net_rslv_create:

   - cmp_fn: Compare function for hash table. Arguments are the
   key and an object in the table. If this is NULL then the
   default memcmp of rhashtable is used.

   - init_fn: Initializes a new net_rslv_ent structure. This allows
   initialization of the user portion of the structure
   (the object[]).

   - destroy_fn: Called right before a net_rslv_ent is freed.
   This allows cleanup of user data associated with the
   entry.

Note that the resolver backend only tracks unresolved addresses; it
is up to the caller to perform the mechanics of resolution. This
includes the possibility of queuing packets awaiting resolution, which
can be accomplished, for instance, by maintaining an skbuff queue
in the net_rslv_ent user object[] data.

DOS mitigation is done by limiting the number of entries in the
resolver table (the max_size argument of net_rslv_create)
and setting a timeout. If the timeout is set then the maximum rate
of new resolution requests is max_table_size / timeout. For
instance, with a maximum size of 1000 entries and a timeout of 100
msecs the maximum rate of resolution requests is 10,000/s.

Signed-off-by: Tom Herbert 
---
 include/net/resolver.h |  58 +++
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/resolver.c| 267 +
 4 files changed, 330 insertions(+)
 create mode 100644 include/net/resolver.h
 create mode 100644 net/core/resolver.c

diff --git a/include/net/resolver.h b/include/net/resolver.h
new file mode 100644
index 000..8f73b5c
--- /dev/null
+++ b/include/net/resolver.h
@@ -0,0 +1,58 @@
+#ifndef __NET_RESOLVER_H
+#define __NET_RESOLVER_H
+
+#include 
+
+struct net_rslv;
+struct net_rslv_ent;
+
+typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key,
+ const void *object);
+typedef void (*net_rslv_initfn)(struct net_rslv *nrslv, void *object);
+typedef void (*net_rslv_destroyfn)(struct net_rslv_ent *nrent);
+
+struct net_rslv {
+   struct rhashtable rhash_table;
+   struct rhashtable_params params;
+   net_rslv_cmpfn rslv_cmp;
+   net_rslv_initfn rslv_init;
+   net_rslv_destroyfn rslv_destroy;
+   size_t obj_size;
+   spinlock_t *locks;
+   unsigned int locks_mask;
+   unsigned int hash_rnd;
+   long timeout;
+};
+
+struct net_rslv_ent {
+   struct rcu_head rcu;
+   union {
+   /* Fields set when entry is in hash table */
+   struct {
+   struct rhash_head node;
+   struct delayed_work timeout_work;
+   struct net_rslv *nrslv;
+   };
+
+   /* Fields set when rcu 

[PATCH RFC 4/6] rhashtable: abstract out function to get hash

2016-09-09 Thread Tom Herbert
Split out the part of rht_key_hashfn that calculates the hash into
its own function. This way the hash function can be called separately
to get the hash value.

Signed-off-by: Tom Herbert 
---
 include/linux/rhashtable.h | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index fd82584..e398a62 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -208,34 +208,42 @@ static inline unsigned int rht_bucket_index(const struct 
bucket_table *tbl,
return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
 }
 
-static inline unsigned int rht_key_hashfn(
-   struct rhashtable *ht, const struct bucket_table *tbl,
-   const void *key, const struct rhashtable_params params)
+static inline unsigned int rht_key_get_hash(struct rhashtable *ht,
+   const void *key, const struct rhashtable_params params,
+   unsigned int hash_rnd)
 {
unsigned int hash;
 
/* params must be equal to ht->p if it isn't constant. */
if (!__builtin_constant_p(params.key_len))
-   hash = ht->p.hashfn(key, ht->key_len, tbl->hash_rnd);
+   hash = ht->p.hashfn(key, ht->key_len, hash_rnd);
else if (params.key_len) {
unsigned int key_len = params.key_len;
 
if (params.hashfn)
-   hash = params.hashfn(key, key_len, tbl->hash_rnd);
+   hash = params.hashfn(key, key_len, hash_rnd);
else if (key_len & (sizeof(u32) - 1))
-   hash = jhash(key, key_len, tbl->hash_rnd);
+   hash = jhash(key, key_len, hash_rnd);
else
-   hash = jhash2(key, key_len / sizeof(u32),
- tbl->hash_rnd);
+   hash = jhash2(key, key_len / sizeof(u32), hash_rnd);
} else {
unsigned int key_len = ht->p.key_len;
 
if (params.hashfn)
-   hash = params.hashfn(key, key_len, tbl->hash_rnd);
+   hash = params.hashfn(key, key_len, hash_rnd);
else
-   hash = jhash(key, key_len, tbl->hash_rnd);
+   hash = jhash(key, key_len, hash_rnd);
}
 
+   return hash;
+}
+
+static inline unsigned int rht_key_hashfn(
+   struct rhashtable *ht, const struct bucket_table *tbl,
+   const void *key, const struct rhashtable_params params)
+{
+   unsigned int hash = rht_key_get_hash(ht, key, params, tbl->hash_rnd);
+
return rht_bucket_index(tbl, hash);
 }
 
-- 
2.8.0.rc2



[PATCH RFC 0/6] net: ILA resolver and generic resolver backend

2016-09-09 Thread Tom Herbert
This patch set implements an ILA host-side resolver. It uses LWT to
implement the hook to a userspace resolver and tracks pending unresolved
addresses using the backend net resolver.

This patch set contains:

- An new library function to allocate an array of spinlocks for use
  with locking hash buckets.
- Make hash function in rhashtable directly callable.
- A generic resolver backend infrastructure. This primarily does two
  things: track unresolved addresses and implement a timeout for
  resolutions that do not happen. These mechanisms provide rate-limiting
  control over resolution requests (for instance in ILA it is used
  to rate limit requests to userspace to resolve addresses).
- The ILA resolver. This implements the path from the kernel ILA
  implementation to a userspace daemon to signal that an identifier
  address needs to be resolved.

Tom Herbert (6):
  spinlock: Add library function to allocate spinlock buckets array
  rhashtable: Call library function alloc_bucket_locks
  ila: Call library function alloc_bucket_locks
  rhashtable: abstract out function to get hash
  net: Generic resolver backend
  ila: Resolver mechanism

 include/linux/rhashtable.h |  28 +++--
 include/linux/spinlock.h   |   6 +
 include/net/resolver.h |  58 +
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   5 +
 lib/Makefile   |   2 +-
 lib/bucket_locks.c |  63 ++
 lib/rhashtable.c   |  46 +--
 net/Kconfig|   4 +
 net/core/Makefile  |   1 +
 net/core/resolver.c| 267 +
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  16 +++
 net/ipv6/ila/ila_common.c  |   7 ++
 net/ipv6/ila/ila_lwt.c |   9 ++
 net/ipv6/ila/ila_resolver.c| 192 +
 net/ipv6/ila/ila_xlat.c|  51 ++--
 18 files changed, 666 insertions(+), 93 deletions(-)
 create mode 100644 include/net/resolver.h
 create mode 100644 lib/bucket_locks.c
 create mode 100644 net/core/resolver.c
 create mode 100644 net/ipv6/ila/ila_resolver.c

-- 
2.8.0.rc2



[PATCH RFC 2/6] rhashtable: Call library function alloc_bucket_locks

2016-09-09 Thread Tom Herbert
To allocate the array of bucket locks for the hash table we now
call the library function alloc_bucket_spinlocks. This function is
based on the old alloc_bucket_locks in rhashtable and should
produce the same effect.

Signed-off-by: Tom Herbert 
---
 lib/rhashtable.c | 46 --
 1 file changed, 4 insertions(+), 42 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 06c2872..5b53304 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -59,50 +59,10 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #define ASSERT_RHT_MUTEX(HT)
 #endif
 
-
-static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
- gfp_t gfp)
-{
-   unsigned int i, size;
-#if defined(CONFIG_PROVE_LOCKING)
-   unsigned int nr_pcpus = 2;
-#else
-   unsigned int nr_pcpus = num_possible_cpus();
-#endif
-
-   nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL);
-   size = roundup_pow_of_two(nr_pcpus * ht->p.locks_mul);
-
-   /* Never allocate more than 0.5 locks per bucket */
-   size = min_t(unsigned int, size, tbl->size >> 1);
-
-   if (sizeof(spinlock_t) != 0) {
-   tbl->locks = NULL;
-#ifdef CONFIG_NUMA
-   if (size * sizeof(spinlock_t) > PAGE_SIZE &&
-   gfp == GFP_KERNEL)
-   tbl->locks = vmalloc(size * sizeof(spinlock_t));
-#endif
-   if (gfp != GFP_KERNEL)
-   gfp |= __GFP_NOWARN | __GFP_NORETRY;
-
-   if (!tbl->locks)
-   tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
-  gfp);
-   if (!tbl->locks)
-   return -ENOMEM;
-   for (i = 0; i < size; i++)
-   spin_lock_init(&tbl->locks[i]);
-   }
-   tbl->locks_mask = size - 1;
-
-   return 0;
-}
-
 static void bucket_table_free(const struct bucket_table *tbl)
 {
if (tbl)
-   kvfree(tbl->locks);
+   free_bucket_spinlocks(tbl->locks);
 
kvfree(tbl);
 }
@@ -131,7 +91,9 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
tbl->size = nbuckets;
 
-   if (alloc_bucket_locks(ht, tbl, gfp) < 0) {
+   /* Never allocate more than 0.5 locks per bucket */
+   if (alloc_bucket_spinlocks(&tbl->locks, &tbl->locks_mask,
+  tbl->size >> 1, ht->p.locks_mul, gfp)) {
bucket_table_free(tbl);
return NULL;
}
-- 
2.8.0.rc2



[PATCH RFC 1/6] spinlock: Add library function to allocate spinlock buckets array

2016-09-09 Thread Tom Herbert
Add two new library functions, alloc_bucket_spinlocks and
free_bucket_spinlocks. These are used to allocate and free an array
of spinlocks that are useful as locks for hash buckets. The interface
specifies the maximum number of spinlocks in the array as well
as a CPU multiplier to derive the number of spinlocks to allocate.
The number to allocate is rounded up to a power of two to make
the array amenable to hash lookup.

Signed-off-by: Tom Herbert 
---
 include/linux/spinlock.h |  6 +
 lib/Makefile |  2 +-
 lib/bucket_locks.c   | 63 
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 lib/bucket_locks.c

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0ce..4ebdfbf 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -416,4 +416,10 @@ extern int _atomic_dec_and_lock(atomic_t *atomic, 
spinlock_t *lock);
 #define atomic_dec_and_lock(atomic, lock) \
__cond_lock(lock, _atomic_dec_and_lock(atomic, lock))
 
+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *lock_mask,
+  unsigned int max_size, unsigned int cpu_mult,
+  gfp_t gfp);
+
+void free_bucket_spinlocks(spinlock_t *locks);
+
 #endif /* __LINUX_SPINLOCK_H */
diff --git a/lib/Makefile b/lib/Makefile
index cfa68eb..a1dedf1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -37,7 +37,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o 
debug_locks.o random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o
+once.o bucket_locks.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/bucket_locks.c b/lib/bucket_locks.c
new file mode 100644
index 000..bb9bf11
--- /dev/null
+++ b/lib/bucket_locks.c
@@ -0,0 +1,63 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Allocate an array of spinlocks to be accessed by a hash. Two arguments
+ * indicate the number of elements to allocate in the array. max_size
+ * gives the maximum number of elements to allocate. cpu_mult gives
+ * the number of locks per CPU to allocate. The size is rounded up
+ * to a power of 2 to be suitable as a hash table.
+ */
+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *locks_mask,
+  unsigned int max_size, unsigned int cpu_mult,
+  gfp_t gfp)
+{
+   unsigned int i, size;
+#if defined(CONFIG_PROVE_LOCKING)
+   unsigned int nr_pcpus = 2;
+#else
+   unsigned int nr_pcpus = num_possible_cpus();
+#endif
+   spinlock_t *tlocks = NULL;
+
+   if (cpu_mult) {
+   nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL);
+   size = min_t(unsigned int, nr_pcpus * cpu_mult, max_size);
+   } else {
+   size = max_size;
+   }
+   size = roundup_pow_of_two(size);
+
+   if (!size)
+   return -EINVAL;
+
+   if (sizeof(spinlock_t) != 0) {
+#ifdef CONFIG_NUMA
+   if (size * sizeof(spinlock_t) > PAGE_SIZE &&
+   gfp == GFP_KERNEL)
+   tlocks = vmalloc(size * sizeof(spinlock_t));
+#endif
+   if (gfp != GFP_KERNEL)
+   gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
+   if (!tlocks)
+   tlocks = kmalloc_array(size, sizeof(spinlock_t), gfp);
+   if (!tlocks)
+   return -ENOMEM;
+   for (i = 0; i < size; i++)
+   spin_lock_init(&tlocks[i]);
+   }
+   *locks = tlocks;
+   *locks_mask = size - 1;
+
+   return 0;
+}
+EXPORT_SYMBOL(alloc_bucket_spinlocks);
+
+void free_bucket_spinlocks(spinlock_t *locks)
+{
+   kvfree(locks);
+}
+EXPORT_SYMBOL(free_bucket_spinlocks);
-- 
2.8.0.rc2



Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread महेश बंडेवार
On Fri, Sep 9, 2016 at 3:26 PM, David Ahern  wrote:
> On 9/9/16 3:53 PM, Mahesh Bandewar wrote:
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index 0c5415b05ea9..95edd1737ab5 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -149,6 +149,7 @@ config IPVLAN
>>  tristate "IP-VLAN support"
>>  depends on INET
>>  depends on IPV6
>> +select NET_L3_MASTER_DEV
>
> depends on instead of select?

The kbuild/kconfig-language.txt doc suggests that with "depends on" the
option _must_ already be selected, otherwise menuconfig won't even
present the dependent option, while "select" positively sets the option.

INET and IPv6 are well understood and almost all configs enable them.
NET_L3_MASTER_DEV is very new and not well understood, so the chances
of someone _not_ enabling them (IPvlan and L3_MASTER) together are very
high.
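To make the difference concrete, this is roughly what the two choices
look like in the Kconfig fragment under discussion (illustrative only,
not a proposed patch):

```kconfig
# With "select", enabling IPVLAN forces NET_L3_MASTER_DEV on:
config IPVLAN
	tristate "IP-VLAN support"
	depends on INET
	depends on IPV6
	select NET_L3_MASTER_DEV

# With "depends on" instead, IPVLAN would stay hidden in menuconfig
# until NET_L3_MASTER_DEV had been enabled by something else first:
#	depends on NET_L3_MASTER_DEV
```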


[PATCH 5/9] selftests: Update prctl Makefile to work under selftests

2016-09-09 Thread Shuah Khan
Update the prctl Makefile to work under selftests. prctl will not be run
as part of the selftests suite and will not be included in install
targets. The tests can be built separately for now.

Signed-off-by: Shuah Khan 
---
 tools/testing/selftests/prctl/Makefile | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/prctl/Makefile 
b/tools/testing/selftests/prctl/Makefile
index 44de308..35aa1c8 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -1,10 +1,15 @@
 ifndef CROSS_COMPILE
-# List of programs to build
-hostprogs-$(CONFIG_X86) := disable-tsc-ctxt-sw-stress-test 
disable-tsc-on-off-stress-test disable-tsc-test
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
-HOSTCFLAGS_disable-tsc-ctxt-sw-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-on-off-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-test.o += -I$(objtree)/usr/include
+ifeq ($(ARCH),x86)
+TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
+   disable-tsc-test
+all: $(TEST_PROGS)
+
+include ../lib.mk
+
+clean:
+   rm -fr $(TEST_PROGS)
+endif
 endif
-- 
2.7.4



[PATCH 0/9] Move runnable code (tests) from Documentation to selftests

2016-09-09 Thread Shuah Khan
Move runnable code (tests) from Documentation to selftests and update
Makefiles to work under selftests.

Jon Corbet and I discussed this in an email thread and as per that
discussion, this patch series moves all the tests that are under the
Documentation directory to selftests. There is more runnable code in
the form of examples and utils and that is going to be another patch
series. I moved just the tests and left the documentation files as is.

Checkpatch isn't happy with a few of the patches as some of the
renamed files have existing checkpatch errors and warnings. I am
working on another patch series that will address those.

Shuah Khan (9):
  selftests: move dnotify_test from Documentation/filesystems
  selftests: update filesystems Makefile to work under selftests
  selftests: move .gitignore from Documentation/filesystems
  selftests: move prctl tests from Documentation/prctl
  selftests: Update prctl Makefile to work under selftests
  selftests: move ptp tests from Documentation/ptp
  selftests: Update ptp Makefile to work under selftests
  selftests: move vDSO tests from Documentation/vDSO
  selftests: Update vDSO Makefile to work under selftests

 Documentation/filesystems/.gitignore   |   1 -
 Documentation/filesystems/Makefile |   5 -
 Documentation/filesystems/dnotify_test.c   |  34 --
 Documentation/prctl/.gitignore |   3 -
 Documentation/prctl/Makefile   |  10 -
 .../prctl/disable-tsc-ctxt-sw-stress-test.c|  97 
 .../prctl/disable-tsc-on-off-stress-test.c |  96 
 Documentation/prctl/disable-tsc-test.c |  95 
 Documentation/ptp/.gitignore   |   1 -
 Documentation/ptp/Makefile |   8 -
 Documentation/ptp/testptp.c| 523 -
 Documentation/ptp/testptp.mk   |  33 --
 Documentation/vDSO/.gitignore  |   2 -
 Documentation/vDSO/Makefile|  17 -
 Documentation/vDSO/parse_vdso.c| 269 ---
 Documentation/vDSO/vdso_standalone_test_x86.c  | 128 -
 Documentation/vDSO/vdso_test.c |  52 --
 tools/testing/selftests/filesystems/.gitignore |   1 +
 tools/testing/selftests/filesystems/Makefile   |   7 +
 tools/testing/selftests/filesystems/dnotify_test.c |  34 ++
 tools/testing/selftests/prctl/.gitignore   |   3 +
 tools/testing/selftests/prctl/Makefile |  15 +
 .../prctl/disable-tsc-ctxt-sw-stress-test.c|  97 
 .../prctl/disable-tsc-on-off-stress-test.c |  96 
 tools/testing/selftests/prctl/disable-tsc-test.c   |  95 
 tools/testing/selftests/ptp/.gitignore |   1 +
 tools/testing/selftests/ptp/Makefile   |   8 +
 tools/testing/selftests/ptp/testptp.c  | 523 +
 tools/testing/selftests/ptp/testptp.mk |  33 ++
 tools/testing/selftests/vDSO/.gitignore|   2 +
 tools/testing/selftests/vDSO/Makefile  |  20 +
 tools/testing/selftests/vDSO/parse_vdso.c  | 269 +++
 .../selftests/vDSO/vdso_standalone_test_x86.c  | 128 +
 tools/testing/selftests/vDSO/vdso_test.c   |  52 ++
 34 files changed, 1384 insertions(+), 1374 deletions(-)
 delete mode 100644 Documentation/filesystems/.gitignore
 delete mode 100644 Documentation/filesystems/Makefile
 delete mode 100644 Documentation/filesystems/dnotify_test.c
 delete mode 100644 Documentation/prctl/.gitignore
 delete mode 100644 Documentation/prctl/Makefile
 delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-test.c
 delete mode 100644 Documentation/ptp/.gitignore
 delete mode 100644 Documentation/ptp/Makefile
 delete mode 100644 Documentation/ptp/testptp.c
 delete mode 100644 Documentation/ptp/testptp.mk
 delete mode 100644 Documentation/vDSO/.gitignore
 delete mode 100644 Documentation/vDSO/Makefile
 delete mode 100644 Documentation/vDSO/parse_vdso.c
 delete mode 100644 Documentation/vDSO/vdso_standalone_test_x86.c
 delete mode 100644 Documentation/vDSO/vdso_test.c
 create mode 100644 tools/testing/selftests/filesystems/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/Makefile
 create mode 100644 tools/testing/selftests/filesystems/dnotify_test.c
 create mode 100644 tools/testing/selftests/prctl/.gitignore
 create mode 100644 tools/testing/selftests/prctl/Makefile
 create mode 100644 
tools/testing/selftests/prctl/disable-tsc-ctxt-sw-stress-test.c
 create mode 100644 
tools/testing/selftests/prctl/disable-tsc-on-off-stress-test.c
 create mode 100644 tools/testing/selftests/prctl/disable-tsc-test.c
 create mode 100644 tools/testing/selftests/ptp/.gitignore
 create mode 100644 tools/testing/selftests/ptp/Makefile
 create mode 100644 

[PATCH 3/9] selftests: move .gitignore from Documentation/filesystems

2016-09-09 Thread Shuah Khan
Move .gitignore for dnotify_test from Documentation/filesystems to
selftests/filesystems.

Signed-off-by: Shuah Khan 
---
 Documentation/filesystems/.gitignore   | 1 -
 tools/testing/selftests/filesystems/.gitignore | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)
 delete mode 100644 Documentation/filesystems/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/.gitignore

diff --git a/Documentation/filesystems/.gitignore 
b/Documentation/filesystems/.gitignore
deleted file mode 100644
index 31d6e42..000
--- a/Documentation/filesystems/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-dnotify_test
diff --git a/tools/testing/selftests/filesystems/.gitignore 
b/tools/testing/selftests/filesystems/.gitignore
new file mode 100644
index 000..31d6e42
--- /dev/null
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -0,0 +1 @@
+dnotify_test
-- 
2.7.4



[PATCH 4/9] selftests: move prctl tests from Documentation/prctl

2016-09-09 Thread Shuah Khan
Move prctl tests from Documentation/prctl to selftests/prctl.

Signed-off-by: Shuah Khan 
---
 Documentation/prctl/.gitignore |  3 -
 Documentation/prctl/Makefile   | 10 ---
 .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 --
 .../prctl/disable-tsc-on-off-stress-test.c | 96 -
 Documentation/prctl/disable-tsc-test.c | 95 -
 tools/testing/selftests/prctl/.gitignore   |  3 +
 tools/testing/selftests/prctl/Makefile | 10 +++
 .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 ++
 .../prctl/disable-tsc-on-off-stress-test.c | 96 +
 tools/testing/selftests/prctl/disable-tsc-test.c   | 95 +
 10 files changed, 301 insertions(+), 301 deletions(-)
 delete mode 100644 Documentation/prctl/.gitignore
 delete mode 100644 Documentation/prctl/Makefile
 delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-test.c
 create mode 100644 tools/testing/selftests/prctl/.gitignore
 create mode 100644 tools/testing/selftests/prctl/Makefile
 create mode 100644 
tools/testing/selftests/prctl/disable-tsc-ctxt-sw-stress-test.c
 create mode 100644 
tools/testing/selftests/prctl/disable-tsc-on-off-stress-test.c
 create mode 100644 tools/testing/selftests/prctl/disable-tsc-test.c

diff --git a/Documentation/prctl/.gitignore b/Documentation/prctl/.gitignore
deleted file mode 100644
index 0b5c274..000
--- a/Documentation/prctl/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-disable-tsc-ctxt-sw-stress-test
-disable-tsc-on-off-stress-test
-disable-tsc-test
diff --git a/Documentation/prctl/Makefile b/Documentation/prctl/Makefile
deleted file mode 100644
index 44de308..000
--- a/Documentation/prctl/Makefile
+++ /dev/null
@@ -1,10 +0,0 @@
-ifndef CROSS_COMPILE
-# List of programs to build
-hostprogs-$(CONFIG_X86) := disable-tsc-ctxt-sw-stress-test 
disable-tsc-on-off-stress-test disable-tsc-test
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS_disable-tsc-ctxt-sw-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-on-off-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-test.o += -I$(objtree)/usr/include
-endif
diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c 
b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
deleted file mode 100644
index f7499d1..000
--- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
+++ /dev/null
@@ -1,97 +0,0 @@
-/*
- * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
- *
- * Tests if the control register is updated correctly
- * at context switches
- *
- * Warning: this test will cause a very high load for a few seconds
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-
-#include 
-#include 
-
-/* Get/set the process' ability to use the timestamp counter instruction */
-#ifndef PR_GET_TSC
-#define PR_GET_TSC 25
-#define PR_SET_TSC 26
-# define PR_TSC_ENABLE 1   /* allow the use of the timestamp counter */
-# define PR_TSC_SIGSEGV 2  /* throw a SIGSEGV instead of reading the TSC */
-#endif
-
-static uint64_t rdtsc(void)
-{
-uint32_t lo, hi;
-/* We cannot use "=A", since this would use %rax on x86_64 */
-__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
-return (uint64_t)hi << 32 | lo;
-}
-
-static void sigsegv_expect(int sig)
-{
-   /* */
-}
-
-static void segvtask(void)
-{
-   if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
-   {
-   perror("prctl");
-   exit(0);
-   }
-   signal(SIGSEGV, sigsegv_expect);
-   alarm(10);
-   rdtsc();
-   fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
-   exit(0);
-}
-
-
-static void sigsegv_fail(int sig)
-{
-   fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
-   exit(0);
-}
-
-static void rdtsctask(void)
-{
-   if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
-   {
-   perror("prctl");
-   exit(0);
-   }
-   signal(SIGSEGV, sigsegv_fail);
-   alarm(10);
-   for(;;) rdtsc();
-}
-
-
-int main(void)
-{
-   int n_tasks = 100, i;
-
-   fprintf(stderr, "[No further output means we're allright]\n");
-
-   for (i=0; i

[PATCH 7/9] selftests: Update ptp Makefile to work under selftests

2016-09-09 Thread Shuah Khan
Update the ptp Makefile to work under selftests. The ptp tests will not be run
as part of the selftests suite and will not be included in install targets.
They can be built separately for now.

Signed-off-by: Shuah Khan 
---
 tools/testing/selftests/ptp/Makefile | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/ptp/Makefile 
b/tools/testing/selftests/ptp/Makefile
index 293d6c0..f4a7238 100644
--- a/tools/testing/selftests/ptp/Makefile
+++ b/tools/testing/selftests/ptp/Makefile
@@ -1,8 +1,8 @@
-# List of programs to build
-hostprogs-y := testptp
+TEST_PROGS := testptp
+LDLIBS += -lrt
+all: $(TEST_PROGS)
 
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
+include ../lib.mk
 
-HOSTCFLAGS_testptp.o += -I$(objtree)/usr/include
-HOSTLOADLIBES_testptp := -lrt
+clean:
+   rm -fr testptp
-- 
2.7.4
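For reference, each conversion in this series follows the same lib.mk shape; a sketch of that pattern (illustrative only — `my_test` is a placeholder name, the real Makefiles are in the diffs):

```make
# Illustrative selftests Makefile shape used by this series.
TEST_PROGS := my_test
LDLIBS += -lrt			# only if the test needs librt

all: $(TEST_PROGS)

include ../lib.mk

clean:
	rm -f $(TEST_PROGS)
```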



[PATCH 2/9] selftests: update filesystems Makefile to work under selftests

2016-09-09 Thread Shuah Khan
Update to work under selftests. dnotify_test will not be run as part of the
selftests suite and will not be included in install targets. It can be built
separately for now.

Signed-off-by: Shuah Khan 
---
 tools/testing/selftests/filesystems/Makefile | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/filesystems/Makefile 
b/tools/testing/selftests/filesystems/Makefile
index 883010c..f1dce5c 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,5 +1,7 @@
-# List of programs to build
-hostprogs-y := dnotify_test
+TEST_PROGS := dnotify_test
+all: $(TEST_PROGS)
 
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
+include ../lib.mk
+
+clean:
+   rm -fr dnotify_test
-- 
2.7.4



[PATCH 1/9] selftests: move dnotify_test from Documentation/filesystems

2016-09-09 Thread Shuah Khan
Move dnotify_test from Documentation/filesystems to selftests/filesystems

Signed-off-by: Shuah Khan 
---
 Documentation/filesystems/Makefile |  5 
 Documentation/filesystems/dnotify_test.c   | 34 --
 tools/testing/selftests/filesystems/Makefile   |  5 
 tools/testing/selftests/filesystems/dnotify_test.c | 34 ++
 4 files changed, 39 insertions(+), 39 deletions(-)
 delete mode 100644 Documentation/filesystems/Makefile
 delete mode 100644 Documentation/filesystems/dnotify_test.c
 create mode 100644 tools/testing/selftests/filesystems/Makefile
 create mode 100644 tools/testing/selftests/filesystems/dnotify_test.c

diff --git a/Documentation/filesystems/Makefile 
b/Documentation/filesystems/Makefile
deleted file mode 100644
index 883010c..0000000
--- a/Documentation/filesystems/Makefile
+++ /dev/null
@@ -1,5 +0,0 @@
-# List of programs to build
-hostprogs-y := dnotify_test
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
diff --git a/Documentation/filesystems/dnotify_test.c 
b/Documentation/filesystems/dnotify_test.c
deleted file mode 100644
index 8b37b4a..0000000
--- a/Documentation/filesystems/dnotify_test.c
+++ /dev/null
@@ -1,34 +0,0 @@
-#define _GNU_SOURCE /* needed to get the defines */
-#include  /* in glibc 2.2 this has the needed
-  values defined */
-#include 
-#include 
-#include 
-
-static volatile int event_fd;
-
-static void handler(int sig, siginfo_t *si, void *data)
-{
-   event_fd = si->si_fd;
-}
-
-int main(void)
-{
-   struct sigaction act;
-   int fd;
-
-   act.sa_sigaction = handler;
-   sigemptyset(&act.sa_mask);
-   act.sa_flags = SA_SIGINFO;
-   sigaction(SIGRTMIN + 1, &act, NULL);
-
-   fd = open(".", O_RDONLY);
-   fcntl(fd, F_SETSIG, SIGRTMIN + 1);
-   fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
-   /* we will now be notified if any of the files
-  in "." is modified or new files are created */
-   while (1) {
-   pause();
-   printf("Got event on fd=%d\n", event_fd);
-   }
-}
diff --git a/tools/testing/selftests/filesystems/Makefile 
b/tools/testing/selftests/filesystems/Makefile
new file mode 100644
index 0000000..883010c
--- /dev/null
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -0,0 +1,5 @@
+# List of programs to build
+hostprogs-y := dnotify_test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
diff --git a/tools/testing/selftests/filesystems/dnotify_test.c 
b/tools/testing/selftests/filesystems/dnotify_test.c
new file mode 100644
index 0000000..8b37b4a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/dnotify_test.c
@@ -0,0 +1,34 @@
+#define _GNU_SOURCE /* needed to get the defines */
+#include  /* in glibc 2.2 this has the needed
+  values defined */
+#include 
+#include 
+#include 
+
+static volatile int event_fd;
+
+static void handler(int sig, siginfo_t *si, void *data)
+{
+   event_fd = si->si_fd;
+}
+
+int main(void)
+{
+   struct sigaction act;
+   int fd;
+
+   act.sa_sigaction = handler;
+   sigemptyset(&act.sa_mask);
+   act.sa_flags = SA_SIGINFO;
+   sigaction(SIGRTMIN + 1, &act, NULL);
+
+   fd = open(".", O_RDONLY);
+   fcntl(fd, F_SETSIG, SIGRTMIN + 1);
+   fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
+   /* we will now be notified if any of the files
+  in "." is modified or new files are created */
+   while (1) {
+   pause();
+   printf("Got event on fd=%d\n", event_fd);
+   }
+}
-- 
2.7.4



[PATCH 9/9] selftests: Update vDSO Makefile to work under selftests

2016-09-09 Thread Shuah Khan
Update the vDSO Makefile to work under selftests. The vDSO tests will not be
run as part of the selftests suite and will not be included in install targets.
They can be built separately for now.

Signed-off-by: Shuah Khan 
---
 tools/testing/selftests/vDSO/Makefile | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/vDSO/Makefile 
b/tools/testing/selftests/vDSO/Makefile
index b12e987..706b68b 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -1,17 +1,20 @@
 ifndef CROSS_COMPILE
-# vdso_test won't build for glibc < 2.16, so disable it
-# hostprogs-y := vdso_test
-hostprogs-$(CONFIG_X86) := vdso_standalone_test_x86
-vdso_standalone_test_x86-objs := vdso_standalone_test_x86.o parse_vdso.o
-vdso_test-objs := parse_vdso.o vdso_test.o
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS := -I$(objtree)/usr/include -std=gnu99
-HOSTCFLAGS_vdso_standalone_test_x86.o := -fno-asynchronous-unwind-tables -fno-stack-protector
-HOSTLOADLIBES_vdso_standalone_test_x86 := -nostdlib
+CFLAGS := -std=gnu99
+CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
 ifeq ($(CONFIG_X86_32),y)
-HOSTLOADLIBES_vdso_standalone_test_x86 += -lgcc_s
+LDLIBS += -lgcc_s
 endif
+
+TEST_PROGS := vdso_test vdso_standalone_test_x86
+
+all: $(TEST_PROGS)
+vdso_test: parse_vdso.c vdso_test.c
+vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c
+   $(CC) $(CFLAGS) $(CFLAGS_vdso_standalone_test_x86) \
+   vdso_standalone_test_x86.c parse_vdso.c \
+   -o vdso_standalone_test_x86
+
+include ../lib.mk
+clean:
+   rm -fr $(TEST_PROGS)
 endif
-- 
2.7.4



[PATCH 6/9] selftests: move ptp tests from Documentation/ptp

2016-09-09 Thread Shuah Khan
Move ptp tests from Documentation/ptp to selftests/ptp.

Signed-off-by: Shuah Khan 
---
 Documentation/ptp/.gitignore   |   1 -
 Documentation/ptp/Makefile |   8 -
 Documentation/ptp/testptp.c| 523 -
 Documentation/ptp/testptp.mk   |  33 ---
 tools/testing/selftests/ptp/.gitignore |   1 +
 tools/testing/selftests/ptp/Makefile   |   8 +
 tools/testing/selftests/ptp/testptp.c  | 523 +
 tools/testing/selftests/ptp/testptp.mk |  33 +++
 8 files changed, 565 insertions(+), 565 deletions(-)
 delete mode 100644 Documentation/ptp/.gitignore
 delete mode 100644 Documentation/ptp/Makefile
 delete mode 100644 Documentation/ptp/testptp.c
 delete mode 100644 Documentation/ptp/testptp.mk
 create mode 100644 tools/testing/selftests/ptp/.gitignore
 create mode 100644 tools/testing/selftests/ptp/Makefile
 create mode 100644 tools/testing/selftests/ptp/testptp.c
 create mode 100644 tools/testing/selftests/ptp/testptp.mk

diff --git a/Documentation/ptp/.gitignore b/Documentation/ptp/.gitignore
deleted file mode 100644
index f562e49..0000000
--- a/Documentation/ptp/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-testptp
diff --git a/Documentation/ptp/Makefile b/Documentation/ptp/Makefile
deleted file mode 100644
index 293d6c0..0000000
--- a/Documentation/ptp/Makefile
+++ /dev/null
@@ -1,8 +0,0 @@
-# List of programs to build
-hostprogs-y := testptp
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS_testptp.o += -I$(objtree)/usr/include
-HOSTLOADLIBES_testptp := -lrt
diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
deleted file mode 100644
index 5d2eae1..0000000
--- a/Documentation/ptp/testptp.c
+++ /dev/null
@@ -1,523 +0,0 @@
-/*
- * PTP 1588 clock support - User space test program
- *
- * Copyright (C) 2010 OMICRON electronics GmbH
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-#define _GNU_SOURCE
-#define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-#define DEVICE "/dev/ptp0"
-
-#ifndef ADJ_SETOFFSET
-#define ADJ_SETOFFSET 0x0100
-#endif
-
-#ifndef CLOCK_INVALID
-#define CLOCK_INVALID -1
-#endif
-
-/* clock_adjtime is not available in GLIBC < 2.14 */
-#if !__GLIBC_PREREQ(2, 14)
-#include 
-static int clock_adjtime(clockid_t id, struct timex *tx)
-{
-   return syscall(__NR_clock_adjtime, id, tx);
-}
-#endif
-
-static clockid_t get_clockid(int fd)
-{
-#define CLOCKFD 3
-#define FD_TO_CLOCKID(fd)  ((~(clockid_t) (fd) << 3) | CLOCKFD)
-
-   return FD_TO_CLOCKID(fd);
-}
-
-static void handle_alarm(int s)
-{
-   printf("received signal %d\n", s);
-}
-
-static int install_handler(int signum, void (*handler)(int))
-{
-   struct sigaction action;
-   sigset_t mask;
-
-   /* Unblock the signal. */
-   sigemptyset(&mask);
-   sigaddset(&mask, signum);
-   sigprocmask(SIG_UNBLOCK, &mask, NULL);
-
-   /* Install the signal handler. */
-   action.sa_handler = handler;
-   action.sa_flags = 0;
-   sigemptyset(&action.sa_mask);
-   sigaction(signum, &action, NULL);
-
-   return 0;
-}
-
-static long ppb_to_scaled_ppm(int ppb)
-{
-   /*
-* The 'freq' field in the 'struct timex' is in parts per
-* million, but with a 16 bit binary fractional field.
-* Instead of calculating either one of
-*
-*scaled_ppm = (ppb / 1000) << 16  [1]
-*scaled_ppm = (ppb << 16) / 1000  [2]
-*
-* we simply use double precision math, in order to avoid the
-* truncation in [1] and the possible overflow in [2].
-*/
-   return (long) (ppb * 65.536);
-}
-
-static int64_t pctns(struct ptp_clock_time *t)
-{
-   return t->sec * 1000000000LL + t->nsec;
-}
-
-static void usage(char *progname)
-{
-   fprintf(stderr,
-   "usage: %s [options]\n"
-   " -a val request a one-shot alarm after 'val' seconds\n"
-   " -A val request a periodic alarm every 'val' seconds\n"
-   " -c query the ptp clock's capabilities\n"
-   

[PATCH 8/9] selftests: move vDSO tests from Documentation/vDSO

2016-09-09 Thread Shuah Khan
Move vDSO tests from Documentation/vDSO to selftests/vDSO.

Signed-off-by: Shuah Khan 
---
 Documentation/vDSO/.gitignore  |   2 -
 Documentation/vDSO/Makefile|  17 --
 Documentation/vDSO/parse_vdso.c| 269 -
 Documentation/vDSO/vdso_standalone_test_x86.c  | 128 --
 Documentation/vDSO/vdso_test.c |  52 
 tools/testing/selftests/vDSO/.gitignore|   2 +
 tools/testing/selftests/vDSO/Makefile  |  17 ++
 tools/testing/selftests/vDSO/parse_vdso.c  | 269 +
 .../selftests/vDSO/vdso_standalone_test_x86.c  | 128 ++
 tools/testing/selftests/vDSO/vdso_test.c   |  52 
 10 files changed, 468 insertions(+), 468 deletions(-)
 delete mode 100644 Documentation/vDSO/.gitignore
 delete mode 100644 Documentation/vDSO/Makefile
 delete mode 100644 Documentation/vDSO/parse_vdso.c
 delete mode 100644 Documentation/vDSO/vdso_standalone_test_x86.c
 delete mode 100644 Documentation/vDSO/vdso_test.c
 create mode 100644 tools/testing/selftests/vDSO/.gitignore
 create mode 100644 tools/testing/selftests/vDSO/Makefile
 create mode 100644 tools/testing/selftests/vDSO/parse_vdso.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_test.c

diff --git a/Documentation/vDSO/.gitignore b/Documentation/vDSO/.gitignore
deleted file mode 100644
index 133bf9e..0000000
--- a/Documentation/vDSO/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-vdso_test
-vdso_standalone_test_x86
diff --git a/Documentation/vDSO/Makefile b/Documentation/vDSO/Makefile
deleted file mode 100644
index b12e987..0000000
--- a/Documentation/vDSO/Makefile
+++ /dev/null
@@ -1,17 +0,0 @@
-ifndef CROSS_COMPILE
-# vdso_test won't build for glibc < 2.16, so disable it
-# hostprogs-y := vdso_test
-hostprogs-$(CONFIG_X86) := vdso_standalone_test_x86
-vdso_standalone_test_x86-objs := vdso_standalone_test_x86.o parse_vdso.o
-vdso_test-objs := parse_vdso.o vdso_test.o
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS := -I$(objtree)/usr/include -std=gnu99
-HOSTCFLAGS_vdso_standalone_test_x86.o := -fno-asynchronous-unwind-tables -fno-stack-protector
-HOSTLOADLIBES_vdso_standalone_test_x86 := -nostdlib
-ifeq ($(CONFIG_X86_32),y)
-HOSTLOADLIBES_vdso_standalone_test_x86 += -lgcc_s
-endif
-endif
diff --git a/Documentation/vDSO/parse_vdso.c b/Documentation/vDSO/parse_vdso.c
deleted file mode 100644
index 1dbb4b8..0000000
--- a/Documentation/vDSO/parse_vdso.c
+++ /dev/null
@@ -1,269 +0,0 @@
-/*
- * parse_vdso.c: Linux reference vDSO parser
- * Written by Andrew Lutomirski, 2011-2014.
- *
- * This code is meant to be linked in to various programs that run on Linux.
- * As such, it is available with as few restrictions as possible.  This file
- * is licensed under the Creative Commons Zero License, version 1.0,
- * available at http://creativecommons.org/publicdomain/zero/1.0/legalcode
- *
- * The vDSO is a regular ELF DSO that the kernel maps into user space when
- * it starts a program.  It works equally well in statically and dynamically
- * linked binaries.
- *
- * This code is tested on x86.  In principle it should work on any
- * architecture that has a vDSO.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/*
- * To use this vDSO parser, first call one of the vdso_init_* functions.
- * If you've already parsed auxv, then pass the value of AT_SYSINFO_EHDR
- * to vdso_init_from_sysinfo_ehdr.  Otherwise pass auxv to vdso_init_from_auxv.
- * Then call vdso_sym for each symbol you want.  For example, to look up
- * gettimeofday on x86_64, use:
- *
- *  = vdso_sym("LINUX_2.6", "gettimeofday");
- * or
- *  = vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
- *
- * vdso_sym will return 0 if the symbol doesn't exist or if the init function
- * failed or was not called.  vdso_sym is a little slow, so its return value
- * should be cached.
- *
- * vdso_sym is threadsafe; the init functions are not.
- *
- * These are the prototypes:
- */
-extern void vdso_init_from_auxv(void *auxv);
-extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
-extern void *vdso_sym(const char *version, const char *name);
-
-
-/* And here's the code. */
-#ifndef ELF_BITS
-# if ULONG_MAX > 0xffffffffUL
-#  define ELF_BITS 64
-# else
-#  define ELF_BITS 32
-# endif
-#endif
-
-#define ELF_BITS_XFORM2(bits, x) Elf##bits##_##x
-#define ELF_BITS_XFORM(bits, x) ELF_BITS_XFORM2(bits, x)
-#define ELF(x) ELF_BITS_XFORM(ELF_BITS, x)
-
-static struct vdso_info
-{
-   bool valid;
-
-   /* Load information */
-   uintptr_t load_addr;
-   uintptr_t load_offset;  /* load_addr - recorded vaddr */
-
-   /* Symbol table */
-   ELF(Sym) *symtab;
-   const char *symstrings;
-   ELF(Word) *bucket, *chain;
-   ELF(Word) nbucket, nchain;
-
-   /* Version table 

Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread David Ahern
On 9/9/16 3:53 PM, Mahesh Bandewar wrote:
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 0c5415b05ea9..95edd1737ab5 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -149,6 +149,7 @@ config IPVLAN
>  tristate "IP-VLAN support"
>  depends on INET
>  depends on IPV6
> +select NET_L3_MASTER_DEV

depends on instead of select?
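For contrast, the two spellings under discussion look like this (an illustrative Kconfig fragment, not a patch): `select` forces NET_L3_MASTER_DEV on whenever IPVLAN is enabled, while `depends on` would instead hide IPVLAN until the user enables NET_L3_MASTER_DEV themselves.

```
config IPVLAN
	tristate "IP-VLAN support"
	depends on INET
	depends on IPV6
	select NET_L3_MASTER_DEV	# auto-enables the dependency

# the alternative being suggested:
config IPVLAN
	tristate "IP-VLAN support"
	depends on NET_L3_MASTER_DEV	# user must enable it first
```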


Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread महेश बंडेवार
On Fri, Sep 9, 2016 at 3:07 PM, Rick Jones  wrote:
> On 09/09/2016 02:53 PM, Mahesh Bandewar wrote:
>
>> @@ -48,6 +48,11 @@ master device for the L2 processing and routing from
>> that instance will be
>>  used before packets are queued on the outbound device. In this mode the
>> slaves
>>  will not receive nor can send multicast / broadcast traffic.
>>
>> +4.3 L3S mode:
>> +   This is very similar to the L3 mode except that iptables
>> conn-tracking
>> +works in this mode and that is why L3-symsetric (L3s) from iptables
>> perspective.
>> +This will have slightly less performance but that shouldn't matter since
>> you
>> +are choosing this mode over plain-L3 mode to make conn-tracking work.
>
>
> What is that first sentence trying to say?  It appears to be incomplete, and
> is that supposed to be "L3-symmetric?"
>
Apologies! Seems like I picked up the wrong text file (I'll correct this
in the next version). BTW it should read -
" This is very similar to L3 mode except that iptables (conn-tracking)
works in this mode and hence it is  L3-symmetric (L3s). This will have
..."

> happy benchmarking,
>
> rick jones


Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread Rick Jones

On 09/09/2016 02:53 PM, Mahesh Bandewar wrote:


@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.

+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables conn-tracking
+works in this mode and that is why L3-symsetric (L3s) from iptables 
perspective.
+This will have slightly less performance but that shouldn't matter since you
+are choosing this mode over plain-L3 mode to make conn-tracking work.


What is that first sentence trying to say?  It appears to be incomplete, 
and is that supposed to be "L3-symmetric?"


happy benchmarking,

rick jones


Re: [net-next PATCH v2 1/2] e1000: add initial XDP support

2016-09-09 Thread Eric Dumazet
On Fri, 2016-09-09 at 14:29 -0700, John Fastabend wrote:
> From: Alexei Starovoitov 
> 


So it looks like e1000_xmit_raw_frame() can return early,
say if there is no available descriptor.

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +  unsigned int len,
> +  struct net_device *netdev,
> +  struct e1000_adapter *adapter)
> +{
> + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> + struct e1000_hw *hw = &adapter->hw;
> + struct e1000_tx_ring *tx_ring;
> +
> + if (len > E1000_MAX_DATA_PER_TXD)
> + return;
> +
> + /* e1000 only support a single txq at the moment so the queue is being
> +  * shared with stack. To support this requires locking to ensure the
> +  * stack and XDP are not running at the same time. Devices with
> +  * multiple queues should allocate a separate queue space.
> +  */
> + HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> + tx_ring = adapter->tx_ring;
> +
> + if (E1000_DESC_UNUSED(tx_ring) < 2) {
> + HARD_TX_UNLOCK(netdev, txq);
> + return;
> + }
> +
> + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> + writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> + mmiowb();
> +
> + HARD_TX_UNLOCK(netdev, txq);
> +}
> +
>  #define NUM_REGS 38 /* 1 based count */
>  static void e1000_regdump(struct e1000_adapter *adapter)
>  {
> @@ -4142,6 +4247,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct 
> e1000_adapter *adapter,
>   return skb;
>  }
> + act = e1000_call_bpf(prog, page_address(p), length);
> + switch (act) {
> + case XDP_PASS:
> + break;
> + case XDP_TX:
> + dma_sync_single_for_device(&adapter->pdev->dev,
> +dma,
> +length,
> +DMA_TO_DEVICE);
> + e1000_xmit_raw_frame(buffer_info, length,
> +  netdev, adapter);
> + buffer_info->rxbuf.page = NULL;


So I am trying to understand how pages are not leaked?




[PATCH next 2/3] net: Add _nf_(un)register_hooks symbols

2016-09-09 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Add _nf_register_hooks() and _nf_unregister_hooks() calls, which allow
the caller to hold the RTNL mutex.

Signed-off-by: Mahesh Bandewar 
---
 include/linux/netfilter.h |  2 ++
 net/netfilter/core.c  | 51 ++-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9aee896..e82b76781bf6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
 
 /* Functions to register get/setsockopt ranges (non-inclusive).  You
need to check permissions yourself! */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d1c2d7..2c5327e43a88 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 
 static LIST_HEAD(nf_hook_list);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static int _nf_register_hook(struct nf_hook_ops *reg)
 {
struct net *net, *last;
int ret;
 
-   rtnl_lock();
for_each_net(net) {
ret = nf_register_net_hook(net, reg);
if (ret && ret != -ENOENT)
goto rollback;
}
list_add_tail(&reg->list, &nf_hook_list);
-   rtnl_unlock();
 
return 0;
 rollback:
@@ -210,19 +208,34 @@ rollback:
break;
nf_unregister_net_hook(net, reg);
}
+   return ret;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+   int ret;
+
+   rtnl_lock();
+   ret = _nf_register_hook(reg);
rtnl_unlock();
+
return ret;
 }
 EXPORT_SYMBOL(nf_register_hook);
 
-void nf_unregister_hook(struct nf_hook_ops *reg)
+static void _nf_unregister_hook(struct nf_hook_ops *reg)
 {
struct net *net;
 
-   rtnl_lock();
list_del(&reg->list);
for_each_net(net)
nf_unregister_net_hook(net, reg);
+}
+
+void nf_unregister_hook(struct nf_hook_ops *reg)
+{
+   rtnl_lock();
+   _nf_unregister_hook(reg);
rtnl_unlock();
 }
 EXPORT_SYMBOL(nf_unregister_hook);
@@ -246,6 +259,26 @@ err:
 }
 EXPORT_SYMBOL(nf_register_hooks);
 
+/* Caller MUST take rtnl_lock() */
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   unsigned int i;
+   int err = 0;
+
+   for (i = 0; i < n; i++) {
+   err = _nf_register_hook(&reg[i]);
+   if (err)
+   goto err;
+   }
+   return err;
+
+err:
+   if (i > 0)
+   _nf_unregister_hooks(reg, i);
+   return err;
+}
+EXPORT_SYMBOL(_nf_register_hooks);
+
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 {
while (n-- > 0)
@@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned 
int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
+/* Caller MUST take rtnl_lock */
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   while (n-- > 0)
+   _nf_unregister_hook(&reg[n]);
+}
+EXPORT_SYMBOL(_nf_unregister_hooks);
+
 unsigned int nf_iterate(struct list_head *head,
struct sk_buff *skb,
struct nf_hook_state *state,
-- 
2.8.0.rc3.226.g39d4020



[PATCH next 1/3] ipv6: Export p6_route_input_lookup symbol

2016-09-09 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Make ip6_route_input_lookup available outside of the ipv6 module,
similar to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar 
---
 include/net/ip6_route.h | 3 +++
 net/ipv6/route.c| 7 ---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d97305d0e71f..e0cd318d5103 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr 
*daddr)
 }
 
 void ip6_route_input(struct sk_buff *skb);
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags);
 
 struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock 
*sk,
 struct flowi6 *fl6, int flags);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 09d43ff11a8d..9563eedd4f97 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net 
*net, struct fib6_table *
return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags);
 }
 
-static struct dst_entry *ip6_route_input_lookup(struct net *net,
-   struct net_device *dev,
-   struct flowi6 *fl6, int flags)
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags)
 {
if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG)
flags |= RT6_LOOKUP_F_IFACE;
 
return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input);
 }
+EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
 void ip6_route_input(struct sk_buff *skb)
 {
-- 
2.8.0.rc3.226.g39d4020



[PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread Mahesh Bandewar
From: Mahesh Bandewar 

In a typical IPvlan L3 setup, the master is in the default-ns and
each slave is in a different (slave) ns. In this setup, egress
packet processing for traffic originating from a slave-ns will
hit all NF_HOOKs in the slave-ns as well as the default-ns. However,
the same is not true for ingress processing: all these NF_HOOKs are
hit only in the slave-ns, skipping them in the default-ns.
IPvlan in L3 mode is restrictive, and if admins want to deploy
iptables rules in the default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() (added as part of l3mdev
enhancements) to perform input route lookup on RX packets without
changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing over skb to L4.

Signed-off-by: Mahesh Bandewar 
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  7 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 60 ---
 include/uapi/linux/if_link.h|  1 +
 6 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt 
b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..58d3a946f66c 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or 
as a module
There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-   ip link add link   type ipvlan mode { l2 | L3 }
+   ip link add link   type ipvlan mode { l2 | l3 | l3s }
 
e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables conn-tracking
+works in this mode and that is why L3-symsetric (L3s) from iptables 
perspective.
+This will have slightly less performance but that shouldn't matter since you
+are choosing this mode over plain-L3 mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..95edd1737ab5 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
 depends on IPV6
+select NET_L3_MASTER_DEV
 ---help---
   This allows one to create virtual devices off of a main interface
   and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..68b270b59ba9 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define IPVLAN_DRV "ipvlan"
#define IPV_DRV_VER "0.1"
@@ -96,6 +98,7 @@ struct ipvl_port {
struct work_struct  wq;
struct sk_buff_head backlog;
int count;
+   bool ipt_hook_added;
struct rcu_head rcu;
 };
 
@@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev 
*ipvlan,
   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+ u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
case IPVLAN_MODE_L2:
return ipvlan_xmit_mode_l2(skb, dev);
case IPVLAN_MODE_L3:
+   case IPVLAN_MODE_L3S:
return ipvlan_xmit_mode_l3(skb, dev);
}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
return ipvlan_handle_mode_l2(pskb, port);
case IPVLAN_MODE_L3:
return ipvlan_handle_mode_l3(pskb, port);
+   case IPVLAN_MODE_L3S:
+   return RX_HANDLER_PASS;
}
 
/* Should not reach here */
@@ 

[PATCH next 0/3] ipvlan introduce l3s mode

2016-09-09 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Same old problem with a new approach, incorporating suggestions from the
earlier patch series.

First, this is introduced as a new mode rather than by modifying the
old (L3) mode. So the behavior of the existing modes is preserved as
is, and the new L3s mode obeys iptables so that the intended
conn-tracking can work.

To do this, the code uses the newly added l3mdev_rcv() handler and an
iptables hook: l3mdev_rcv() performs an inbound route lookup with the
correct (IPvlan slave) interface, and the iptables hook at LOCAL_INPUT
then changes the input device from the master to the slave to complete
the formality.

The supporting stack changes are trivial: one exports a symbol so the
IPv6 code matches what IPv4 already exports, and the other allows the
netfilter hook registration code to be called with the caller holding
RTNL. Please look into the individual patches for details.

Mahesh Bandewar (3):
  ipv6: Export ip6_route_input_lookup symbol
  net: Add _nf_(un)register_hooks symbols
  ipvlan: Introduce l3s mode

 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  7 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 60 ---
 include/linux/netfilter.h   |  2 +
 include/net/ip6_route.h |  3 ++
 include/uapi/linux/if_link.h|  1 +
 net/ipv6/route.c|  7 +--
 net/netfilter/core.c| 51 ++--
 10 files changed, 217 insertions(+), 16 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[PULL] virtio: fixes for 4.8

2016-09-09 Thread Michael S. Tsirkin
The following changes since commit 3eab887a55424fc2c27553b7bfe32330df83f7b8:

  Linux 4.8-rc4 (2016-08-28 15:04:33 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 5e59d9a1aed26abcc79abe78af5cfd34e53cbe7f:

  virtio_console: Stop doing DMA on the stack (2016-09-09 21:12:45 +0300)


virtio: fixes for 4.8

This includes a couple of bugfixes for virtio.

The virtio console patch is actually also
in x86/tip targeting 4.9 because it helps vmap
stacks, but it also fixes IOMMU_PLATFORM which
was added in 4.8, and it seems important not to
ship that in a broken configuration.

Signed-off-by: Michael S. Tsirkin 


Andy Lutomirski (1):
  virtio_console: Stop doing DMA on the stack

Baoyou Xie (1):
  virtio: mark vring_dma_dev() static

 drivers/char/virtio_console.c | 23 +++
 drivers/virtio/virtio_ring.c  |  2 +-
 2 files changed, 16 insertions(+), 9 deletions(-)


[net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-09 Thread John Fastabend
e1000 supports a single TX queue so it is being shared with the stack
when XDP runs XDP_TX action. This requires taking the xmit lock to
ensure we don't corrupt the tx ring. To avoid taking and dropping the
lock per packet this patch adds a bundling implementation to submit
a bundle of packets to the xmit routine.

I tested this patch running e1000 in a VM using KVM over a tap
device using pktgen to generate traffic along with 'ping -f -l 100'.

Suggested-by: Jesper Dangaard Brouer 
Signed-off-by: John Fastabend 
---
 drivers/net/ethernet/intel/e1000/e1000.h  |   10 +++
 drivers/net/ethernet/intel/e1000/e1000_main.c |   81 +++--
 2 files changed, 71 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 5cf8a0a..877b377 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -133,6 +133,8 @@ struct e1000_adapter;
 #define E1000_TX_QUEUE_WAKE16
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
+/* How many XDP XMIT buffers to bundle into one xmit transaction */
+#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
 
 #define AUTO_ALL_MODES 0
 #define E1000_EEPROM_82544_APM 0x0004
@@ -168,6 +170,11 @@ struct e1000_rx_buffer {
dma_addr_t dma;
 };
 
+struct e1000_rx_buffer_bundle {
+   struct e1000_rx_buffer *buffer;
+   u32 length;
+};
+
 struct e1000_tx_ring {
/* pointer to the descriptor ring memory */
void *desc;
@@ -206,6 +213,9 @@ struct e1000_rx_ring {
struct e1000_rx_buffer *buffer_info;
struct sk_buff *rx_skb_top;
 
+   /* array of XDP buffer information structs */
+   struct e1000_rx_buffer_bundle *xdp_buffer;
+
/* cpu for rx queue */
int cpu;
 
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 91d5c87..b985271 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1738,10 +1738,18 @@ static int e1000_setup_rx_resources(struct e1000_adapter *adapter,
struct pci_dev *pdev = adapter->pdev;
int size, desc_len;
 
+   size = sizeof(struct e1000_rx_buffer_bundle) *
+   E1000_XDP_XMIT_BUNDLE_MAX;
+   rxdr->xdp_buffer = vzalloc(size);
+   if (!rxdr->xdp_buffer)
+   return -ENOMEM;
+
size = sizeof(struct e1000_rx_buffer) * rxdr->count;
rxdr->buffer_info = vzalloc(size);
-   if (!rxdr->buffer_info)
+   if (!rxdr->buffer_info) {
+   vfree(rxdr->xdp_buffer);
return -ENOMEM;
+   }
 
desc_len = sizeof(struct e1000_rx_desc);
 
@@ -1754,6 +1762,7 @@ static int e1000_setup_rx_resources(struct e1000_adapter *adapter,
GFP_KERNEL);
if (!rxdr->desc) {
 setup_rx_desc_die:
+   vfree(rxdr->xdp_buffer);
vfree(rxdr->buffer_info);
return -ENOMEM;
}
@@ -2087,6 +2096,9 @@ static void e1000_free_rx_resources(struct e1000_adapter *adapter,
 
e1000_clean_rx_ring(adapter, rx_ring);
 
+   vfree(rx_ring->xdp_buffer);
+   rx_ring->xdp_buffer = NULL;
+
vfree(rx_ring->buffer_info);
rx_ring->buffer_info = NULL;
 
@@ -3369,33 +3381,52 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
 }
 
 static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
-unsigned int len,
-struct net_device *netdev,
-struct e1000_adapter *adapter)
+__u32 len,
+struct e1000_adapter *adapter,
+struct e1000_tx_ring *tx_ring)
 {
-   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
-   struct e1000_hw *hw = &adapter->hw;
-   struct e1000_tx_ring *tx_ring;
-
if (len > E1000_MAX_DATA_PER_TXD)
return;
 
+   if (E1000_DESC_UNUSED(tx_ring) < 2)
+   return;
+
+   e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
+   e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+}
+
+static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
+ struct net_device *netdev,
+ struct e1000_adapter *adapter)
+{
+   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+   struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+   struct e1000_hw *hw = &adapter->hw;
+   int i = 0;
+
/* e1000 only supports a single txq at the moment so the queue is being
 * shared with stack. To support this requires locking to ensure the
 * stack and XDP are not running at the 

[net-next PATCH v2 1/2] e1000: add initial XDP support

2016-09-09 Thread John Fastabend
From: Alexei Starovoitov 

This patch adds initial support for XDP on the e1000 driver. Note the
e1000 driver does not support page recycling in general, which could be
added as a further improvement; however, the XDP_DROP case will recycle.
XDP_TX and XDP_PASS do not support recycling yet.

I tested this patch running e1000 in a VM using KVM over a tap
device.

CC: William Tu 
Signed-off-by: Alexei Starovoitov 
Signed-off-by: John Fastabend 
---
 drivers/net/ethernet/intel/e1000/e1000.h  |2 
 drivers/net/ethernet/intel/e1000/e1000_main.c |  171 +
 2 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index d7bdea7..5cf8a0a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -150,6 +150,7 @@ struct e1000_adapter;
  */
 struct e1000_tx_buffer {
struct sk_buff *skb;
+   struct page *page;
dma_addr_t dma;
unsigned long time_stamp;
u16 length;
@@ -279,6 +280,7 @@ struct e1000_adapter {
 struct e1000_rx_ring *rx_ring,
 int cleaned_count);
struct e1000_rx_ring *rx_ring;  /* One per active queue */
+   struct bpf_prog *prog;
struct napi_struct napi;
 
int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f42129d..91d5c87 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
return 0;
 }
 
+static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+   struct e1000_adapter *adapter = netdev_priv(netdev);
+   struct bpf_prog *old_prog;
+
+   old_prog = xchg(>prog, prog);
+   if (old_prog) {
+   synchronize_net();
+   bpf_prog_put(old_prog);
+   }
+
+   if (netif_running(netdev))
+   e1000_reinit_locked(adapter);
+   else
+   e1000_reset(adapter);
+   return 0;
+}
+
+static bool e1000_xdp_attached(struct net_device *dev)
+{
+   struct e1000_adapter *priv = netdev_priv(dev);
+
+   return !!priv->prog;
+}
+
+static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   switch (xdp->command) {
+   case XDP_SETUP_PROG:
+   return e1000_xdp_set(dev, xdp->prog);
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = e1000_xdp_attached(dev);
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 static const struct net_device_ops e1000_netdev_ops = {
.ndo_open   = e1000_open,
.ndo_stop   = e1000_close,
@@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
 #endif
.ndo_fix_features   = e1000_fix_features,
.ndo_set_features   = e1000_set_features,
+   .ndo_xdp= e1000_xdp,
 };
 
 /**
@@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
e1000_down_and_stop(adapter);
e1000_release_manageability(adapter);
 
+   if (adapter->prog)
+   bpf_prog_put(adapter->prog);
+
unregister_netdev(netdev);
 
e1000_phy_hw_reset(hw);
@@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
struct e1000_hw *hw = >hw;
u32 rdlen, rctl, rxcsum;
 
-   if (adapter->netdev->mtu > ETH_DATA_LEN) {
+   if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
rdlen = adapter->rx_ring[0].count *
sizeof(struct e1000_rx_desc);
adapter->clean_rx = e1000_clean_jumbo_rx_irq;
@@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
dev_kfree_skb_any(buffer_info->skb);
buffer_info->skb = NULL;
}
+   if (buffer_info->page) {
+   put_page(buffer_info->page);
+   buffer_info->page = NULL;
+   }
+
buffer_info->time_stamp = 0;
/* buffer_info must be completely set up in the transmit path */
 }
@@ -3298,6 +3346,63 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
+static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
+   struct e1000_rx_buffer *rx_buffer_info,
+   unsigned int len)
+{
+   struct e1000_tx_buffer *buffer_info;
+   unsigned int i = tx_ring->next_to_use;
+
+   buffer_info = &tx_ring->buffer_info[i];
+
+   

[PATCH net-next] tcp: better use ooo_last_skb in tcp_data_queue_ofo()

2016-09-09 Thread Eric Dumazet
From: Eric Dumazet 

Willem noticed that we could avoid an rbtree lookup if the
the attempt to coalesce incoming skb to the last skb failed
for some reason.

Since most ooo additions are at the tail, this is definitely
worth adding a test and fast path.

Suggested-by: Willem de Bruijn 
Signed-off-by: Eric Dumazet 
Cc: Yaogong Wang 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a5934c4c8cd4..2e26f3eb0293 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4461,6 +4461,12 @@ coalesce_done:
skb = NULL;
goto add_sack;
}
+   /* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
+   if (!before(seq, TCP_SKB_CB(tp->ooo_last_skb)->end_seq)) {
+   parent = &tp->ooo_last_skb->rbnode;
+   p = &parent->rb_right;
+   goto insert;
+   }
 
/* Find place to insert this segment. Handle overlaps on the way. */
parent = NULL;
@@ -4503,7 +4509,7 @@ coalesce_done:
}
p = &parent->rb_right;
}
-
+insert:
/* Insert segment into RB tree. */
rb_link_node(&skb->rbnode, parent, p);
rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);




[PATCH] openvswitch: use alias for genetlink family names

2016-09-09 Thread Thadeu Lima de Souza Cascardo
When userspace tries to create datapaths and the module is not loaded,
it will simply fail. With this patch, the module will be automatically
loaded.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/openvswitch/datapath.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 524c0fd..0536ab3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2437,3 +2437,7 @@ module_exit(dp_cleanup);
 
 MODULE_DESCRIPTION("Open vSwitch switching datapath");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS_GENL_FAMILY(OVS_DATAPATH_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
-- 
2.7.4



[PATCH 2/2] openvswitch: use percpu flow stats

2016-09-09 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use them per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there was no corruption in the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/openvswitch/flow.c   | 43 +++
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 23 ---
 3 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 3609f37..2970a9f 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = get_cpu();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
spin_lock(&stats->lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
spin_lock(&stats->lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1) &&
+   likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
new_stats->tcp_flags = tcp_flags;
spin_lock_init(&new_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
stats->tcp_flags |= tcp_flags;
 unlock:
spin_unlock(&stats->lock);
+   put_cpu();
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
@@ -136,15 +139,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   /* We open code this to make sure node 0 is always considered */
-   for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
-   struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
+   /* We open code this to make sure cpu 0 is always considered */
+   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) {
+  

[PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible

2016-09-09 Thread Thadeu Lima de Souza Cascardo
On a system with only node 1 as possible, all statistics are going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---

I am providing this intermediate patch, that will be thrown out by the next one,
in case there is any need to backport this fix.

---
 net/openvswitch/flow.c   | 6 --
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..3609f37 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
 
if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
int node;
 
-   for_each_node(node) {
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
 
if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
kfree(flow->id.unmasked_key);
if (flow->sf_acts)
ovs_nla_free_flow_actions((struct sw_flow_actions __force *)flow->sf_acts);
-   for_each_node(node)
-   if (flow->stats[node])
+   /* We open code this to make sure node 0 is always considered */
+   for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map))
+   if (node != 0 && flow->stats[node])
kmem_cache_free(flow_stats_cache,
(struct flow_stats __force *)flow->stats[node]);
kmem_cache_free(flow_cache, flow);
-- 
2.7.4



[PATCH net-next 0/5] liquidio CN23XX VF support

2016-09-09 Thread Raghu Vatsavayi
Dave,

Following is the initial patch series for adding support for
VF functionality on CN23XX devices. Please apply patches in
the following order as some of the patches depend on earlier
patches.

Raghu Vatsavayi (5):
  liquidio CN23XX: VF config support
  liquidio CN23XX: sriov enable
  liquidio CN23XX: Mailbox support
  liquidio CN23XX: mailbox interrupt processing
  liquidio CN23XX: VF related operations

 drivers/net/ethernet/cavium/liquidio/Makefile  |   1 +
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 700 +++--
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h|   3 +
 .../net/ethernet/cavium/liquidio/cn66xx_device.c   |  13 +-
 .../net/ethernet/cavium/liquidio/cn68xx_device.c   |  13 +-
 drivers/net/ethernet/cavium/liquidio/lio_core.c|  32 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 366 +--
 .../net/ethernet/cavium/liquidio/liquidio_common.h |  11 +-
 .../net/ethernet/cavium/liquidio/octeon_config.h   |   8 +
 .../net/ethernet/cavium/liquidio/octeon_console.c  |  16 +-
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  11 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  32 +-
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c |  28 +-
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  | 322 ++
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  | 116 
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |  12 +-
 .../net/ethernet/cavium/liquidio/request_manager.c |   9 +-
 17 files changed, 1409 insertions(+), 284 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h

-- 
1.8.3.1



[PATCH net-next 1/5] liquidio CN23XX: VF config support

2016-09-09 Thread Raghu Vatsavayi
Adds support for VF configuration. It also limits the number
of rings per VF based on the total number of VFs configured.

Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
Signed-off-by: Raghu Vatsavayi 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 260 -
 .../net/ethernet/cavium/liquidio/cn66xx_device.c   |  13 +-
 .../net/ethernet/cavium/liquidio/octeon_config.h   |   5 +
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  10 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   9 +-
 5 files changed, 228 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index bddb198..a2953d5 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -312,11 +312,12 @@ static void cn23xx_setup_global_mac_regs(struct octeon_device *oct)
u64 reg_val;
u16 mac_no = oct->pcie_port;
u16 pf_num = oct->pf_num;
+   u64 temp;
 
/* programming SRN and TRS for each MAC(0..3)  */
 
-   dev_dbg(&oct->pci_dev->dev, "%s:Using pcie port %d\n",
-   __func__, mac_no);
+   pr_devel("%s:Using pcie port %d\n",
+__func__, mac_no);
/* By default, mapping all 64 IOQs to  a single MACs */
 
reg_val =
@@ -333,13 +334,21 @@ static void cn23xx_setup_global_mac_regs(struct octeon_device *oct)
/* setting TRS <23:16> */
reg_val = reg_val |
  (oct->sriov_info.trs << CN23XX_PKT_MAC_CTL_RINFO_TRS_BIT_POS);
+   /* setting RPVF <39:32> */
+   temp = oct->sriov_info.rings_per_vf & 0xff;
+   reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_RPVF_BIT_POS);
+
+   /* setting NVFS <55:48> */
+   temp = oct->sriov_info.num_vfs & 0xff;
+   reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_NVFS_BIT_POS);
+
/* write these settings to MAC register */
octeon_write_csr64(oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num),
   reg_val);
 
-   dev_dbg(&oct->pci_dev->dev, "SLI_PKT_MAC(%d)_PF(%d)_RINFO : 0x%016llx\n",
-   mac_no, pf_num, (u64)octeon_read_csr64
-   (oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num)));
+   pr_devel("SLI_PKT_MAC(%d)_PF(%d)_RINFO : 0x%016llx\n",
+mac_no, pf_num, (u64)octeon_read_csr64
+(oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num)));
 }
 
 static int cn23xx_reset_io_queues(struct octeon_device *oct)
@@ -404,6 +413,7 @@ static int cn23xx_pf_setup_global_input_regs(struct octeon_device *oct)
u64 intr_threshold, reg_val;
struct octeon_instr_queue *iq;
struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip;
+   u64 vf_num;
 
pf_num = oct->pf_num;
 
@@ -420,6 +430,16 @@ static int cn23xx_pf_setup_global_input_regs(struct octeon_device *oct)
*/
for (q_no = 0; q_no < ern; q_no++) {
reg_val = oct->pcie_port << CN23XX_PKT_INPUT_CTL_MAC_NUM_POS;
+
+   /* for VF assigned queues. */
+   if (q_no < oct->sriov_info.pf_srn) {
+   vf_num = q_no / oct->sriov_info.rings_per_vf;
+   vf_num += 1; /* VF1, VF2, */
+   } else {
+   vf_num = 0;
+   }
+
+   reg_val |= vf_num << CN23XX_PKT_INPUT_CTL_VF_NUM_POS;
reg_val |= pf_num << CN23XX_PKT_INPUT_CTL_PF_NUM_POS;
 
octeon_write_csr64(oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no),
@@ -590,8 +610,8 @@ static void cn23xx_setup_iq_regs(struct octeon_device *oct, u32 iq_no)
(u8 *)oct->mmio[0].hw_addr + CN23XX_SLI_IQ_DOORBELL(iq_no);
iq->inst_cnt_reg =
(u8 *)oct->mmio[0].hw_addr + CN23XX_SLI_IQ_INSTR_COUNT64(iq_no);
-   dev_dbg(&oct->pci_dev->dev, "InstQ[%d]:dbell reg @ 0x%p instcnt_reg @ 0x%p\n",
-   iq_no, iq->doorbell_reg, iq->inst_cnt_reg);
+   pr_devel("InstQ[%d]:dbell reg @ 0x%p instcnt_reg @ 0x%p\n",
+iq_no, iq->doorbell_reg, iq->inst_cnt_reg);
 
/* Store the current instruction counter (used in flush_iq
 * calculation)
@@ -822,7 +842,7 @@ static u64 cn23xx_pf_msix_interrupt_handler(void *dev)
u64 ret = 0;
struct octeon_droq *droq = oct->droq[ioq_vector->droq_index];
 
-   dev_dbg(&oct->pci_dev->dev, "In %s octeon_dev @ %p\n", __func__, oct);
+   pr_devel("In %s octeon_dev @ %p\n", __func__, oct);
 
if (!droq) {
dev_err(&oct->pci_dev->dev, "23XX bringup FIXME: oct pfnum:%d ioq_vector->ioq_num :%d droq is NULL\n",
@@ -862,7 +882,7 @@ static irqreturn_t cn23xx_interrupt_handler(void *dev)
struct octeon_cn23xx_pf *cn23xx = (struct 

Re: [PATCH] net: ip, diag -- Add diag interface for raw sockets

2016-09-09 Thread Cyrill Gorcunov
On Fri, Sep 09, 2016 at 12:55:13PM -0700, Eric Dumazet wrote:
> > +
> > +   rep = nlmsg_new(sizeof(struct inet_diag_msg) +
> > +   sizeof(struct inet_diag_meminfo) + 64,
> > +   GFP_KERNEL);
> > +   if (!rep)
> 
> There is a missing sock_put(sk)
> 
> > +   return -ENOMEM;
> > +
> > +   err = inet_sk_diag_fill(sk, NULL, rep, r,
> > +   sk_user_ns(NETLINK_CB(in_skb).sk),
> > +   NETLINK_CB(in_skb).portid,
> > +   nlh->nlmsg_seq, 0, nlh);
> 
> sock_put(sk);

Oh, missed. Thanks a lot, Eric, will update!


[PATCH net-next 4/5] liquidio CN23XX: mailbox interrupt processing

2016-09-09 Thread Raghu Vatsavayi
Adds support for mailbox interrupt processing of various
commands.

Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
Signed-off-by: Raghu Vatsavayi 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 157 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c|   8 +-
 .../net/ethernet/cavium/liquidio/octeon_device.c   |   1 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   6 +
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c |  28 ++--
 5 files changed, 184 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index b3c61302..4d975d8 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -30,6 +30,7 @@
 #include "octeon_device.h"
 #include "cn23xx_pf_device.h"
 #include "octeon_main.h"
+#include "octeon_mailbox.h"
 
 #define RESET_NOTDONE 0
 #define RESET_DONE 1
@@ -682,6 +683,118 @@ static void cn23xx_setup_oq_regs(struct octeon_device *oct, u32 oq_no)
}
 }
 
+static void cn23xx_pf_mbox_thread(struct work_struct *work)
+{
+   struct cavium_wk *wk = (struct cavium_wk *)work;
+   struct octeon_mbox *mbox = (struct octeon_mbox *)wk->ctxptr;
+   struct octeon_device *oct = mbox->oct_dev;
+   u64 mbox_int_val, val64;
+   u32 q_no, i;
+
+   if (oct->rev_id < OCTEON_CN23XX_REV_1_1) {
+   /*read and clear by writing 1*/
+   mbox_int_val = readq(mbox->mbox_int_reg);
+   writeq(mbox_int_val, mbox->mbox_int_reg);
+
+   for (i = 0; i < oct->sriov_info.num_vfs; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+
+   val64 = readq(oct->mbox[q_no]->mbox_write_reg);
+
+   if (val64 && (val64 != OCTEON_PFVFACK)) {
+   if (octeon_mbox_read(oct->mbox[q_no]))
+   octeon_mbox_process_message(
+   oct->mbox[q_no]);
+   }
+   }
+
+   schedule_delayed_work(&wk->work, msecs_to_jiffies(10));
+   } else {
+   octeon_mbox_process_message(mbox);
+   }
+}
+
+static int cn23xx_setup_pf_mbox(struct octeon_device *oct)
+{
+   u32 q_no, i;
+   u16 mac_no = oct->pcie_port;
+   u16 pf_num = oct->pf_num;
+   struct octeon_mbox *mbox = NULL;
+
+   if (!oct->sriov_info.num_vfs)
+   return 0;
+
+   for (i = 0; i < oct->sriov_info.num_vfs; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+
+   mbox = vmalloc(sizeof(*mbox));
+   if (!mbox)
+   goto free_mbox;
+
+   memset(mbox, 0, sizeof(struct octeon_mbox));
+
+   spin_lock_init(&mbox->lock);
+
+   mbox->oct_dev = oct;
+
+   mbox->q_no = q_no;
+
+   mbox->state = OCTEON_MBOX_STATE_IDLE;
+
+   /* PF mbox interrupt reg */
+   mbox->mbox_int_reg = (u8 *)oct->mmio[0].hw_addr +
+CN23XX_SLI_MAC_PF_MBOX_INT(mac_no, pf_num);
+
+   /* PF writes into SIG0 reg */
+   mbox->mbox_write_reg = (u8 *)oct->mmio[0].hw_addr +
+  CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 0);
+
+   /* PF reads from SIG1 reg */
+   mbox->mbox_read_reg = (u8 *)oct->mmio[0].hw_addr +
+ CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 1);
+
+   /*Mail Box Thread creation*/
+   INIT_DELAYED_WORK(&mbox->mbox_poll_wk.work,
+ cn23xx_pf_mbox_thread);
+   mbox->mbox_poll_wk.ctxptr = (void *)mbox;
+
+   oct->mbox[q_no] = mbox;
+
+   writeq(OCTEON_PFVFSIG, mbox->mbox_read_reg);
+   }
+
+   if (oct->rev_id < OCTEON_CN23XX_REV_1_1)
+   schedule_delayed_work(&oct->mbox[0]->mbox_poll_wk.work,
+ msecs_to_jiffies(0));
+
+   return 0;
+
+free_mbox:
+   while (i) {
+   i--;
+   vfree(oct->mbox[i]);
+   }
+
+   return 1;
+}
+
+static int cn23xx_free_pf_mbox(struct octeon_device *oct)
+{
+   u32 q_no, i;
+
+   if (!oct->sriov_info.num_vfs)
+   return 0;
+
+   for (i = 0; i < oct->sriov_info.num_vfs; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+   cancel_delayed_work_sync(
+   &oct->mbox[q_no]->mbox_poll_wk.work);
+   vfree(oct->mbox[q_no]);
+   }
+
+   return 0;
+}
+
 static int cn23xx_enable_io_queues(struct octeon_device *oct)
 {
u64 reg_val;
@@ -876,6 +989,29 @@ static u64 

[PATCH net-next 5/5] liquidio CN23XX: VF related operations

2016-09-09 Thread Raghu Vatsavayi
Adds support for VF-related operations like MAC address, VLAN,
and link changes.

Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
Signed-off-by: Raghu Vatsavayi 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c|  22 +++
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h|   3 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 211 +
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   5 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   8 +
 5 files changed, 249 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 4d975d8..49efce1 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "liquidio_common.h"
 #include "octeon_droq.h"
 #include "octeon_iq.h"
@@ -1541,3 +1542,24 @@ int cn23xx_fw_loaded(struct octeon_device *oct)
val = octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1);
return (val >> 1) & 1ULL;
 }
+
+void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
+   u8 *mac)
+{
+   if (oct->sriov_info.vf_drv_loaded_mask & BIT_ULL(vfidx)) {
+   struct octeon_mbox_cmd mbox_cmd;
+
+   mbox_cmd.msg.u64 = 0;
+   mbox_cmd.msg.s.type = OCTEON_MBOX_REQUEST;
+   mbox_cmd.msg.s.resp_needed = 0;
+   mbox_cmd.msg.s.cmd = OCTEON_PF_CHANGED_VF_MACADDR;
+   mbox_cmd.msg.s.len = 1;
+   mbox_cmd.recv_len = 0;
+   mbox_cmd.recv_status = 0;
+   mbox_cmd.fn = NULL;
+   mbox_cmd.fn_arg = 0;
+   ether_addr_copy(mbox_cmd.msg.s.params, mac);
+   mbox_cmd.q_no = vfidx * oct->sriov_info.rings_per_vf;
+   octeon_mbox_write(oct, &mbox_cmd);
+   }
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
index 21b5c90..20a9dc5 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
@@ -56,4 +56,7 @@ u32 cn23xx_pf_get_oq_ticks(struct octeon_device *oct, u32 time_intr_in_us);
 void cn23xx_dump_pf_initialized_regs(struct octeon_device *oct);
 
 int cn23xx_fw_loaded(struct octeon_device *oct);
+
+void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
+   u8 *mac);
 #endif
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index e480c23..3b92036 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3592,6 +3592,148 @@ static void liquidio_del_vxlan_port(struct net_device 
*netdev,
OCTNET_CMD_VXLAN_PORT_DEL);
 }
 
+static int __liquidio_set_vf_mac(struct net_device *netdev, int vfidx,
+u8 *mac, bool is_admin_assigned)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   struct octnic_ctrl_pkt nctrl;
+
+   if (!is_valid_ether_addr(mac))
+   return -EINVAL;
+
+   if (vfidx < 0 || vfidx >= oct->sriov_info.num_vfs)
+   return -EINVAL;
+
+   memset(&nctrl, 0, sizeof(struct octnic_ctrl_pkt));
+
+   nctrl.ncmd.u64 = 0;
+   nctrl.ncmd.s.cmd = OCTNET_CMD_CHANGE_MACADDR;
+   /* vfidx is 0 based, but vf_num (param1) is 1 based */
+   nctrl.ncmd.s.param1 = vfidx + 1;
+   nctrl.ncmd.s.param2 = (is_admin_assigned ? 1 : 0);
+   nctrl.ncmd.s.more = 1;
+   nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
+   nctrl.cb_fn = 0;
+   nctrl.wait_time = 100;
+
+   nctrl.udd[0] = 0;
+   /* The MAC Address is presented in network byte order. */
+   ether_addr_copy((u8 *)&nctrl.udd[0] + 2, mac);
+
+   oct->sriov_info.vf_macaddr[vfidx] = nctrl.udd[0];
+
+   octnet_send_nic_ctrl_pkt(oct, &nctrl);
+
+   return 0;
+}
+
+static int liquidio_set_vf_mac(struct net_device *netdev, int vfidx, u8 *mac)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   int retval;
+
+   retval = __liquidio_set_vf_mac(netdev, vfidx, mac, true);
+   if (!retval)
+   cn23xx_tell_vf_its_macaddr_changed(oct, vfidx, mac);
+
+   return retval;
+}
+
+static int liquidio_set_vf_vlan(struct net_device *netdev, int vfidx,
+   u16 vlan, u8 qos)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   struct octnic_ctrl_pkt nctrl;
+   u16 vlantci;
+
+   

[PATCH net-next 3/5] liquidio CN23XX: Mailbox support

2016-09-09 Thread Raghu Vatsavayi
Adds support for mailbox communication between PF and VF.

Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
Signed-off-by: Raghu Vatsavayi 
---
 drivers/net/ethernet/cavium/liquidio/Makefile  |   1 +
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c|   4 +-
 .../net/ethernet/cavium/liquidio/cn68xx_device.c   |  13 +-
 drivers/net/ethernet/cavium/liquidio/lio_core.c|  32 ++
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   6 +-
 .../net/ethernet/cavium/liquidio/octeon_console.c  |  16 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   4 +
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  | 322 +
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  | 116 
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |  12 +-
 .../net/ethernet/cavium/liquidio/request_manager.c |   9 +-
 11 files changed, 507 insertions(+), 28 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h

diff --git a/drivers/net/ethernet/cavium/liquidio/Makefile 
b/drivers/net/ethernet/cavium/liquidio/Makefile
index 5a27b2a..14958de 100644
--- a/drivers/net/ethernet/cavium/liquidio/Makefile
+++ b/drivers/net/ethernet/cavium/liquidio/Makefile
@@ -11,6 +11,7 @@ liquidio-$(CONFIG_LIQUIDIO) += lio_ethtool.o \
cn66xx_device.o\
cn68xx_device.o\
cn23xx_pf_device.o \
+   octeon_mailbox.o   \
octeon_mem_ops.o   \
octeon_droq.o  \
octeon_nic.o
diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index deec869..b3c61302 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -270,8 +270,8 @@ static void cn23xx_enable_error_reporting(struct 
octeon_device *oct)
 
regval |= 0xf; /* Enable Link error reporting */
 
-   dev_dbg(&oct->pci_dev->dev, "OCTEON[%d]: Enabling PCI-E error 
reporting..\n",
-   oct->octeon_id);
+   pr_devel("OCTEON[%d]: Enabling PCI-E error reporting..\n",
+oct->octeon_id);
pci_write_config_dword(oct->pci_dev, CN23XX_CONFIG_PCIE_DEVCTL, regval);
 }
 
diff --git a/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c
index dbf3566..424125e 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn68xx_device.c
@@ -19,6 +19,7 @@
 * This file may also be available under a different license from Cavium.
 * Contact Cavium, Inc. for more information
 **/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
 #include "liquidio_common.h"
@@ -37,8 +38,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct)
u32 fifo_sizes[6] = { 3, 3, 1, 1, 1, 8 };
 
lio_pci_writeq(oct, CN6XXX_DPI_DMA_CTL_MASK, CN6XXX_DPI_DMA_CONTROL);
-   dev_dbg(&oct->pci_dev->dev, "DPI_DMA_CONTROL: 0x%016llx\n",
-   lio_pci_readq(oct, CN6XXX_DPI_DMA_CONTROL));
+   pr_devel("DPI_DMA_CONTROL: 0x%016llx\n",
+lio_pci_readq(oct, CN6XXX_DPI_DMA_CONTROL));
 
for (i = 0; i < 6; i++) {
/* Prevent service of instruction queue for all DMA engines
@@ -47,8 +48,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct)
 */
lio_pci_writeq(oct, 0, CN6XXX_DPI_DMA_ENG_ENB(i));
lio_pci_writeq(oct, fifo_sizes[i], CN6XXX_DPI_DMA_ENG_BUF(i));
-   dev_dbg(&oct->pci_dev->dev, "DPI_ENG_BUF%d: 0x%016llx\n", i,
-   lio_pci_readq(oct, CN6XXX_DPI_DMA_ENG_BUF(i)));
+   pr_devel("DPI_ENG_BUF%d: 0x%016llx\n", i,
+lio_pci_readq(oct, CN6XXX_DPI_DMA_ENG_BUF(i)));
}
 
/* DPI_SLI_PRT_CFG has MPS and MRRS settings that will be set
@@ -56,8 +57,8 @@ static void lio_cn68xx_set_dpi_regs(struct octeon_device *oct)
 */
 
lio_pci_writeq(oct, 1, CN6XXX_DPI_CTL);
-   dev_dbg(&oct->pci_dev->dev, "DPI_CTL: 0x%016llx\n",
-   lio_pci_readq(oct, CN6XXX_DPI_CTL));
+   pr_devel("DPI_CTL: 0x%016llx\n",
+lio_pci_readq(oct, CN6XXX_DPI_CTL));
 }
 
 static int lio_cn68xx_soft_reset(struct octeon_device *oct)
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 201eddb..4626b1f 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -264,3 +264,35 @@ 

[PATCH net-next 2/5] liquidio CN23XX: sriov enable

2016-09-09 Thread Raghu Vatsavayi
Adds support for enabling sriov on CN23XX cards.

Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
Signed-off-by: Raghu Vatsavayi 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 257 +++--
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 147 
 .../net/ethernet/cavium/liquidio/octeon_config.h   |   3 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   5 +
 4 files changed, 241 insertions(+), 171 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index a2953d5..deec869 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -19,7 +19,7 @@
 * This file may also be available under a different license from Cavium.
 * Contact Cavium, Inc. for more information
 **/
-
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
 #include 
@@ -52,174 +52,174 @@ void cn23xx_dump_pf_initialized_regs(struct octeon_device 
*oct)
struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip;
 
/*In cn23xx_soft_reset*/
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%llx\n",
-   "CN23XX_WIN_WR_MASK_REG", CVM_CAST64(CN23XX_WIN_WR_MASK_REG),
-   CVM_CAST64(octeon_read_csr64(oct, CN23XX_WIN_WR_MASK_REG)));
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n",
-   "CN23XX_SLI_SCRATCH1", CVM_CAST64(CN23XX_SLI_SCRATCH1),
-   CVM_CAST64(octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1)));
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n",
-   "CN23XX_RST_SOFT_RST", CN23XX_RST_SOFT_RST,
-   lio_pci_readq(oct, CN23XX_RST_SOFT_RST));
+   pr_devel("%s[%llx] : 0x%llx\n",
+"CN23XX_WIN_WR_MASK_REG", CVM_CAST64(CN23XX_WIN_WR_MASK_REG),
+CVM_CAST64(octeon_read_csr64(oct, CN23XX_WIN_WR_MASK_REG)));
+   pr_devel("%s[%llx] : 0x%016llx\n",
+"CN23XX_SLI_SCRATCH1", CVM_CAST64(CN23XX_SLI_SCRATCH1),
+CVM_CAST64(octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1)));
+   pr_devel("%s[%llx] : 0x%016llx\n",
+"CN23XX_RST_SOFT_RST", CN23XX_RST_SOFT_RST,
+lio_pci_readq(oct, CN23XX_RST_SOFT_RST));
 
/*In cn23xx_set_dpi_regs*/
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n",
-   "CN23XX_DPI_DMA_CONTROL", CN23XX_DPI_DMA_CONTROL,
-   lio_pci_readq(oct, CN23XX_DPI_DMA_CONTROL));
+   pr_devel("%s[%llx] : 0x%016llx\n",
+"CN23XX_DPI_DMA_CONTROL", CN23XX_DPI_DMA_CONTROL,
+lio_pci_readq(oct, CN23XX_DPI_DMA_CONTROL));
 
for (i = 0; i < 6; i++) {
-   dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n",
-   "CN23XX_DPI_DMA_ENG_ENB", i,
-   CN23XX_DPI_DMA_ENG_ENB(i),
-   lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_ENB(i)));
-   dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n",
-   "CN23XX_DPI_DMA_ENG_BUF", i,
-   CN23XX_DPI_DMA_ENG_BUF(i),
-   lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_BUF(i)));
+   pr_devel("%s(%d)[%llx] : 0x%016llx\n",
+"CN23XX_DPI_DMA_ENG_ENB", i,
+CN23XX_DPI_DMA_ENG_ENB(i),
+lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_ENB(i)));
+   pr_devel("%s(%d)[%llx] : 0x%016llx\n",
+"CN23XX_DPI_DMA_ENG_BUF", i,
+CN23XX_DPI_DMA_ENG_BUF(i),
+lio_pci_readq(oct, CN23XX_DPI_DMA_ENG_BUF(i)));
}
 
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n", "CN23XX_DPI_CTL",
-   CN23XX_DPI_CTL, lio_pci_readq(oct, CN23XX_DPI_CTL));
+   pr_devel("%s[%llx] : 0x%016llx\n", "CN23XX_DPI_CTL",
+CN23XX_DPI_CTL, lio_pci_readq(oct, CN23XX_DPI_CTL));
 
/*In cn23xx_setup_pcie_mps and cn23xx_setup_pcie_mrrs */
pci_read_config_dword(oct->pci_dev, CN23XX_CONFIG_PCIE_DEVCTL, &regval);
-   dev_dbg(&oct->pci_dev->dev, "%s[%llx] : 0x%016llx\n",
-   "CN23XX_CONFIG_PCIE_DEVCTL",
-   CVM_CAST64(CN23XX_CONFIG_PCIE_DEVCTL), CVM_CAST64(regval));
+   pr_devel("%s[%llx] : 0x%016llx\n",
+"CN23XX_CONFIG_PCIE_DEVCTL",
+CVM_CAST64(CN23XX_CONFIG_PCIE_DEVCTL), CVM_CAST64(regval));
 
-   dev_dbg(&oct->pci_dev->dev, "%s(%d)[%llx] : 0x%016llx\n",
-   "CN23XX_DPI_SLI_PRTX_CFG", oct->pcie_port,
-   CN23XX_DPI_SLI_PRTX_CFG(oct->pcie_port),
-   lio_pci_readq(oct, CN23XX_DPI_SLI_PRTX_CFG(oct->pcie_port)));
+   

Re: [PATCH] net: ip, diag -- Add diag interface for raw sockets

2016-09-09 Thread Eric Dumazet
On Fri, 2016-09-09 at 21:26 +0300, Cyrill Gorcunov wrote:

...

> +static int raw_diag_dump_one(struct sk_buff *in_skb,
> +  const struct nlmsghdr *nlh,
> +  const struct inet_diag_req_v2 *r)
> +{
> + struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
> + struct net *net = sock_net(in_skb->sk);
> + struct sock *sk = NULL, *s;
> + int err = -ENOENT, slot;
> + struct sk_buff *rep;
> +
> + if (IS_ERR(hashinfo))
> + return PTR_ERR(hashinfo);
> +
> + read_lock(&hashinfo->lock);
> + for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
> + sk_for_each(s, &hashinfo->ht[slot]) {
> + sk = raw_lookup(net, s, r);
> + if (sk)
> + break;
> + }
> + }
> + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
> + sk = NULL;
> + read_unlock(&hashinfo->lock);
> + if (!sk)
> + return -ENOENT;
> +
> + rep = nlmsg_new(sizeof(struct inet_diag_msg) +
> + sizeof(struct inet_diag_meminfo) + 64,
> + GFP_KERNEL);
> + if (!rep)

There is a missing sock_put(sk)

> + return -ENOMEM;
> +
> + err = inet_sk_diag_fill(sk, NULL, rep, r,
> + sk_user_ns(NETLINK_CB(in_skb).sk),
> + NETLINK_CB(in_skb).portid,
> + nlh->nlmsg_seq, 0, nlh);

sock_put(sk);

> + if (err < 0) {
> + kfree_skb(rep);


> + return err;
> + }
> +
> + err = netlink_unicast(net->diag_nlsk, rep,
> +   NETLINK_CB(in_skb).portid,
> +   MSG_DONTWAIT);
> + if (err > 0)
> + err = 0;
> + return err;
> +}
> +



Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-09 Thread Tom Herbert
On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer
 wrote:
> On Thu, 8 Sep 2016 20:22:04 -0700
> Alexei Starovoitov  wrote:
>
>> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >
>> > I'm sorry but I have a problem with this patch!
>>
>> is it because the variable is called 'xdp_doorbell'?
>> Frankly I see nothing scary in this patch.
>> It extends existing code by adding a flag to ring doorbell or not.
>> The end of rx napi is used as an obvious heuristic to flush the pipe.
>> Looks pretty generic to me.
>> The same code can be used for non-xdp as well once we figure out
>> good algorithm for xmit_more in the stack.
>
> What I'm proposing can also be used by the normal stack.
>
>> > Looking at this patch, I want to bring up a fundamental architectural
>> > concern with the development direction of XDP transmit.
>> >
>> >
>> > What you are trying to implement, with delaying the doorbell, is
>> > basically TX bulking for TX_XDP.
>> >
>> >  Why not implement a TX bulking interface directly instead?!?
>> >
>> > Yes, the tailptr/doorbell is the most costly operation, but why not
>> > also take advantage of the benefits of bulking for other parts of the
>> > code? (benefit is smaller, by every cycles counts in this area)
>> >
>> > This whole XDP exercise is about avoiding having a transaction cost per
>> > packet, that reads "bulking" or "bundling" of packets, where possible.
>> >
>> >  Lets do bundling/bulking from the start!
>>
>> mlx4 already does bulking and this proposed mlx5 set of patches
>> does bulking as well.
>> See nothing wrong about it. RX side processes the packets and
>> when it's done it tells TX to xmit whatever it collected.
>
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more efficiently.
>
> Let me explain the problem I see a little more clearly then, so you
> hopefully see where I'm going.
>
> Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, then you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX, this is a costly operation.
>
>
>> > The reason behind the xmit_more API is that we could not change the
>> > API of all the drivers.  And we found that calling an explicit NDO
>> > flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
>> > would hit the common single packet use-case.
>> >
>> > It should be really easy to build a bundle of packets that need XDP_TX
>> > action, especially given you only have a single destination "port".
>> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>>
>> not sure what are you proposing here?
>> Sounds like you want to extend it to multi port in the future?
>> Sure. The proposed code is easily extendable.
>>
>> Or you want to see something like a link list of packets
>> or an array of packets that RX side is preparing and then
>> send the whole array/list to TX port?
>> I don't think that would be efficient, since it would mean
>> unnecessary copy of pointers.
>
> I just explain it will be more efficient due to better use of icache.
>
>
>> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> > out other interfaces.  I also want bulk transmit from day-1 here.  It
>> > is slightly more tricky to sort packets for multiple outgoing
>> > interfaces efficiently in the poll loop.
>>
>> I don't think so. Multi port is natural extension to this set of patches.
>> With multi port the end of RX will tell multiple ports (that were
>> used to tx) to ring the bell. Pretty trivial and doesn't involve any
>> extra arrays or link lists.
>
> So, have you solved the problem exclusive access to a TX ring of a
> remote/different net_device when sending?
>
> In you design you assume there exist many TX ring available for other
> devices to access.  In my design I also want to support devices that
> doesn't have this HW capability, and e.g. only have one TX queue.
>
Right, but segregating TX queues used by the stack from the those used
by XDP is pretty fundamental to the design. If we start mixing them,
then we need to pull in several features (such as BQL which seems like
what you're proposing) into the XDP path. If this starts to slow
things down or we need to reinvent a bunch of existing features to not
use skbuffs that seems to run contrary to "the simple as possible"
model for XDP-- may as well use the regular stack at that point
maybe...

Tom

>
>> > But the mSwitch[1] article actually already solved this destination
>> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> > understanding the next steps, for a smarter data structure, when
>> > starting to have more TX "ports".  And perhaps align your single
>> > XDP_TX destination data structure to this future development.
>> >
>> > [1] 

[PATCH] ATM-iphase: Use kmalloc_array() in tx_init()

2016-09-09 Thread SF Markus Elfring
From: Markus Elfring 
Date: Fri, 9 Sep 2016 20:40:16 +0200

* Multiplications for the size determination of memory allocations
  indicated that array data structures should be processed.
  Thus use the corresponding function "kmalloc_array".

  This issue was detected by using the Coccinelle software.

* Replace the specification of data types by pointer dereferences
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring 
---
 drivers/atm/iphase.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/atm/iphase.c b/drivers/atm/iphase.c
index 809dd1e..9d8807e 100644
--- a/drivers/atm/iphase.c
+++ b/drivers/atm/iphase.c
@@ -1975,7 +1975,9 @@ static int tx_init(struct atm_dev *dev)
buf_desc_ptr++;   
tx_pkt_start += iadev->tx_buf_sz;  
}  
-iadev->tx_buf = kmalloc(iadev->num_tx_desc*sizeof(struct 
cpcs_trailer_desc), GFP_KERNEL);
+   iadev->tx_buf = kmalloc_array(iadev->num_tx_desc,
+ sizeof(*iadev->tx_buf),
+ GFP_KERNEL);
 if (!iadev->tx_buf) {
 printk(KERN_ERR DEV_LABEL " couldn't get mem\n");
goto err_free_dle;
@@ -1995,8 +1997,9 @@ static int tx_init(struct atm_dev *dev)
   sizeof(*cpcs),
   DMA_TO_DEVICE);
 }
-iadev->desc_tbl = kmalloc(iadev->num_tx_desc *
-   sizeof(struct desc_tbl_t), GFP_KERNEL);
+   iadev->desc_tbl = kmalloc_array(iadev->num_tx_desc,
+   sizeof(*iadev->desc_tbl),
+   GFP_KERNEL);
if (!iadev->desc_tbl) {
printk(KERN_ERR DEV_LABEL " couldn't get mem\n");
goto err_free_all_tx_bufs;
@@ -2124,7 +2127,9 @@ static int tx_init(struct atm_dev *dev)
memset((caddr_t)(iadev->seg_ram+i),  0, iadev->num_vc*4);
vc = (struct main_vc *)iadev->MAIN_VC_TABLE_ADDR;  
evc = (struct ext_vc *)iadev->EXT_VC_TABLE_ADDR;  
-iadev->testTable = kmalloc(sizeof(long)*iadev->num_vc, GFP_KERNEL); 
+   iadev->testTable = kmalloc_array(iadev->num_vc,
+sizeof(*iadev->testTable),
+GFP_KERNEL);
 if (!iadev->testTable) {
printk("Get freepage  failed\n");
   goto err_free_desc_tbl;
-- 
2.10.0



[PATCH] net: ip, diag -- Add diag interface for raw sockets

2016-09-09 Thread Cyrill Gorcunov
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have such an interface. Thus add it.

CC: David S. Miller 
CC: Eric Dumazet 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---

Take a look please, once time permit. Hopefully I didn't
miss something obvious, tested as "ss -n -A raw" for modified
iproute2 instance and c/r for trivial application which has
raw sockets opened. A patch for ss tool is at https://goo.gl/VFQ93L
for the reference, will send it out then.

 include/net/raw.h   |5 +
 include/net/rawv6.h |5 +
 net/ipv4/Kconfig|8 ++
 net/ipv4/Makefile   |1 
 net/ipv4/raw.c  |6 +
 net/ipv4/raw_diag.c |  192 
 net/ipv6/raw.c  |6 +
 7 files changed, 219 insertions(+), 4 deletions(-)

Index: linux-ml.git/include/net/raw.h
===
--- linux-ml.git.orig/include/net/raw.h
+++ linux-ml.git/include/net/raw.h
@@ -23,6 +23,11 @@
 
 extern struct proto raw_prot;
 
+extern struct raw_hashinfo raw_v4_hashinfo;
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+unsigned short num, __be32 raddr,
+__be32 laddr, int dif);
+
 void raw_icmp_error(struct sk_buff *, int, u32);
 int raw_local_deliver(struct sk_buff *, int);
 
Index: linux-ml.git/include/net/rawv6.h
===
--- linux-ml.git.orig/include/net/rawv6.h
+++ linux-ml.git/include/net/rawv6.h
@@ -3,6 +3,11 @@
 
 #include 
 
+extern struct raw_hashinfo raw_v6_hashinfo;
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+unsigned short num, const struct in6_addr 
*loc_addr,
+const struct in6_addr *rmt_addr, int dif);
+
 void raw6_icmp_error(struct sk_buff *, int nexthdr,
u8 type, u8 code, int inner_offset, __be32);
 bool raw6_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/net/ipv4/Kconfig
===
--- linux-ml.git.orig/net/ipv4/Kconfig
+++ linux-ml.git/net/ipv4/Kconfig
@@ -430,6 +430,14 @@ config INET_UDP_DIAG
  Support for UDP socket monitoring interface used by the ss tool.
  If unsure, say Y.
 
+config INET_RAW_DIAG
+   tristate "RAW: socket monitoring interface"
+   depends on INET_DIAG && (IPV6 || IPV6=n)
+   default n
+   ---help---
+ Support for RAW socket monitoring interface used by the ss tool.
+ If unsure, say Y.
+
 config INET_DIAG_DESTROY
bool "INET: allow privileged process to administratively close sockets"
depends on INET_DIAG
Index: linux-ml.git/net/ipv4/Makefile
===
--- linux-ml.git.orig/net/ipv4/Makefile
+++ linux-ml.git/net/ipv4/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER)   += netfilter.o n
 obj-$(CONFIG_INET_DIAG) += inet_diag.o 
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
+obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
 obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
Index: linux-ml.git/net/ipv4/raw.c
===
--- linux-ml.git.orig/net/ipv4/raw.c
+++ linux-ml.git/net/ipv4/raw.c
@@ -89,9 +89,10 @@ struct raw_frag_vec {
int hlen;
 };
 
-static struct raw_hashinfo raw_v4_hashinfo = {
+struct raw_hashinfo raw_v4_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
 };
+EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
 
 int raw_hash_sk(struct sock *sk)
 {
@@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(raw_unhash_sk);
 
-static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
unsigned short num, __be32 raddr, __be32 laddr, int dif)
 {
sk_for_each_from(sk) {
@@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru
 found:
return sk;
 }
+EXPORT_SYMBOL_GPL(__raw_v4_lookup);
 
 /*
  * 0 - deliver
Index: linux-ml.git/net/ipv4/raw_diag.c
===
--- /dev/null
+++ linux-ml.git/net/ipv4/raw_diag.c
@@ -0,0 +1,192 @@
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+
+#ifdef pr_fmt
+# undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static 

RE: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz

2016-09-09 Thread Chris Brandt
On 9/9/2016, Sergei Shtylyov wrote:
> > sh_eth_private *mdp)  {
> > if (sh_eth_is_rz_fast_ether(mdp)) {
> > sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */
> > +   sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL,
> > +TSU_FWSLC);/* Enable POST registers */
> > return;
> > }
> 
> Wait, don't you also need to write 0s to the POST registers like done
> at the end of this function?

Nope.

The sh_eth_chip_reset() function will write to register ARSTR which will do a 
HW reset on the block and clear all the registers, including all the POST 
registers.

static struct sh_eth_cpu_data r7s72100_data = {
.chip_reset = sh_eth_chip_reset,


So, before sh_eth_tsu_init() is ever called, the hardware will always be reset.


/* initialize first or needed device */
if (!devno || pd->needs_init) {
if (mdp->cd->chip_reset)
mdp->cd->chip_reset(ndev);

if (mdp->cd->tsu) {
/* TSU init (Init only)*/
sh_eth_tsu_init(mdp);
}
}


Therefore there is no reason to set the POST registers back to 0 because they 
are already at 0 from the reset.


Chris




Re: Minimum MTU Mess

2016-09-09 Thread Jarod Wilson
On Thu, Sep 08, 2016 at 03:24:13AM +0200, Andrew Lunn wrote:
> > This is definitely going to require a few passes... (Working my way
> > through every driver with an ndo_change_mtu wired up right now to
> > see just how crazy this might get).
> 
> It might be something Coccinelle can help you with. Try describing the
> transformation you want to do, to their mailing list, and they might
> come up with a script for you.

From looking everything over, I'd be very surprised if they could. The
places where things need changing vary quite wildly by driver, but I've
actually got a full set of compiling changes with a cumulative diffstat
of:

 153 files changed, 599 insertions(+), 1002 deletions(-)

Actually breaking this up into easily digestable/mergeable chunks is going
to be kind of entertaining... Suggestions welcomed on that. First up is
obviously the core change, which touches just net/ethernet/eth.c,
net/core/dev.c, include/linux/netdevice.h and
include/uapi/linux/if_ether.h, and should let existing code continue to
Just Work(tm), though devices using ether_setup() that had no MTU range
checking (or one or the other missing) will wind up with new bounds.

For the most part, after the initial patch, very few of the others
would have any direct interaction with any others, so they could all
be singletons, or small batches per-vendor, or whatever.

Full diffstat for the aid of discussion on how to break it up:

 drivers/char/pcmcia/synclink_cs.c  |  1 -
 drivers/firewire/net.c | 14 ++---
 drivers/infiniband/hw/nes/nes.c|  1 -
 drivers/infiniband/hw/nes/nes.h|  4 +-
 drivers/infiniband/hw/nes/nes_nic.c|  7 +--
 drivers/misc/sgi-xp/xpnet.c| 21 ++--
 drivers/net/ethernet/agere/et131x.c|  7 +--
 drivers/net/ethernet/altera/altera_tse.h   |  1 -
 drivers/net/ethernet/altera/altera_tse_main.c  | 12 ++---
 drivers/net/ethernet/amd/amd8111e.c|  5 +-
 drivers/net/ethernet/atheros/alx/hw.h  |  1 -
 drivers/net/ethernet/atheros/alx/main.c|  9 +---
 drivers/net/ethernet/atheros/atl1c/atl1c_main.c| 41 +-
 drivers/net/ethernet/atheros/atl1e/atl1e_main.c| 11 ++--
 drivers/net/ethernet/atheros/atlx/atl1.c   | 15 +++---
 drivers/net/ethernet/atheros/atlx/atl2.c   | 14 +++--
 drivers/net/ethernet/broadcom/b44.c|  5 +-
 drivers/net/ethernet/broadcom/bcm63xx_enet.c   | 30 +++
 drivers/net/ethernet/broadcom/bnx2.c   |  8 ++-
 drivers/net/ethernet/broadcom/bnx2.h   |  6 +--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h|  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c|  8 +--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c   | 22 +++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   |  4 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  7 +--
 drivers/net/ethernet/broadcom/tg3.c|  7 +--
 drivers/net/ethernet/brocade/bna/bnad.c|  7 +--
 drivers/net/ethernet/cadence/macb.c| 17 +++---
 drivers/net/ethernet/calxeda/xgmac.c   | 18 ++-
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 15 ++
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  2 +-
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c   |  5 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 10 ++--
 drivers/net/ethernet/chelsio/cxgb/cxgb2.c  |  2 -
 drivers/net/ethernet/cisco/enic/enic_main.c|  7 +--
 drivers/net/ethernet/cisco/enic/enic_res.h |  2 +-
 drivers/net/ethernet/dlink/dl2k.c  | 22 ++--
 drivers/net/ethernet/dlink/sundance.c  |  6 ++-
 drivers/net/ethernet/freescale/gianfar.c   |  9 ++--
 drivers/net/ethernet/hisilicon/hns/hns_enet.c  |  4 --
 drivers/net/ethernet/ibm/ehea/ehea_main.c  | 13 ++---
 drivers/net/ethernet/ibm/emac/core.c   |  7 +--
 drivers/net/ethernet/intel/e100.c  |  9 
 drivers/net/ethernet/intel/e1000/e1000_main.c  | 12 ++---
 drivers/net/ethernet/intel/e1000e/netdev.c | 14 +++--
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c| 15 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c| 10 ++--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  8 +--
 drivers/net/ethernet/intel/igb/e1000_defines.h |  3 +-
 drivers/net/ethernet/intel/igb/igb_main.c  | 16 ++
 drivers/net/ethernet/intel/igbvf/defines.h |  3 +-
 drivers/net/ethernet/intel/igbvf/netdev.c  | 14 ++---
 drivers/net/ethernet/intel/ixgb/ixgb_main.c| 16 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 11 ++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 33 ++--
 drivers/net/ethernet/marvell/mvneta.c  | 36 -
 drivers/net/ethernet/marvell/mvpp2.c   | 36 -
 

[PATCH net-next] Revert "hv_netvsc: make inline functions static"

2016-09-09 Thread sthemmin
From: Stephen Hemminger 

These functions are used by other code in the misc-next tree.

This reverts commit 30d1de08c87ddde6f73936c3350e7e153988fe02.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c | 85 +
 include/linux/hyperv.h  | 84 
 2 files changed, 85 insertions(+), 84 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 2a9ccc4..ff05b9b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -34,89 +34,6 @@
 #include "hyperv_net.h"
 
 /*
- * An API to support in-place processing of incoming VMBUS packets.
- */
-#define VMBUS_PKT_TRAILER  8
-
-static struct vmpacket_descriptor *
-get_next_pkt_raw(struct vmbus_channel *channel)
-{
-   struct hv_ring_buffer_info *ring_info = &channel->inbound;
-   u32 read_loc = ring_info->priv_read_index;
-   void *ring_buffer = hv_get_ring_buffer(ring_info);
-   struct vmpacket_descriptor *cur_desc;
-   u32 packetlen;
-   u32 dsize = ring_info->ring_datasize;
-   u32 delta = read_loc - ring_info->ring_buffer->read_index;
-   u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta);
-
-   if (bytes_avail_toread < sizeof(struct vmpacket_descriptor))
-   return NULL;
-
-   if ((read_loc + sizeof(*cur_desc)) > dsize)
-   return NULL;
-
-   cur_desc = ring_buffer + read_loc;
-   packetlen = cur_desc->len8 << 3;
-
-   /*
-* If the packet under consideration is wrapping around,
-* return failure.
-*/
-   if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1))
-   return NULL;
-
-   return cur_desc;
-}
-
-/*
- * A helper function to step through packets "in-place"
- * This API is to be called after each successful call
- * get_next_pkt_raw().
- */
-static void put_pkt_raw(struct vmbus_channel *channel,
-   struct vmpacket_descriptor *desc)
-{
-   struct hv_ring_buffer_info *ring_info = &channel->inbound;
-   u32 read_loc = ring_info->priv_read_index;
-   u32 packetlen = desc->len8 << 3;
-   u32 dsize = ring_info->ring_datasize;
-
-   BUG_ON((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize);
-
-   /*
-* Include the packet trailer.
-*/
-   ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER;
-}
-
-/*
- * This call commits the read index and potentially signals the host.
- * Here is the pattern for using the "in-place" consumption APIs:
- *
- * while (get_next_pkt_raw() {
- * process the packet "in-place";
- * put_pkt_raw();
- * }
- * if (packets processed in place)
- * commit_rd_index();
- */
-static void commit_rd_index(struct vmbus_channel *channel)
-{
-   struct hv_ring_buffer_info *ring_info = &channel->inbound;
-   /*
-* Make sure all reads are done before we update the read index since
-* the writer may start writing to the read area once the read index
-* is updated.
-*/
-   virt_rmb();
-   ring_info->ring_buffer->read_index = ring_info->priv_read_index;
-
-   if (hv_need_to_signal_on_read(ring_info))
-   vmbus_set_event(channel);
-}
-
-/*
  * Switch the data path from the synthetic interface to the VF
  * interface.
  */
@@ -840,7 +757,7 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device 
*net_device,
return msg_size;
 }
 
-static int netvsc_send_pkt(
+static inline int netvsc_send_pkt(
struct hv_device *device,
struct hv_netvsc_packet *packet,
struct netvsc_device *net_device,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index b01c8c3..5df444b 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1429,4 +1429,88 @@ static inline bool hv_need_to_signal_on_read(struct hv_ring_buffer_info *rbi)
return false;
 }
 
+/*
+ * An API to support in-place processing of incoming VMBUS packets.
+ */
+#define VMBUS_PKT_TRAILER  8
+
+static inline struct vmpacket_descriptor *
+get_next_pkt_raw(struct vmbus_channel *channel)
+{
+   struct hv_ring_buffer_info *ring_info = &channel->inbound;
+   u32 read_loc = ring_info->priv_read_index;
+   void *ring_buffer = hv_get_ring_buffer(ring_info);
+   struct vmpacket_descriptor *cur_desc;
+   u32 packetlen;
+   u32 dsize = ring_info->ring_datasize;
+   u32 delta = read_loc - ring_info->ring_buffer->read_index;
+   u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta);
+
+   if (bytes_avail_toread < sizeof(struct vmpacket_descriptor))
+   return NULL;
+
+   if ((read_loc + sizeof(*cur_desc)) > dsize)
+   return NULL;
+
+   cur_desc = ring_buffer + read_loc;
+   packetlen = cur_desc->len8 << 3;
+
+   /*
+* If the packet under consideration is wrapping around,
+* return failure.
+*/
+   

Re: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz

2016-09-09 Thread Sergei Shtylyov

On 09/07/2016 09:57 PM, Chris Brandt wrote:


Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers
were not documented and left out of the driver for RZ/A making the CAM
feature non-operational.
Additionally, when the offset values for POST1-4 are left blank, the driver
attempts to set them using an offset of 0x which can cause a memory
corruption or panic.

This patch fixes the panic and properly enables CAM.

Reported-by: Daniel Palmer 
Signed-off-by: Chris Brandt 
---
v2:
* POST registers really do exist, so just add them
---
 drivers/net/ethernet/renesas/sh_eth.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 1f8240a..440ae27 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c

[...]

@@ -2781,6 +2786,8 @@ static void sh_eth_tsu_init(struct sh_eth_private *mdp)
 {
if (sh_eth_is_rz_fast_ether(mdp)) {
sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */
+   sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL,
+TSU_FWSLC);/* Enable POST registers */
return;
}


   Wait, don't you also need to write 0s to the POST registers like done at 
the end of this function?


MBR, Sergei



Re: [PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz

2016-09-09 Thread Sergei Shtylyov

Hello.

On 09/07/2016 09:57 PM, Chris Brandt wrote:


Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers
were not documented and left out of the driver for RZ/A making the CAM
feature non-operational.
Additionally, when the offset values for POST1-4 are left blank, the driver
attempts to set them using an offset of 0x which can cause a memory
corruption or panic.


   You didn't really fix the root cause here...


This patch fixes the panic and properly enables CAM.

Reported-by: Daniel Palmer 
Signed-off-by: Chris Brandt 
---
v2:
* POST registers really do exist, so just add them


Acked-by: Sergei Shtylyov 

MBR, Sergei



Re: [PATCH 7/8] sctp: use IS_ENABLED() instead of checking for built-in or module

2016-09-09 Thread Neil Horman
On Fri, Sep 09, 2016 at 08:43:19AM -0400, Javier Martinez Canillas wrote:
> The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
> built-in or as a module, use that macro instead of open coding the same.
> 
> Using the macro makes the code more readable by helping abstract away some
> of the Kconfig built-in and module enable details.
> 
> Signed-off-by: Javier Martinez Canillas 
> ---
> 
>  net/sctp/auth.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 912eb1685a5d..f99d4855d3de 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -48,7 +48,7 @@ static struct sctp_hmac sctp_hmac_list[SCTP_AUTH_NUM_HMACS] = {
>   /* id 2 is reserved as well */
>   .hmac_id = SCTP_AUTH_HMAC_ID_RESERVED_2,
>   },
> -#if defined (CONFIG_CRYPTO_SHA256) || defined (CONFIG_CRYPTO_SHA256_MODULE)
> +#if IS_ENABLED(CONFIG_CRYPTO_SHA256)
>   {
>   .hmac_id = SCTP_AUTH_HMAC_ID_SHA256,
>   .hmac_name = "hmac(sha256)",
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman 



Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail

2016-09-09 Thread Xin Long
> I don't know, I still don't feel safe about it.  I agree the socket lock keeps
> the state from changing during a single transmission, which makes the use case
> you are focused on correct.
ok, :-)

>
> That said, have you considered the retransmit case?  That is to say, if you
> queue and flush the outq, and some packets fail delivery, and in the time
> between the intial send and the expiration of the RTX timer (during which the
> socket lock will have been released), an event may occur which changes the
> transport state, which will then be ignored with your patch.
Sorry, I'm not sure if I got it.

You mean "during which an event changes q->asoc->state", right?

This patch removes the check of q->asoc->state in sctp_outq_tail().

sctp_outq_tail() is called for data only in:
sctp_primitive_SEND -> sctp_do_sm -> sctp_cmd_send_msg ->
sctp_cmd_interpreter -> sctp_cmd_send_msg() -> sctp_outq_tail()

Before calling sctp_primitive_SEND, the sock lock is held first.
Then sctp_primitive_SEND chooses FUNC according to:

#define TYPE_SCTP_PRIMITIVE_SEND  {


If asoc->state is unavailable, FUNC can't be sctp_cmd_send_msg but
sctp_sf_error_closed/sctp_sf_error_shutdown, so sctp_outq_tail can't
be called, either.
I mean sctp_primitive_SEND already does the same check on asoc->state.

So the check in sctp_outq_tail is redundant.




>
> Neil
>


Re: [PATCH net-next] macsec: set network devtype

2016-09-09 Thread Sabrina Dubroca
2016-09-08, 17:24:07 -0700, David Miller wrote:
> From: Stephen Hemminger 
> Date: Wed, 7 Sep 2016 14:07:32 -0700
> 
> > The netdevice type structure for macsec was being defined but never used.
> > To set the network device type the macro SET_NETDEV_DEVTYPE must be called.
> > Compile tested only, I don't use macsec.
> > 
> > Signed-off-by: Stephen Hemminger 
> 
> Sabrina, please review.
> 
> Thanks.

Sorry for the delay. LGTM:

Acked-by: Sabrina Dubroca 

-- 
Sabrina


Re: [iproute PATCH] macsec: fix input range of 'icvlen' parameter

2016-09-09 Thread Sabrina Dubroca
2016-09-09, 16:02:22 +0200, Davide Caratti wrote:
> the maximum possible ICV length in a MACsec frame is 16 octets, not 32:
> fix get_icvlen() accordingly, so that a proper error message is displayed
> in case input 'icvlen' is greater than 16.
> 
> Signed-off-by: Davide Caratti 

Acked-by: Sabrina Dubroca 

-- 
Sabrina


Re: [PATCH net] net_sched: act_mirred: full rcu conversion

2016-09-09 Thread John Fastabend
On 16-09-08 10:26 PM, Cong Wang wrote:
> On Thu, Sep 8, 2016 at 8:51 AM, Eric Dumazet  wrote:
>> On Thu, 2016-09-08 at 08:47 -0700, John Fastabend wrote:
>>
>>> Works for me. FWIW I find this plenty straightforward and don't really
>>> see the need to make the hash table itself rcu friendly.
>>>
>>> Acked-by: John Fastabend 
>>>
>>
>> Yes, it seems this hash table is used in control path, with RTNL held
>> anyway.
> 
> Seriously? You never read hashtable in fast path?? I think you need
> to wake up.
> 

But the actions use refcnt'ing and should never be decremented to zero
as long as they can still be referenced by an active filter. If each
action handles its parameters like mirred/gact then I don't see why its
necessary.

I believe, though, that the refcnt needs to be fixed a bit, most
likely by making it atomic. I originally assumed it was protected by
the RTNL lock, but because it's getting decremented from an rcu
callback this is not true.

.John




Re: [PATCH net-next V7 4/4] net/sched: Introduce act_tunnel_key

2016-09-09 Thread John Fastabend
On 16-09-09 06:19 AM, Eric Dumazet wrote:
> On Thu, 2016-09-08 at 22:30 -0700, Cong Wang wrote:
>> On Thu, Sep 8, 2016 at 9:15 AM, John Fastabend  
>> wrote:
>>>
>>> This should be rtnl_derefence(t->params) and drop the read_lock/unlock
>>> pair. This is always called with RTNL lock unless you have a path I'm
>>> not seeing.
>>
>> You missed the previous discussion on V6, John.
>>
>> BTW, you really should follow the whole discussion instead of
>> jumping in the middle, like what you did for my patchset.
>> I understand you are eager to comment, but please don't waste
>> others' time in this way Please.
> 
> But John is right, and he definitely is welcome to give his feedback
> even at V13 if he wants to.
> 
> tunnel_key_dump() is called with RTNL being held.
> 
> Take a deep breath, vacations, and come back when you are relaxed.
> 
> Thanks.
> 
> 

Also, the v6 discussion was around the cleanup() callback; I see
nothing about the dump() callbacks. And if there was, it wasn't fixed,
so it should be resolved.

Anyways, Dave/Hadar, feel free to submit a follow-up patch or v8; it
doesn't much matter to me, as noted in the original post.

.John


Re: [RFC Patch net-next 5/6] net_sched: use rcu in fast path

2016-09-09 Thread John Fastabend
On 16-09-08 10:54 PM, Cong Wang wrote:
> On Thu, Sep 8, 2016 at 8:49 AM, John Fastabend  
> wrote:
>> Agreed not sure why you would ever want to do a late binding and
>> replace on a tc_mirred actions. But it is supported...
> 
> I will let Jamal teach you on this, /me is really tired of explaining
> things to you John.
> 

This was a meta-comment on the use case for doing this with mirred
action. Not necessarily about the patch itself.

I was actually curious where this happens in practice. The only thing
I can think of is that your external logging box moved, so you need to
send out another port. Is there any open source software that manages
'tc' like this? If so, I would like to read it.

So do you know of any?

.John


Re: [PATCH net-next v5] gso: Support partial splitting at the frag_list pointer

2016-09-09 Thread Alexander Duyck
On Fri, Sep 9, 2016 at 12:25 AM, Steffen Klassert
 wrote:
> Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
> gro may build buffers with a frag_list. This can hurt forwarding
> because most NICs can't offload such packets, they need to be
> segmented in software. This patch splits buffers with a frag_list
> at the frag_list pointer into buffers that can be TSO offloaded.
>
> Signed-off-by: Steffen Klassert 
> ---
>
> Changes since v1:
>
> - Use the assumption that all buffers in the chain excluding the last
>   contain the same amount of data.
>
> - Simplify some checks against gso partial.
>
> - Fix the generation of IP IDs.
>
> Changes since v2:
>
> - Merge common code of gso partial and frag_list pointer splitting.
>
> Changes since v3:
>
> - Fix the checks for doing frag_list pointer splitting.
>
> Changes since v4:
>
> - Whitespace fix.
> - Fix size calculations of the tail packet.
>
>  net/core/skbuff.c  | 51 +++---
>  net/ipv4/af_inet.c | 14 ++
>  net/ipv4/gre_offload.c |  6 --
>  net/ipv4/tcp_offload.c | 13 +++--
>  net/ipv4/udp_offload.c |  6 --
>  net/ipv6/ip6_offload.c |  5 -
>  6 files changed, 69 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3864b4b6..51e761a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3078,11 +3078,31 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,

<...>

> @@ -3090,6 +3110,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> partial_segs = 0;
> }
>
> +normal:
> headroom = skb_headroom(head_skb);
> pos = skb_headlen(head_skb);
>
> @@ -3281,21 +3302,29 @@ perform_csum_check:
>  */
> segs->prev = tail;
>
> -   /* Update GSO info on first skb in partial sequence. */
> if (partial_segs) {
> +   struct sk_buff *iter;
> int type = skb_shinfo(head_skb)->gso_type;
> +   unsigned short gso_size = skb_shinfo(head_skb)->gso_size;
>
> /* Update type to add partial and then remove dodgy if set */
> -   type |= SKB_GSO_PARTIAL;
> +   type |= (features & NETIF_F_GSO_PARTIAL) / NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL;
> type &= ~SKB_GSO_DODGY;
>
> /* Update GSO info and prepare to start updating headers on
>  * our way back down the stack of protocols.
>  */
> -   skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size;
> -   skb_shinfo(segs)->gso_segs = partial_segs;
> -   skb_shinfo(segs)->gso_type = type;
> -   SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
> +   for (iter = segs; iter; iter = iter->next) {
> +   skb_shinfo(iter)->gso_size = gso_size;
> +   skb_shinfo(iter)->gso_segs = partial_segs;
> +   skb_shinfo(iter)->gso_type = type;
> +   SKB_GSO_CB(iter)->data_offset = skb_headroom(iter) + doffset;
> +   }
> +
> +   if (tail->len <= gso_size)
> +   skb_shinfo(tail)->gso_size = 0;

Actually we need to do tail->len - doffset up here as well.  The
gso_size value reflects the size of the data segment, and tail->len is
the size of the entire frame so we have to remove the size of the
headers to make the comparison accurate.

> +   else if (tail != segs)
> +   skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size);
> }
>
> /* Following permits correct backpressure, for protocols


Re: [PATCH RFC 00/11] mlx5 RX refactoring and XDP support

2016-09-09 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
> Hi All,
>
> This patch set introduces some important data path RX refactoring
> addressing mlx5e memory allocation/management improvements and XDP support.
>
> Submitting as RFC since we would like to get an early feedback, while we
> continue reviewing testing and complete the performance analysis in house.
>

Hi,

I am going to be out of office for the whole next week with a random
mail access.
I will do my best to be as active as possible, but in the meanwhile,
Tariq and Or will handle any questions
regarding this series or mlx5 in general while I am away.

Thanks,
Saeed.


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-09 Thread Saeed Mahameed
On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
 wrote:
> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I'm sorry but I have a problem with this patch!
>
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.
>
>> Looking at this patch, I want to bring up a fundamental architectural
>> concern with the development direction of XDP transmit.
>>
>>
>> What you are trying to implement, with delaying the doorbell, is
>> basically TX bulking for TX_XDP.
>>
>>  Why not implement a TX bulking interface directly instead?!?
>>
>> Yes, the tailptr/doorbell is the most costly operation, but why not
>> also take advantage of the benefits of bulking for other parts of the
>> code? (benefit is smaller, by every cycles counts in this area)
>>
>> This whole XDP exercise is about avoiding having a transaction cost per
>> packet, that reads "bulking" or "bundling" of packets, where possible.
>>
>>  Lets do bundling/bulking from the start!

Jesper, what we did here is also bulking: instead of bulking in a
temporary list in the driver, we list the packets in the HW and, once
done, we transmit all at once via the xdp_doorbell indication.

I agree with you that we can take advantage and improve the icache by
bulking first in software and then queueing all at once in the HW,
then ringing one doorbell.

But I also agree with Alexei that this will introduce extra
pointer/list handling in the driver, and we need to do the comparison
between both approaches before we decide which is better.

This should be marked as future work rather than done from the start.

>
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.
>
>> The reason behind the xmit_more API is that we could not change the
>> API of all the drivers.  And we found that calling an explicit NDO
>> flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
>> would hit the common single packet use-case.
>>
>> It should be really easy to build a bundle of packets that need XDP_TX
>> action, especially given you only have a single destination "port".
>> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
>
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.
>
>> In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> out other interfaces.  I also want bulk transmit from day-1 here.  It
>> is slightly more tricky to sort packets for multiple outgoing
>> interfaces efficiently in the pool loop.
>
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.
>
>> But the mSwitch[1] article actually already solved this destination
>> sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> understanding the next steps, for a smarter data structure, when
>> starting to have more TX "ports".  And perhaps align your single
>> XDP_TX destination data structure to this future development.
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
>
>> --Jesper
>> (top post)
>
> since when it's ok to top post?
>
>> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  
>> wrote:
>>
>> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >
>> > Here we introduce a xmit more like mechanism that will queue up more
>> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >
>> > Once RX napi budget is consumed and we exit napi RX loop, we will
>> > flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > XDP forward packet rate:
>> >
>> > Comparing XDP with and w/o xmit more (bulk transmit):
>> >
>> > Streams XDP TX   XDP TX (xmit more)
>> > ---
>> > 1   4.90Mpps  7.50Mpps
>> > 2   9.50Mpps  14.8Mpps
>> > 4   16.5Mpps  
