Re: [PATCH net-next 3/4] net/tc: introduce TC_ACT_MIRRED.

2018-07-23 Thread Paolo Abeni
Hi,

On Mon, 2018-07-23 at 14:12 -0700, Cong Wang wrote:
> On Fri, Jul 20, 2018 at 2:54 AM Paolo Abeni  wrote:
> > Note this is what already happens with TC_ACT_REDIRECT: currently the
> > user space uses it freely, even if only {cls,act}_bpf can return such
> > value in a meaningful way, and only from the ingress and the egress
> > hooks.
>
> Yes, my question is why do we give user such a freedom?
> 
> In other words, what do you want users to choose here? To scrub or not
> to scrub? To clone or not to clone?
> 
> From my understanding of your whole patchset, your goal is to get rid
> of clone, and users definitely don't care about clone or not clone for
> redirections, this is why I insist it doesn't need to be visible to user.

Thank you for your kind reply!

No, my intention is not to expose another option to user space. I
added the additional tcfa_action value in response to concerns raised
against the v1 version of this series (it changed the act_mirred
behaviour and possibly broke some use cases).

When assembling the v2 I did not implement the (deserved) isolation
from user space because of the already existing TC_ACT_REDIRECT: its
current implementation fooled me into thinking such considerations
were not relevant.

> If your goal is not just skipping clone, but also, let's say, scrub or not
> scrub, then it should be visible to users. However, I don't see why
> users would care about scrubbing: they would have to understand what
> scrub is in the first place, and it is a purely kernel-internal behavior.

I agree to hide TC_ACT_REINJECT, and any choice about scrubbing, from
user space, as per the code chunk I posted before. I'll send a v3
implementing that scheme.

Cheers,

Paolo




[PATCH net-next] net/tls: Removed redundant checks for non-NULL

2018-07-23 Thread Vakul Garg
Removed checks against non-NULL before calling kfree_skb() and
crypto_free_aead(). These functions are safe to call with NULL
as an argument.
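
The NULL tolerance these helpers rely on can be sketched in a userspace
analogue (the struct and counter here are hypothetical stand-ins; the real
helpers are kfree_skb() and crypto_free_aead()):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource type standing in for sk_buff / crypto aead. */
struct resource { int id; };

static int frees;  /* counts real frees, for demonstration only */

/* Kernel-style free helper: it tolerates NULL itself, so callers do
 * not need an "if (p)" guard before every call site. */
static void resource_free(struct resource *r)
{
	if (!r)         /* the NULL check lives inside the helper... */
		return;
	frees++;        /* ...so call sites stay unconditional */
	free(r);
}
```

With helpers shaped like this, guards such as `if (ctx->aead_send)` before
the call are pure noise, which is what the patch removes.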

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0c2d029c9d4c..ef445478239c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1044,8 +1044,7 @@ void tls_sw_free_resources_tx(struct sock *sk)
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 
-   if (ctx->aead_send)
-   crypto_free_aead(ctx->aead_send);
+   crypto_free_aead(ctx->aead_send);
tls_free_both_sg(sk);
 
kfree(ctx);
@@ -1057,10 +1056,8 @@ void tls_sw_release_resources_rx(struct sock *sk)
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
if (ctx->aead_recv) {
-   if (ctx->recv_pkt) {
-   kfree_skb(ctx->recv_pkt);
-   ctx->recv_pkt = NULL;
-   }
+   kfree_skb(ctx->recv_pkt);
+   ctx->recv_pkt = NULL;
crypto_free_aead(ctx->aead_recv);
strp_stop(&ctx->strp);
write_lock_bh(&sk->sk_callback_lock);
-- 
2.13.6



Re: [PATCH rdma-next v2 0/8] Support mlx5 flow steering with RAW data

2018-07-23 Thread Leon Romanovsky
On Mon, Jul 23, 2018 at 08:42:36PM -0600, Jason Gunthorpe wrote:
> On Mon, Jul 23, 2018 at 03:25:04PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > Changelog:
> > v1->v2:
> >  * Fix matcher to use the correct size.
> >  * Rephrase commit log of the first patch.
> > v0->v1:
> >  * Fixed ADD_UVERBS_ATTRIBUTES_SIMPLE macro to pass the real address.
> >  * Replaced UA_ALLOC_AND_COPY with a regular copy_from
> >  * Added UVERBS_ATTR_NO_DATA new macro for cleaner code.
> >  * Used ib_dev from uobj when it exists.
> >  * ib_is_destroy_retryable was replaced by ib_destroy_usecnt
> >
> > From Yishai:
> >
> > This series introduces vendor create and destroy flow methods on the
> > uverbs flow object by using the KABI infra-structure.
> >
> > It's done in a way that enables the driver to get its specific device
> > attributes as raw data matching its underlay specification while still
> > using the generic ib_flow object for cleanup and code sharing.
> >
> > In addition, a specific mlx5 matcher object and its create/destroy
> > methods were introduced. This object matches the underlay flow steering
> > mask specification and is used as part of mlx5 create flow input data.
> >
> > This series supports IB_QP/TIR as its flow steering destination, as
> > applicable today via the ib_create_flow API; however, it also adds an
> > option to work with a DEVX object, whose destination can be either a
> > TIR or a flow table.
> >
> > A few changes were made in the mlx5 core layer to support forward
> > compatibility for the device specification raw data and to support
> > flow tables when the DEVX destination is used.
> >
> > As part of this series the default IB destroy handler
> > (i.e. uverbs_destroy_def_handler()) was exposed from IB core to be
> > used by the drivers and existing code was refactored to use it.
> >
> > Thanks
> >
> > Yishai Hadas (8):
> >   net/mlx5: Add forward compatible support for the FTE match data
> >   net/mlx5: Add support for flow table destination number
> >   IB/mlx5: Introduce flow steering matcher object
> >   IB: Consider ib_flow creation by the KABI infrastructure
> >   IB/mlx5: Introduce vendor create and destroy flow methods
> >   IB/mlx5: Support adding flow steering rule by raw data
> >   IB/mlx5: Add support for a flow table destination
> >   IB/mlx5: Expose vendor flow trees
>
> This seems fine to me. Can you send the mlx5 shared branch for the
> first two patches?

I applied the first two patches, with Acked-by from Saeed, to mlx5-next:

664000b6bb43 net/mlx5: Add support for flow table destination number
2aada6c0c96e net/mlx5: Add forward compatible support for the FTE match data

Thanks

>
> Thanks,
> Jason




[PATCH net-next] net/tls: Do not call msg_data_left() twice

2018-07-23 Thread Vakul Garg
In function tls_sw_sendmsg(), msg_data_left() needs to be called only
once. The second invocation of msg_data_left() for assigning variable
try_to_copy can be removed and merged with the first one.
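
The merged call-and-test the patch introduces can be sketched with a toy
cursor type (hypothetical; the kernel code uses struct msghdr and the real
msg_data_left()):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for msg_data_left(): bytes remaining in a buffer cursor. */
struct msg_cursor { size_t len, off; };

static size_t msg_data_left(const struct msg_cursor *m)
{
	return m->len - m->off;
}

/* Consume the buffer in chunks of at most 'room' bytes.  As in the
 * patch, a single msg_data_left() call both terminates the loop and
 * seeds try_to_copy, instead of being evaluated twice per iteration. */
static int consume(struct msg_cursor *m, size_t room)
{
	size_t try_to_copy;
	int iterations = 0;

	while ((try_to_copy = msg_data_left(m))) {
		if (try_to_copy > room)
			try_to_copy = room;
		m->off += try_to_copy;
		iterations++;
	}
	return iterations;
}
```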

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0c2d029c9d4c..fd51ce65b99c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -377,7 +377,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
goto send_end;
}
 
-   while (msg_data_left(msg)) {
+   while ((try_to_copy = msg_data_left(msg))) {
if (sk->sk_err) {
ret = -sk->sk_err;
goto send_end;
@@ -385,7 +385,6 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
 
orig_size = ctx->sg_plaintext_size;
full_record = false;
-   try_to_copy = msg_data_left(msg);
record_room = TLS_MAX_PAYLOAD_SIZE - ctx->sg_plaintext_size;
if (try_to_copy >= record_room) {
try_to_copy = record_room;
-- 
2.13.6



Re: [PATCH 1/4] MIPS: lantiq: Do not enable IRQs in dma open

2018-07-23 Thread David Miller
From: Hauke Mehrtens 
Date: Tue, 24 Jul 2018 07:32:27 +0200

> 
> 
> On 07/24/2018 02:19 AM, Paul Burton wrote:
>> Hi Hauke,
>> 
>> On Sat, Jul 21, 2018 at 09:13:55PM +0200, Hauke Mehrtens wrote:
>>> When a DMA channel is opened, the IRQ should not get activated
>>> automatically; this allows the driver to pull data out manually without
>>> the help of interrupts. This is needed for a workaround in the vrx200
>>> Ethernet driver.
>>>
>>> Signed-off-by: Hauke Mehrtens 
>>> ---
>>>  arch/mips/lantiq/xway/dma.c| 1 -
>>>  drivers/net/ethernet/lantiq_etop.c | 1 +
>>>  2 files changed, 1 insertion(+), 1 deletion(-)
>> 
>> If you'd like this to go via the netdev tree to keep it with the rest of
>> the series:
>> 
>> Acked-by: Paul Burton 
> 
> Thanks, I also prefer that this goes through netdev.

Please be sure to repost your series with Paul's ACK added.

Also, in the patch postings and cover letter, put "net-next" in
the Subject line so that the target tree is clear, like:

Subject: [PATCH net-next 1/4] MIPS: ...


Thank you.


Re: [PATCH 1/4] MIPS: lantiq: Do not enable IRQs in dma open

2018-07-23 Thread Hauke Mehrtens



On 07/24/2018 02:19 AM, Paul Burton wrote:
> Hi Hauke,
> 
> On Sat, Jul 21, 2018 at 09:13:55PM +0200, Hauke Mehrtens wrote:
>> When a DMA channel is opened, the IRQ should not get activated
>> automatically; this allows the driver to pull data out manually without
>> the help of interrupts. This is needed for a workaround in the vrx200
>> Ethernet driver.
>>
>> Signed-off-by: Hauke Mehrtens 
>> ---
>>  arch/mips/lantiq/xway/dma.c| 1 -
>>  drivers/net/ethernet/lantiq_etop.c | 1 +
>>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> If you'd like this to go via the netdev tree to keep it with the rest of
> the series:
> 
> Acked-by: Paul Burton 

Thanks, I also prefer that this goes through netdev.

> Though I'd be happier if we didn't have DMA code seemingly used only by
> an ethernet driver in arch/mips/ :)

There are also some out-of-tree drivers that use this DMA code. It
should probably be converted to a DMA channel driver, but that is not
very high on my todo list.

Hauke


Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-23 Thread Hauke Mehrtens
Hi Paul,

On 07/24/2018 02:34 AM, Paul Burton wrote:
> Hi Hauke,
> 
> On Sat, Jul 21, 2018 at 09:13:57PM +0200, Hauke Mehrtens wrote:
>> diff --git a/arch/mips/lantiq/xway/sysctrl.c 
>> b/arch/mips/lantiq/xway/sysctrl.c
>> index e0af39b33e28..c704312ef7d5 100644
>> --- a/arch/mips/lantiq/xway/sysctrl.c
>> +++ b/arch/mips/lantiq/xway/sysctrl.c
>> @@ -536,7 +536,7 @@ void __init ltq_soc_init(void)
>>  clkdev_add_pmu(NULL, "ahb", 1, 0, PMU_AHBM | PMU_AHBS);
>>  
>>  clkdev_add_pmu("1da0.usif", "NULL", 1, 0, PMU_USIF);
>> -clkdev_add_pmu("1e108000.eth", NULL, 0, 0,
>> +clkdev_add_pmu("1e10b308.eth", NULL, 0, 0,
>>  PMU_SWITCH | PMU_PPE_DPLUS | PMU_PPE_DPLUM |
>>  PMU_PPE_EMA | PMU_PPE_TC | PMU_PPE_SLL01 |
>>  PMU_PPE_QSB | PMU_PPE_TOP);
> 
> Is this intentional?

Yes

> Why is it needed? Was the old address wrong? Does it change anything
> functionally?

The Ethernet driver is newly added in these patches; this entry was not
used before.
The entry has to match the device name, and the device is now named
1e10b308.eth because the driver only uses the register range of the PMAC
and not that of the complete switch core. This is different from the old
driver used in OpenWrt.

The lantiq clock code should really be converted to the common clock
framework, so we can define this in the device tree and no longer need
this code.
I am planning to do this, but want to wait until the xrx500 clk code from
these patches is in mainline:
https://www.linux-mips.org/archives/linux-mips/2018-06/msg00092.html
There are already some more recent versions available internally.

> If it is needed it seems like a separate change - unless there's some
> reason it's tied to adding this driver?
> 
> Should this really apply only to the lantiq,vr9 case or also to the
> similar lantiq,grx390 & lantiq,ar10 paths?

The AR10 has a similar switch core, but I haven't tested this Ethernet
driver on that device; there is a good chance it works out of the box
once sysctrl.c is adapted and the correct device tree is provided.
I do not know exactly what the grx390 SoC is; it is probably some
uncommon name for one of the Lantiq / Intel SoCs. I will have to look
it up.

> 
> Whatever the answers to these questions it would be good to include them
> in the commit message.

I will update the commit message for the v2.

Hauke


RE: [PATCH net-next] tls: Fix improper revert in zerocopy_from_iter

2018-07-23 Thread Vakul Garg



> -Original Message-
> From: Doron Roberts-Kedes [mailto:doro...@fb.com]
> Sent: Tuesday, July 24, 2018 3:50 AM
> To: David S . Miller 
> Cc: Dave Watson ; Vakul Garg
> ; Matt Mullins ;
> netdev@vger.kernel.org; Doron Roberts-Kedes 
> Subject: [PATCH net-next] tls: Fix improper revert in zerocopy_from_iter
> 
> The current code is problematic because the iov_iter is reverted and never
> advanced in the non-error case. This patch skips the revert in the non-error
> case. This patch also fixes the amount by which the iov_iter is reverted.
> Currently, iov_iter is reverted by size, which can be greater than the amount
> by which the iter was actually advanced.
> Instead, mimic the tx path which reverts by the difference before and after
> zerocopy_from_iter.
> 
> Fixes: 4718799817c5 ("tls: Fix zerocopy_from_iter iov handling")
> Signed-off-by: Doron Roberts-Kedes 
> ---
>  net/tls/tls_sw.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index
> 490f2bcc6313..2ea000baebf8 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -276,7 +276,7 @@ static int zerocopy_from_iter(struct sock *sk, struct
> iov_iter *from,
> int length, int *pages_used,
> unsigned int *size_used,
> struct scatterlist *to, int to_max_pages,
> -   bool charge, bool revert)
> +   bool charge)
>  {
>   struct page *pages[MAX_SKB_FRAGS];
> 
> @@ -327,8 +327,6 @@ static int zerocopy_from_iter(struct sock *sk, struct
> iov_iter *from,
>  out:
>   *size_used = size;
>   *pages_used = num_elem;
> - if (revert)
> - iov_iter_revert(from, size);
> 
>   return rc;
>  }
> @@ -431,7 +429,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr
> *msg, size_t size)
>   &ctx->sg_plaintext_size,
>   ctx->sg_plaintext_data,
>   ARRAY_SIZE(ctx->sg_plaintext_data),
> - true, false);
> + true);
>   if (ret)
>   goto fallback_to_reg_send;
> 
> @@ -811,6 +809,7 @@ int tls_sw_recvmsg(struct sock *sk,
>   likely(!(flags & MSG_PEEK)))  {
>   struct scatterlist sgin[MAX_SKB_FRAGS + 1];
>   int pages = 0;
> + int orig_chunk = chunk;
> 
>   zc = true;
>   sg_init_table(sgin, MAX_SKB_FRAGS + 1);
> @@ -820,9 +819,11 @@ int tls_sw_recvmsg(struct sock *sk,
>   err = zerocopy_from_iter(sk, &msg-
> >msg_iter,
>to_copy, &pages,
>&chunk, &sgin[1],
> -  MAX_SKB_FRAGS,
>   false, true);
> - if (err < 0)
> +  MAX_SKB_FRAGS,
>   false);
> + if (err < 0) {
> + iov_iter_revert(&msg->msg_iter,
> chunk - orig_chunk);
>   goto fallback_to_reg_recv;
> + }

This assumes that msg_iter gets advanced even if zerocopy_from_iter() fails.
It is easier, from a code readability perspective, if functions that fail do
not leave any side effects for the caller to clean up.
I suggest that zerocopy_from_iter() call iov_iter_revert() itself when it is
going to fail.
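
The suggested shape — the failing function reverting its own advance so the
caller sees no side effect — can be sketched with a toy iterator (the types
here are hypothetical stand-ins for struct iov_iter):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal iterator, standing in for struct iov_iter. */
struct iter { const char *buf; size_t pos, len; };

/* Copy up to 'want' bytes into dst, advancing the iterator.  On
 * failure (simulated here when fewer than 'want' bytes remain), revert
 * our own advance before returning: the caller then has nothing to
 * undo, which is the review suggestion above. */
static int copy_from_iter_strict(struct iter *it, char *dst, size_t want)
{
	size_t start = it->pos;
	size_t avail = it->len - it->pos;
	size_t chunk = avail < want ? avail : want;

	memcpy(dst, it->buf + it->pos, chunk);
	it->pos += chunk;                 /* advance */

	if (chunk < want) {               /* error path */
		it->pos = start;          /* revert by the amount advanced */
		return -1;
	}
	return 0;
}
```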

 

> 
>   err = decrypt_skb(sk, skb, sgin);
>   for (; pages > 0; pages--)
> --
> 2.17.1



[net-next v6 0/2] Minor code cleanup patches

2018-07-23 Thread Vakul Garg
This patch series improves tls_sw.c code by:

1) Using correct socket callback for flagging data availability.
2) Removing redundant variable assignments and wakeup callbacks.


Vakul Garg (2):
  net/tls: Use socket data_ready callback on record availability
  net/tls: Remove redundant variable assignments and wakeup

 net/tls/tls_sw.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

-- 
2.13.6



[net-next v6 1/2] net/tls: Use socket data_ready callback on record availability

2018-07-23 Thread Vakul Garg
On receipt of a complete tls record, use socket's saved data_ready
callback instead of state_change callback.
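
The callback choice can be sketched with toy types (hypothetical; in the
driver the saved pointer is ctx->saved_data_ready, captured when tls took
over the socket callbacks):

```c
#include <assert.h>
#include <stddef.h>

/* Toy socket with kernel-style callback pointers. */
struct sock;
typedef void (*sk_cb)(struct sock *sk);

struct sock {
	sk_cb sk_state_change;   /* connection state transitions */
	sk_cb sk_data_ready;     /* payload available to read */
	int   woken_for_data;
	int   woken_for_state;
};

struct rx_ctx {
	sk_cb saved_data_ready;  /* like the tls ctx->saved_data_ready */
};

/* On a complete record, wake readers via the saved data_ready callback
 * (data is available) rather than state_change (no state changed). */
static void queue_record(struct rx_ctx *ctx, struct sock *sk)
{
	ctx->saved_data_ready(sk);
}

static void data_ready_cb(struct sock *sk)   { sk->woken_for_data++; }
static void state_change_cb(struct sock *sk) { sk->woken_for_state++; }
```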

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0c2d029c9d4c..fee1240eff92 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1028,7 +1028,7 @@ static void tls_queue(struct strparser *strp, struct 
sk_buff *skb)
ctx->recv_pkt = skb;
strp_pause(strp);
 
-   strp->sk->sk_state_change(strp->sk);
+   ctx->saved_data_ready(strp->sk);
 }
 
 static void tls_data_ready(struct sock *sk)
-- 
2.13.6



[net-next v6 2/2] net/tls: Remove redundant variable assignments and wakeup

2018-07-23 Thread Vakul Garg
In function decrypt_skb_update(), the assignment to the tls receive context
variable 'decrypted' is redundant, as the same is done in function
tls_sw_recvmsg() after calling decrypt_skb_update(). Also, calling the
callback function to wake up processes sleeping on socket data availability
is useless, as decrypt_skb_update() is invoked from user processes only.
This patch cleans these up.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index fee1240eff92..6c71da7b147f 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -679,8 +679,6 @@ static int decrypt_skb_update(struct sock *sk, struct 
sk_buff *skb,
rxm->offset += tls_ctx->rx.prepend_size;
rxm->full_len -= tls_ctx->rx.overhead_size;
tls_advance_record_sn(sk, &tls_ctx->rx);
-   ctx->decrypted = true;
-   ctx->saved_data_ready(sk);
 
return err;
 }
-- 
2.13.6



Re: [net-next v5 3/3] net/tls: Remove redundant array allocation.

2018-07-23 Thread David Miller
From: Vakul Garg 
Date: Tue, 24 Jul 2018 04:43:55 +

> Can you still apply the other two patches in the series, or do I
> need to send them again separately?

When a change of any kind needs to be made to a patch series, you must
always resubmit the entire series.

Thank you.


Re: [net-next v5 3/3] net/tls: Remove redundant array allocation.

2018-07-23 Thread Vakul Garg
Hi Dave

Can you still apply the other two patches in the series, or do I need to
send them again separately?

Regards

Vakul



From: netdev-ow...@vger.kernel.org  on behalf of 
David Miller 
Sent: Tuesday, July 24, 2018 10:11:09 AM
To: davejwat...@fb.com
Cc: Vakul Garg; netdev@vger.kernel.org; bor...@mellanox.com; 
avia...@mellanox.com; doro...@fb.com
Subject: Re: [net-next v5 3/3] net/tls: Remove redundant array allocation.

From: Dave Watson 
Date: Mon, 23 Jul 2018 09:35:09 -0700

> I don't think this patch is safe as-is.  sgin_arr is a stack array of
> size MAX_SKB_FRAGS (+ overhead), while my read of skb_cow_data is that
> it walks the whole chain of skbs from skb->next, and can return any
> number of segments.  Therefore we need to heap allocate.  I think I
> copied the IPSEC code here.

Ok I see what you are saying.

So it means that, when a non-NULL sgout is passed into decrypt_skb(),
via decrypt_skb_update(), via tls_sw_recvmsg(), it is the zerocopy case
and you know that you only have page frags and no SKB frag list, right?

I agree with you that this change is therefore incorrect.


Re: [net-next v5 3/3] net/tls: Remove redundant array allocation.

2018-07-23 Thread David Miller
From: Dave Watson 
Date: Mon, 23 Jul 2018 09:35:09 -0700

> I don't think this patch is safe as-is.  sgin_arr is a stack array of
> size MAX_SKB_FRAGS (+ overhead), while my read of skb_cow_data is that
> it walks the whole chain of skbs from skb->next, and can return any
> number of segments.  Therefore we need to heap allocate.  I think I
> copied the IPSEC code here.

Ok I see what you are saying.

So it means that, when a non-NULL sgout is passed into decrypt_skb(),
via decrypt_skb_update(), via tls_sw_recvmsg(), it is the zerocopy case
and you know that you only have page frags and no SKB frag list, right?

I agree with you that this change is therefore incorrect.


Re: [PATCH net] sock: fix sg page frag coalescing in sk_alloc_sg

2018-07-23 Thread David Miller
From: Daniel Borkmann 
Date: Mon, 23 Jul 2018 22:37:54 +0200

> Current sg coalescing logic in sk_alloc_sg() (the latter is used by tls
> and sockmap) is not quite correct: we do fetch the previous sg entry,
> but the subsequent check (whether the refilled page frag from the
> socket is still the same as in the last entry, with prior offset and
> length matching the start of the current buffer) always compares the
> first sg list entry instead of the prior one.
> 
> Fixes: 3c4d7559159b ("tls: kernel TLS support")
> Signed-off-by: Daniel Borkmann 
> Acked-by: Dave Watson 

Applied and queued up for -stable, thanks Daniel.
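
A minimal sketch of the corrected check — comparing against the tail entry
rather than always sg[0] — using hypothetical types (the kernel operates on
struct scatterlist and page frags):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal scatterlist entry: a (page, offset, length) span. */
struct sg_ent { const void *page; size_t off, len; };

/* Decide whether a new buffer (page, off) can be merged into the tail
 * entry of the list.  The fix: compare against the *last* entry,
 * sg[n - 1], not unconditionally against sg[0]. */
static bool can_coalesce(const struct sg_ent *sg, size_t n,
			 const void *page, size_t off)
{
	const struct sg_ent *prev;

	if (n == 0)
		return false;
	prev = &sg[n - 1];              /* previous entry, not sg[0] */
	return prev->page == page && prev->off + prev->len == off;
}
```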


Re: [PATCH v5 net-next 0/3] rds: IPv6 support

2018-07-23 Thread David Miller
From: Ka-Cheong Poon 
Date: Mon, 23 Jul 2018 20:51:20 -0700

> This patch set adds IPv6 support to the kernel RDS and related
> modules.

Series applied.



Re: [PATCH v5 net-next 2/3] rds: Enable RDS IPv6 support

2018-07-23 Thread santosh.shilim...@oracle.com

On 7/23/18 8:51 PM, Ka-Cheong Poon wrote:

This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
connection requests.  RDS/RDMA/IB uses a private data (struct
rds_ib_connect_private) exchange between endpoints at RDS connection
establishment time to support RDMA. This private data exchange uses a
32 bit integer to represent an IP address. This needs to be changed in
order to support IPv6. A new private data struct
rds6_ib_connect_private is introduced to handle this. To ensure
backward compatibility, an IPv6-capable RDS stack uses another RDMA
listener port (RDS_CM_PORT) to accept IPv6 connections, and it
continues to use the original RDS_PORT for IPv4 RDS connections. When
it needs to communicate with an IPv6 peer, it uses RDS_CM_PORT to
send the connection setup request.
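
The dual-family listener behaviour described above corresponds, in
userspace terms, to an AF_INET6 socket with IPV6_V6ONLY cleared; a sketch,
not the RDS/TCP code itself:

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a listener that, like the RDS/TCP listener described above, is
 * an IPv6 endpoint but also accepts IPv4 connects (which then show up
 * as IPv4-mapped IPv6 addresses).  Port 0 lets the kernel pick one. */
static int open_dual_stack_listener(void)
{
	struct sockaddr_in6 sin6;
	int off = 0;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&sin6, 0, sizeof(sin6));
	sin6.sin6_family = AF_INET6;
	sin6.sin6_addr = in6addr_any;   /* [::], every local address */
	sin6.sin6_port = 0;             /* kernel-chosen port */

	/* V6ONLY off: one socket accepts both address families. */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)) ||
	    bind(fd, (struct sockaddr *)&sin6, sizeof(sin6)) ||
	    listen(fd, 8)) {
		close(fd);
		return -1;
	}
	return fd;
}
```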

v5: Fixed syntax problem (David Miller).

v4: Changed port history comments in rds.h (Sowmini Varadhan).

v3: Added support to set up IPv4 connection using mapped address
 (David Miller).
 Added support to set up connection between link local and non-link
 addresses.
 Various review comments from Santosh Shilimkar and Sowmini Varadhan.

v2: Fixed bound and peer address scope mismatched issue.
 Added back rds_connect() IPv6 changes.

Signed-off-by: Ka-Cheong Poon
---

Acked-by: Santosh Shilimkar 


Re: [PATCH v5 net-next 3/3] rds: Extend RDS API for IPv6 support

2018-07-23 Thread santosh.shilim...@oracle.com

On 7/23/18 8:51 PM, Ka-Cheong Poon wrote:

There are many data structures (RDS socket options) used by RDS apps
which use a 32 bit integer to store an IP address. To support IPv6,
struct in6_addr needs to be used. To ensure backward compatibility, a
new data structure is introduced for each of those data structures
which use a 32 bit integer to represent an IP address, and new socket
options are introduced to use those new structures. This means that
existing apps should work without a problem with the new RDS module.
Apps which want to use IPv6 can use those new data structures and
socket options. An IPv4-mapped address is used to represent an IPv4
address in the new data structures.
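
The IPv4-mapped representation the new structures rely on can be sketched
with standard POSIX types (this is an illustration, not the RDS code):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Store an IPv4 address in a struct in6_addr as an IPv4-mapped address
 * (::ffff:a.b.c.d), so a single in6_addr field covers both families. */
static void ipv4_to_mapped(struct in_addr v4, struct in6_addr *v6)
{
	memset(v6, 0, sizeof(*v6));
	v6->s6_addr[10] = 0xff;                    /* ::ffff: prefix */
	v6->s6_addr[11] = 0xff;
	memcpy(&v6->s6_addr[12], &v4.s_addr, 4);   /* v4 in last 4 bytes */
}
```

The standard IN6_IS_ADDR_V4MAPPED() test then distinguishes the two cases,
much as ipv6_addr_v4mapped() does in the kernel patches.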

v4: Revert changes to SO_RDS_TRANSPORT

Signed-off-by: Ka-Cheong Poon
---

Acked-by: Santosh Shilimkar 


Re: [PATCH v5 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-23 Thread santosh.shilim...@oracle.com

On 7/23/18 8:51 PM, Ka-Cheong Poon wrote:

This patch changes the internal representation of an IP address to use
struct in6_addr.  An IPv4 address is stored as an IPv4-mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But the RDS socket layer is not
modified; it still does not accept an IPv6 address from an application,
and the RDS layer neither accepts nor initiates IPv6 connections.

v2: Fixed sparse warnings.

Signed-off-by: Ka-Cheong Poon
---

Acked-by: Santosh Shilimkar 


Re: [PATCH v5 net-next 0/3] rds: IPv6 support

2018-07-23 Thread David Miller


Hello,

Since you have not fundamentally changed the code, just made
a build failure fix, would you please retain the ACKs that the
previous version received?

I either have to apply this as-is without the ACKs, or wait and
see if that person does the ACKs again for you.

Thank you.


Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 12:38, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 11:43:11 +0900, Toshiaki Makita wrote:
>> On 2018/07/24 10:22, Jakub Kicinski wrote:
>>> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:  
 From: Toshiaki Makita 

 We need some mechanism to disable napi_direct when calling
 xdp_return_frame_rx_napi() from some contexts.
 When veth gets support of XDP_REDIRECT, it will redirect packets which
 are redirected from other devices. On redirection veth will reuse
 xdp_mem_info of the redirection source device to make return_frame work.
 But in this case .ndo_xdp_xmit() called from veth redirection uses
 xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
 not called directly from the rxq which owns the xdp_mem_info.

 This approach introduces a flag in xdp_mem_info to indicate that
 napi_direct should be disabled even when _rx_napi variant is used.

 Signed-off-by: Toshiaki Makita   
>>>
>>> To be clear - you will modify flags of the original source device if it
>>> ever redirected a frame to a software device like veth?  Seems a bit
>>> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
>>> error paths, but still..  Also as you note the original NAPI can run
>>> concurrently with your veth dest one, but also with NAPIs of other veth
>>> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
>>> makes me worried.  
>>
>> xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
>> field is local to the frame. Changing flags affects only the frame.
>> xdp.rxq is local to NAPI thread, so no worries about atomicity.
> 
> Ah, right!  mem_info used to be just 8B, now it would be 12B.
> Alternatively we could perhaps add this info to struct redirect_info,
> through xdp_do_redirect() to avoid the per-frame cost.  I'm not sure
> that's better.

OK, let me check if this works.

-- 
Toshiaki Makita
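
The frame-local copy semantics described above — xdp_mem_info is copied by
value into each frame in convert_to_xdp_frame(), so flag changes never touch
the rxq's original — can be sketched as follows (struct and flag names are
hypothetical stand-ins):

```c
#include <assert.h>

/* xdp_mem_info-style metadata, embedded by value. */
struct mem_info { int id; unsigned int flags; };
struct frame    { struct mem_info mem; };

#define MEM_RF_NO_DIRECT 0x1u   /* stand-in for the patch's flag */

/* Like convert_to_xdp_frame(): the frame gets a *copy* of the rxq's
 * mem_info, so per-frame flag updates are local to that frame. */
static struct frame convert_to_frame(const struct mem_info *rxq_mem)
{
	struct frame f = { .mem = *rxq_mem };   /* value copy, no pointer */
	return f;
}
```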



[PATCH v5 net-next 2/3] rds: Enable RDS IPv6 support

2018-07-23 Thread Ka-Cheong Poon
This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
connection requests.  RDS/RDMA/IB uses a private data (struct
rds_ib_connect_private) exchange between endpoints at RDS connection
establishment time to support RDMA. This private data exchange uses a
32 bit integer to represent an IP address. This needs to be changed in
order to support IPv6. A new private data struct
rds6_ib_connect_private is introduced to handle this. To ensure
backward compatibility, an IPv6-capable RDS stack uses another RDMA
listener port (RDS_CM_PORT) to accept IPv6 connections, and it
continues to use the original RDS_PORT for IPv4 RDS connections. When
it needs to communicate with an IPv6 peer, it uses RDS_CM_PORT to
send the connection setup request.

v5: Fixed syntax problem (David Miller).

v4: Changed port history comments in rds.h (Sowmini Varadhan).

v3: Added support to set up IPv4 connection using mapped address
(David Miller).
Added support to set up connection between link local and non-link
addresses.
Various review comments from Santosh Shilimkar and Sowmini Varadhan.

v2: Fixed bound and peer address scope mismatched issue.
Added back rds_connect() IPv6 changes.

Signed-off-by: Ka-Cheong Poon 
---
 net/rds/af_rds.c | 91 
 net/rds/bind.c   | 59 ++-
 net/rds/connection.c | 54 
 net/rds/ib.c | 55 -
 net/rds/ib_cm.c  | 20 ---
 net/rds/rdma_transport.c | 30 +++-
 net/rds/rdma_transport.h |  5 +++
 net/rds/rds.h| 22 +++-
 net/rds/recv.c   |  2 +-
 net/rds/send.c   | 61 
 net/rds/tcp.c| 54 +---
 net/rds/tcp.h|  2 +-
 net/rds/tcp_connect.c| 54 +---
 net/rds/tcp_listen.c | 64 +++---
 14 files changed, 459 insertions(+), 114 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index fc1a5c6..fc5c48b 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -142,15 +142,32 @@ static int rds_getname(struct socket *sock, struct 
sockaddr *uaddr,
uaddr_len = sizeof(*sin6);
}
} else {
-   /* If socket is not yet bound, set the return address family
-* to be AF_UNSPEC (value 0) and the address size to be that
-* of an IPv4 address.
+   /* If socket is not yet bound and the socket is connected,
+* set the return address family to be the same as the
+* connected address, but with 0 address value.  If it is not
+* connected, set the family to be AF_UNSPEC (value 0) and
+* the address size to be that of an IPv4 address.
 */
if (ipv6_addr_any(&rs->rs_bound_addr)) {
-   sin = (struct sockaddr_in *)uaddr;
-   memset(sin, 0, sizeof(*sin));
-   sin->sin_family = AF_UNSPEC;
-   return sizeof(*sin);
+   if (ipv6_addr_any(&rs->rs_conn_addr)) {
+   sin = (struct sockaddr_in *)uaddr;
+   memset(sin, 0, sizeof(*sin));
+   sin->sin_family = AF_UNSPEC;
+   return sizeof(*sin);
+   }
+
+   if (ipv6_addr_type(&rs->rs_conn_addr) &
+   IPV6_ADDR_MAPPED) {
+   sin = (struct sockaddr_in *)uaddr;
+   memset(sin, 0, sizeof(*sin));
+   sin->sin_family = AF_INET;
+   return sizeof(*sin);
+   }
+
+   sin6 = (struct sockaddr_in6 *)uaddr;
+   memset(sin6, 0, sizeof(*sin6));
+   sin6->sin6_family = AF_INET6;
+   return sizeof(*sin6);
}
if (ipv6_addr_v4mapped(&rs->rs_bound_addr)) {
sin = (struct sockaddr_in *)uaddr;
@@ -484,16 +501,18 @@ static int rds_connect(struct socket *sock, struct 
sockaddr *uaddr,
 {
struct sock *sk = sock->sk;
struct sockaddr_in *sin;
+   struct sockaddr_in6 *sin6;
struct rds_sock *rs = rds_sk_to_rs(sk);
+   int addr_type;
int ret = 0;
 
lock_sock(sk);
 
-   switch (addr_len) {
-   case sizeof(struct sockaddr_in):
+   switch (uaddr->sa_family) {
+   case AF_INET:
sin = (struct sockaddr_in *)uaddr;
-   if (sin->sin_family != AF_INET) {
-   ret = -EAFNOSUPPORT;
+   if (addr_len < sizeof(struct sockaddr_in)) {
+  

[PATCH v5 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-23 Thread Ka-Cheong Poon
This patch changes the internal representation of an IP address to use
struct in6_addr.  An IPv4 address is stored as an IPv4-mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But the RDS socket layer is not
modified; it still does not accept an IPv6 address from an application,
and the RDS layer neither accepts nor initiates IPv6 connections.

v2: Fixed sparse warnings.

Signed-off-by: Ka-Cheong Poon 
---
 net/rds/af_rds.c | 138 +++---
 net/rds/bind.c   |  91 ++-
 net/rds/cong.c   |  23 ++--
 net/rds/connection.c | 132 +
 net/rds/ib.c |  17 +--
 net/rds/ib.h |  51 ++--
 net/rds/ib_cm.c  | 299 ++-
 net/rds/ib_rdma.c|  15 +--
 net/rds/ib_recv.c|  18 +--
 net/rds/ib_send.c|  10 +-
 net/rds/loop.c   |   7 +-
 net/rds/rdma.c   |   6 +-
 net/rds/rdma_transport.c |  56 ++---
 net/rds/rds.h|  70 +++
 net/rds/recv.c   |  51 +---
 net/rds/send.c   |  67 ---
 net/rds/tcp.c|  32 -
 net/rds/tcp_connect.c|  34 +++---
 net/rds/tcp_listen.c |  18 +--
 net/rds/tcp_recv.c   |   9 +-
 net/rds/tcp_send.c   |   4 +-
 net/rds/threads.c|  69 +--
 net/rds/transport.c  |  15 ++-
 23 files changed, 863 insertions(+), 369 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index ab751a1..fc1a5c6 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -113,26 +114,63 @@ void rds_wake_sk_sleep(struct rds_sock *rs)
 static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
   int peer)
 {
-   struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
struct rds_sock *rs = rds_sk_to_rs(sock->sk);
-
-   memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+   struct sockaddr_in6 *sin6;
+   struct sockaddr_in *sin;
+   int uaddr_len;
 
/* racey, don't care */
if (peer) {
-   if (!rs->rs_conn_addr)
+   if (ipv6_addr_any(&rs->rs_conn_addr))
return -ENOTCONN;
 
-   sin->sin_port = rs->rs_conn_port;
-   sin->sin_addr.s_addr = rs->rs_conn_addr;
+   if (ipv6_addr_v4mapped(&rs->rs_conn_addr)) {
+   sin = (struct sockaddr_in *)uaddr;
+   memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+   sin->sin_family = AF_INET;
+   sin->sin_port = rs->rs_conn_port;
+   sin->sin_addr.s_addr = rs->rs_conn_addr_v4;
+   uaddr_len = sizeof(*sin);
+   } else {
+   sin6 = (struct sockaddr_in6 *)uaddr;
+   sin6->sin6_family = AF_INET6;
+   sin6->sin6_port = rs->rs_conn_port;
+   sin6->sin6_addr = rs->rs_conn_addr;
+   sin6->sin6_flowinfo = 0;
+   /* scope_id is the same as in the bound address. */
+   sin6->sin6_scope_id = rs->rs_bound_scope_id;
+   uaddr_len = sizeof(*sin6);
+   }
} else {
-   sin->sin_port = rs->rs_bound_port;
-   sin->sin_addr.s_addr = rs->rs_bound_addr;
+   /* If socket is not yet bound, set the return address family
+* to be AF_UNSPEC (value 0) and the address size to be that
+* of an IPv4 address.
+*/
+   if (ipv6_addr_any(&rs->rs_bound_addr)) {
+   sin = (struct sockaddr_in *)uaddr;
+   memset(sin, 0, sizeof(*sin));
+   sin->sin_family = AF_UNSPEC;
+   return sizeof(*sin);
+   }
+   if (ipv6_addr_v4mapped(&rs->rs_bound_addr)) {
+   sin = (struct sockaddr_in *)uaddr;
+   memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+   sin->sin_family = AF_INET;
+   sin->sin_port = rs->rs_bound_port;
+   sin->sin_addr.s_addr = rs->rs_bound_addr_v4;
+   uaddr_len = sizeof(*sin);
+   } else {
+   sin6 = (struct sockaddr_in6 *)uaddr;
+   sin6->sin6_family = AF_INET6;
+   sin6->sin6_port = rs->rs_bound_port;
+   sin6->sin6_addr = rs->rs_bound_addr;
+ 

[PATCH v5 net-next 3/3] rds: Extend RDS API for IPv6 support

2018-07-23 Thread Ka-Cheong Poon
Many data structures (RDS socket options) used by RDS apps store an IP
address in a 32 bit integer. To support IPv6, struct in6_addr needs to
be used instead. To ensure backward compatibility, a new data structure
is introduced for each structure that represents an IP address as a
32 bit integer, and new socket options are introduced to use those new
structures. This means that existing apps should work without a problem
with the new RDS module. Apps which want to use IPv6 can use the new
data structures and socket options. An IPv4-mapped address is used to
represent an IPv4 address in the new data structures.

v4: Revert changes to SO_RDS_TRANSPORT

Signed-off-by: Ka-Cheong Poon 
---
 include/uapi/linux/rds.h |  69 +++-
 net/rds/connection.c | 101 +++
 net/rds/ib.c |  52 
 net/rds/ib_mr.h  |   2 +
 net/rds/ib_rdma.c|  11 +-
 net/rds/recv.c   |  25 
 net/rds/tcp.c|  44 +
 7 files changed, 293 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 20c6bd0..dc520e1 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
 /*
- * Copyright (c) 2008 Oracle.  All rights reserved.
+ * Copyright (c) 2008, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -118,7 +118,17 @@
 #define RDS_INFO_IB_CONNECTIONS10008
 #define RDS_INFO_CONNECTION_STATS  10009
 #define RDS_INFO_IWARP_CONNECTIONS 10010
-#define RDS_INFO_LAST  10010
+
+/* PF_RDS6 options */
+#define RDS6_INFO_CONNECTIONS  10011
+#define RDS6_INFO_SEND_MESSAGES10012
+#define RDS6_INFO_RETRANS_MESSAGES 10013
+#define RDS6_INFO_RECV_MESSAGES10014
+#define RDS6_INFO_SOCKETS  10015
+#define RDS6_INFO_TCP_SOCKETS  10016
+#define RDS6_INFO_IB_CONNECTIONS   10017
+
+#define RDS_INFO_LAST  10017
 
 struct rds_info_counter {
__u8name[32];
@@ -140,6 +150,15 @@ struct rds_info_connection {
__u8flags;
 } __attribute__((packed));
 
+struct rds6_info_connection {
+   __u64   next_tx_seq;
+   __u64   next_rx_seq;
+   struct in6_addr laddr;
+   struct in6_addr faddr;
+   __u8transport[TRANSNAMSIZ]; /* null term ascii */
+   __u8flags;
+} __attribute__((packed));
+
 #define RDS_INFO_MESSAGE_FLAG_ACK   0x01
 #define RDS_INFO_MESSAGE_FLAG_FAST_ACK  0x02
 
@@ -153,6 +172,17 @@ struct rds_info_message {
__u8flags;
 } __attribute__((packed));
 
+struct rds6_info_message {
+   __u64   seq;
+   __u32   len;
+   struct in6_addr laddr;
+   struct in6_addr faddr;
+   __be16  lport;
+   __be16  fport;
+   __u8flags;
+   __u8tos;
+} __attribute__((packed));
+
 struct rds_info_socket {
__u32   sndbuf;
__be32  bound_addr;
@@ -163,6 +193,16 @@ struct rds_info_socket {
__u64   inum;
 } __attribute__((packed));
 
+struct rds6_info_socket {
+   __u32   sndbuf;
+   struct in6_addr bound_addr;
+   struct in6_addr connected_addr;
+   __be16  bound_port;
+   __be16  connected_port;
+   __u32   rcvbuf;
+   __u64   inum;
+} __attribute__((packed));
+
 struct rds_info_tcp_socket {
__be32  local_addr;
__be16  local_port;
@@ -175,6 +215,18 @@ struct rds_info_tcp_socket {
__u32   last_seen_una;
 } __attribute__((packed));
 
+struct rds6_info_tcp_socket {
+   struct in6_addr local_addr;
+   __be16  local_port;
+   struct in6_addr peer_addr;
+   __be16  peer_port;
+   __u64   hdr_rem;
+   __u64   data_rem;
+   __u32   last_sent_nxt;
+   __u32   last_expected_una;
+   __u32   last_seen_una;
+} __attribute__((packed));
+
 #define RDS_IB_GID_LEN 16
 struct rds_info_rdma_connection {
__be32  src_addr;
@@ -189,6 +241,19 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_size;
 };
 
+struct rds6_info_rdma_connection {
+   struct in6_addr src_addr;
+   struct in6_addr dst_addr;
+   __u8src_gid[RDS_IB_GID_LEN];
+   __u8dst_gid[RDS_IB_GID_LEN];
+
+   __u32   max_send_wr;
+   __u32   max_recv_wr;
+   __u32   max_send_sge;
+   __u32   rdma_mr_max;
+   __u32   rdma_mr_size;
+};
+
 

[PATCH v5 net-next 0/3] rds: IPv6 support

2018-07-23 Thread Ka-Cheong Poon
This patch set adds IPv6 support to the kernel RDS and related
modules.  Existing RDS apps using IPv4 addresses continue to run
without any problem.  New RDS apps which want to use IPv6 addresses can
do so by passing the address in struct sockaddr_in6 to bind(), connect()
or sendmsg().  Those apps also need to use the new IPv6 equivalents of
some of the existing socket options, as the existing options use a
32 bit integer to store an IP address.

All RDS code now uses struct in6_addr to store IP addresses.  An IPv4
address is stored as an IPv4-mapped address.

Header file changes

Many data structures (RDS socket options) used by RDS apps store an IP
address in a 32 bit integer. To support IPv6, struct in6_addr needs to
be used instead. To ensure backward compatibility, a new data structure
is introduced for each structure that represents an IP address as a
32 bit integer, and new socket options are introduced to use those new
structures. This means that existing apps should work without a problem
with the new RDS module. Apps which want to use IPv6 can use the new
data structures and socket options. An IPv4-mapped address is used to
represent an IPv4 address in the new data structures.

Internally, all RDS data structures which contain an IP address are
changed to use struct in6_addr to store it, with an IPv4 address stored
as an IPv4-mapped address. All the functions which take an IP address
as an argument are also changed to use struct in6_addr.

RDS/RDMA/IB uses a private data exchange (struct rds_ib_connect_private)
between endpoints at RDS connection establishment time to support RDMA.
This private data exchange uses a 32 bit integer to represent an IP
address, which needs to change in order to support IPv6. A new private
data struct rds6_ib_connect_private is introduced to handle this. To
ensure backward compatibility, an IPv6-capable RDS stack uses a separate
RDMA listener port (RDS_CM_PORT) to accept IPv6 connections, and it
continues to use the original RDS_PORT for IPv4 RDS connections. When
it needs to communicate with an IPv6 peer, it uses RDS_CM_PORT to send
the connection setup request.

RDS/TCP changes

TCP related code is changed to support IPv6.  Note that only an IPv6
TCP listener on port RDS_TCP_PORT is created as it can accept both
IPv4 and IPv6 connection requests.

IB/RDMA changes

The initial private data exchange between IB endpoints using RDMA is
changed to support IPv6 addresses when the peer address is IPv6.
To ensure backward compatibility, another RDMA listener port
(RDS_CM_PORT) is used to accept IPv6 connections. An IPv6-capable RDS
module continues to use the original RDS_PORT for IPv4 RDS
connections. When it needs to communicate with an IPv6 peer, it uses
RDS_CM_PORT to send the connection setup request.

Ka-Cheong Poon (3):
  rds: Changing IP address internal representation to struct in6_addr
  rds: Enable RDS IPv6 support
  rds: Extend RDS API for IPv6 support

 include/uapi/linux/rds.h |  69 ++-
 net/rds/af_rds.c | 201 --
 net/rds/bind.c   | 136 -
 net/rds/cong.c   |  23 ++--
 net/rds/connection.c | 259 ++-
 net/rds/ib.c | 114 +++--
 net/rds/ib.h |  51 ++--
 net/rds/ib_cm.c  | 309 +++
 net/rds/ib_mr.h  |   2 +
 net/rds/ib_rdma.c|  24 ++--
 net/rds/ib_recv.c|  18 +--
 net/rds/ib_send.c|  10 +-
 net/rds/loop.c   |   7 +-
 net/rds/rdma.c   |   6 +-
 net/rds/rdma_transport.c |  84 ++---
 net/rds/rdma_transport.h |   5 +
 net/rds/rds.h|  88 +-
 net/rds/recv.c   |  76 +---
 net/rds/send.c   | 114 ++---
 net/rds/tcp.c| 128 
 net/rds/tcp.h|   2 +-
 net/rds/tcp_connect.c|  68 ---
 net/rds/tcp_listen.c |  74 +---
 net/rds/tcp_recv.c   |   9 +-
 net/rds/tcp_send.c   |   4 +-
 net/rds/threads.c|  69 +--
 net/rds/transport.c  |  15 ++-
 27 files changed, 1543 insertions(+), 422 deletions(-)

-- 
1.8.3.1



Re: [patch net-next v4 00/12] sched: introduce chain templates support with offloading to mlxsw

2018-07-23 Thread David Miller
From: Jiri Pirko 
Date: Mon, 23 Jul 2018 09:23:03 +0200

> For the TC clsact offload these days, some HW drivers need to hold a
> magic ball. The reason is, with the first rule inserted into the HW
> they need to guess what fields will be used for matching. If this
> guess later proves to be wrong and the user adds a filter matching on
> a different field, there's a problem. Mlxsw resolves it now with a
> couple of patterns, which try to cover as many match fields as
> possible. This approach is far from optimal, both performance-wise
> and scale-wise. Also, there are combinations of filters that in a
> certain order won't succeed.
> 
> Most of the time, when a user inserts filters into a chain, he knows
> right away what the filters are going to look like - what type and
> options they will have. For example, he knows that he will only insert
> filters of type flower matching on destination IP address. He can
> specify a template that covers all the filters in the chain.
> 
> This patchset provides the possibility for the user to pass such a
> template to the kernel and propagate it all the way down to the
> device drivers.

Series applied, thanks Jiri!


Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Jakub Kicinski
On Tue, 24 Jul 2018 11:43:11 +0900, Toshiaki Makita wrote:
> On 2018/07/24 10:22, Jakub Kicinski wrote:
> > On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:  
> >> From: Toshiaki Makita 
> >>
> >> We need some mechanism to disable napi_direct on calling
> >> xdp_return_frame_rx_napi() from some context.
> >> When veth gets support of XDP_REDIRECT, it will redirect packets which
> >> are redirected from other devices. On redirection veth will reuse
> >> xdp_mem_info of the redirection source device to make return_frame work.
> >> But in this case .ndo_xdp_xmit() called from veth redirection uses
> >> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
> >> not called directly from the rxq which owns the xdp_mem_info.
> >>
> >> This approach introduces a flag in xdp_mem_info to indicate that
> >> napi_direct should be disabled even when _rx_napi variant is used.
> >>
> >> Signed-off-by: Toshiaki Makita   
> > 
> > To be clear - you will modify flags of the original source device if it
> > ever redirected a frame to a software device like veth?  Seems a bit
> > heavy handed.  The xdp_return_frame_rx_napi() is only really used on
> > error paths, but still..  Also as you note the original NAPI can run
> > concurrently with your veth dest one, but also with NAPIs of other veth
> > devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
> > makes me worried.  
> 
> xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
> field is local to the frame. Changing flags affects only the frame.
> xdp.rxq is local to NAPI thread, so no worries about atomicity.

Ah, right!  mem_info used to be just 8B, now it would be 12B.
Alternatively we could perhaps add this info to struct redirect_info,
through xdp_do_redirect() to avoid the per-frame cost.  I'm not sure
that's better.

> > Would you mind elaborating why not handle the RX completely in the NAPI
> > context of the original device?  
> 
> Originally it was difficult to implement .ndo_xdp_xmit() and
> .ndo_xdp_flush() model without creating NAPI in veth. Now it is changed
> so I'm not sure how difficult it is at this point.
> But in any case I want to avoid stack inflation by veth NAPI. (Imagine
> some misconfiguration like calling XDP_TX on both sides of veth...)

True :/


Re: [PATCH v4 net-next 2/3] rds: Enable RDS IPv6 support

2018-07-23 Thread Ka-Cheong Poon

On 07/24/2018 11:20 AM, David Miller wrote:

From: Ka-Cheong Poon 
Date: Tue, 24 Jul 2018 11:18:24 +0800


On 07/24/2018 02:15 AM, David Miller wrote:

From: Ka-Cheong Poon 
Date: Mon, 23 Jul 2018 07:16:11 -0700


@@ -163,15 +165,29 @@ int rds_tcp_accept_one(struct socket *sock)
	inet = inet_sk(new_sock->sk);

+	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
+	peer_addr = &new_sock->sk->sk_v6_daddr,
	rdsdebug("accepted tcp %pI6c:%u -> %pI6c:%u\n",

Note that comma, instead of a semicolon, at the end of the peer_addr
assignment.
This doesn't even compile.



Strange, the compiler did not complain.  Will check why's
that.


Try allmodconfig



That catches it.  Thanks!


--
K. Poon
ka-cheong.p...@oracle.com




Re: [pull request][net-next V2 00/12] Mellanox, mlx5e updates 2018-07-18

2018-07-23 Thread David Miller
From: Saeed Mahameed 
Date: Mon, 23 Jul 2018 15:11:17 -0700

> This series includes updates for mlx5e net device driver, with a couple
> of major features and some misc updates.
> 
> Please notice the mlx5-next merge patch at the beginning:
> "Merge branch 'mlx5-next' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux"
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there's any problem.
> 
> v1->v2:
> - Dropped "Support PCIe buffer congestion handling via Devlink" patches until 
> the
> comments are addressed.

Pulled, thanks Saeed.


Re: [PATCH v4 net-next 2/3] rds: Enable RDS IPv6 support

2018-07-23 Thread David Miller
From: Ka-Cheong Poon 
Date: Tue, 24 Jul 2018 11:18:24 +0800

> On 07/24/2018 02:15 AM, David Miller wrote:
>> From: Ka-Cheong Poon 
>> Date: Mon, 23 Jul 2018 07:16:11 -0700
>> 
>>> @@ -163,15 +165,29 @@ int rds_tcp_accept_one(struct socket *sock)
>>> 	inet = inet_sk(new_sock->sk);
>>>
>>> +	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
>>> +	peer_addr = &new_sock->sk->sk_v6_daddr,
>>> 	rdsdebug("accepted tcp %pI6c:%u -> %pI6c:%u\n",
>> Note that comma, instead of a semicolon, at the end of the peer_addr
>> assignment.
>> This doesn't even compile.
> 
> 
> Strange, the compiler did not complain.  Will check why's
> that.

Try allmodconfig


Re: [PATCH v4 net-next 2/3] rds: Enable RDS IPv6 support

2018-07-23 Thread Ka-Cheong Poon

On 07/24/2018 02:15 AM, David Miller wrote:

From: Ka-Cheong Poon 
Date: Mon, 23 Jul 2018 07:16:11 -0700


@@ -163,15 +165,29 @@ int rds_tcp_accept_one(struct socket *sock)

	inet = inet_sk(new_sock->sk);

+	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
+	peer_addr = &new_sock->sk->sk_v6_daddr,
	rdsdebug("accepted tcp %pI6c:%u -> %pI6c:%u\n",


Note that comma, instead of a semicolon, at the end of the peer_addr
assignment.

This doesn't even compile.



Strange, the compiler did not complain.  Will check why's
that.

Thanks.


--
K. Poon
ka-cheong.p...@oracle.com




Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:22, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> We need some mechanism to disable napi_direct on calling
>> xdp_return_frame_rx_napi() from some context.
>> When veth gets support of XDP_REDIRECT, it will redirect packets which
>> are redirected from other devices. On redirection veth will reuse
>> xdp_mem_info of the redirection source device to make return_frame work.
>> But in this case .ndo_xdp_xmit() called from veth redirection uses
>> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
>> not called directly from the rxq which owns the xdp_mem_info.
>>
>> This approach introduces a flag in xdp_mem_info to indicate that
>> napi_direct should be disabled even when _rx_napi variant is used.
>>
>> Signed-off-by: Toshiaki Makita 
> 
> To be clear - you will modify flags of the original source device if it
> ever redirected a frame to a software device like veth?  Seems a bit
> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
> error paths, but still..  Also as you note the original NAPI can run
> concurrently with your veth dest one, but also with NAPIs of other veth
> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
> makes me worried.

xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
field is local to the frame. Changing flags affects only the frame.
xdp.rxq is local to NAPI thread, so no worries about atomicity.

> Would you mind elaborating why not handle the RX completely in the NAPI
> context of the original device?

Originally it was difficult to implement .ndo_xdp_xmit() and
.ndo_xdp_flush() model without creating NAPI in veth. Now it is changed
so I'm not sure how difficult it is at this point.
But in any case I want to avoid stack inflation by veth NAPI. (Imagine
some misconfiguration like calling XDP_TX on both sides of veth...)

> 
>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>> index fcb033f51d8c..1d1bc6553ff2 100644
>> --- a/include/net/xdp.h
>> +++ b/include/net/xdp.h
>> @@ -41,6 +41,9 @@ enum xdp_mem_type {
>>  MEM_TYPE_MAX,
>>  };
>>  
>> +/* XDP flags for xdp_mem_info */
>> +#define XDP_MEM_RF_NO_DIRECTBIT(0)  /* don't use napi_direct */
>> +
>>  /* XDP flags for ndo_xdp_xmit */
>>  #define XDP_XMIT_FLUSH  (1U << 0)   /* doorbell signal 
>> consumer */
>>  #define XDP_XMIT_FLAGS_MASK XDP_XMIT_FLUSH
>> @@ -48,6 +51,7 @@ enum xdp_mem_type {
>>  struct xdp_mem_info {
>>  u32 type; /* enum xdp_mem_type, but known size type */
>>  u32 id;
>> +u32 flags;
>>  };
>>  
>>  struct page_pool;
>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>> index 57285383ed00..1426c608fd75 100644
>> --- a/net/core/xdp.c
>> +++ b/net/core/xdp.c
>> @@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct 
>> xdp_mem_info *mem, bool napi_direct,
>>  /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
>>  xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>>  page = virt_to_head_page(data);
>> -if (xa)
>> +if (xa) {
>> +napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
>>  page_pool_put_page(xa->page_pool, page, napi_direct);
>> -else
>> +} else {
>>  put_page(page);
>> +}
>>  rcu_read_unlock();
>>  break;
>>  case MEM_TYPE_PAGE_SHARED:
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH rdma-next v2 0/8] Support mlx5 flow steering with RAW data

2018-07-23 Thread Jason Gunthorpe
On Mon, Jul 23, 2018 at 03:25:04PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Changelog:
> v1->v2:
>  * Fix matcher to use the correct size.
>  * Rephrase commit log of the first patch.
> v0->v1:
>  * Fixed ADD_UVERBS_ATTRIBUTES_SIMPLE macro to pass the real address.
>  * Replaced UA_ALLOC_AND_COPY with a regular copy_from
>  * Added UVERBS_ATTR_NO_DATA new macro for cleaner code.
>  * Used ib_dev from uobj when it exists.
>  * ib_is_destroy_retryable was replaced by ib_destroy_usecnt
> 
> From Yishai:
> 
> This series introduces vendor create and destroy flow methods on the
> uverbs flow object by using the KABI infra-structure.
> 
> It's done in a way that enables the driver to get its specific device
> attributes in a raw data to match its underlay specification while still
> using the generic ib_flow object for cleanup and code sharing.
> 
> In addition, a specific mlx5 matcher object and its create/destroy
> methods were introduced. This object matches the underlay flow steering
> mask specification and is used as part of mlx5 create flow input data.
> 
> This series supports IB_QP/TIR as its flow steering destination as
> applicable today via the ib_create_flow API, however, it adds also an
> option to work with DEVX object which its destination can be both TIR
> and flow table.
> 
> Few changes were done in the mlx5 core layer to support forward
> compatible for the device specification raw data and to support flow
> table when the DEVX destination is used.
> 
> As part of this series the default IB destroy handler
> (i.e. uverbs_destroy_def_handler()) was exposed from IB core to be
> used by the drivers and existing code was refactored to use it.
> 
> Thanks
> 
> Yishai Hadas (8):
>   net/mlx5: Add forward compatible support for the FTE match data
>   net/mlx5: Add support for flow table destination number
>   IB/mlx5: Introduce flow steering matcher object
>   IB: Consider ib_flow creation by the KABI infrastructure
>   IB/mlx5: Introduce vendor create and destroy flow methods
>   IB/mlx5: Support adding flow steering rule by raw data
>   IB/mlx5: Add support for a flow table destination
>   IB/mlx5: Expose vendor flow trees

This seems fine to me. Can you send the mlx5 shared branch for the
first two patches?

Thanks,
Jason


Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought of as one XDP program calling another using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
...
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> + struct xdp_frame **frames, u32 flags)
>> +{
>> +struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +struct net_device *rcv;
>> +int i, drops = 0;
>> +
>> +if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +return -EINVAL;
>> +
>> +rcv = rcu_dereference(priv->peer);
>> +if (unlikely(!rcv))
>> +return -ENXIO;
>> +
>> +rcv_priv = netdev_priv(rcv);
>> +/* xdp_ring is initialized on receive side? */
>> +if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +return -ENXIO;
>> +
>> +spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +for (i = 0; i < n; i++) {
>> +struct xdp_frame *frame = frames[i];
>> +void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> + __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  

Actually it is guarded by RCU. On closing the device rcv_priv->xdp_prog
is set to NULL, and synchronize_net() is called from within
netif_napi_del(). Then the ptr_ring is cleaned up.
xdp_ok_fwd_dev() is doing the same check as the non-XDP case, but it may
not be appropriate because the IFF_UP check here is not reliable, as you
say.

> 
>> +xdp_return_frame_rx_napi(frame);
>> +drops++;
>> +}
>> +}
>> +spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +if (flags & XDP_XMIT_FLUSH)
>> +__veth_xdp_flush(rcv_priv);
>> +
>> +return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,
>> +.ndo_xdp_xmit   = veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-23 Thread Daniel Borkmann
On 07/24/2018 04:15 AM, Neal Cardwell wrote:
> On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo  wrote:
>>
>> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
>> problem is triggered when the last packet of a request arrives CE
>> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
>> to 1 (because there are no packets in flight). When the 1st packet of
>> the next request arrives, the ACK was sometimes delayed even though it
>> is CWR marked, adding up to 40ms to the RPC latency.
>>
>> This patch ensures that CWR marked data packets arriving will be acked
>> immediately.
> ...
>> Modified based on comments by Neal Cardwell 
>>
>> Signed-off-by: Lawrence Brakmo 
>> ---
>>  net/ipv4/tcp_input.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Seems like a nice mechanism to have, IMHO.
> 
> Acked-by: Neal Cardwell 

Should this go to net tree instead where all the other fixes went?

Thanks,
Daniel


Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-23 Thread Neal Cardwell
On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo  wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives, the ACK was sometimes delayed even though it
> is CWR marked, adding up to 40ms to the RPC latency.
>
> This patch ensures that CWR marked data packets arriving will be acked
> immediately.
...
> Modified based on comments by Neal Cardwell 
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_input.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)

Seems like a nice mechanism to have, IMHO.

Acked-by: Neal Cardwell 

Thanks!
neal


Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought of as one XDP program calling another using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
> 
> Often we can't redirect to devices which don't have an XDP program
> installed.  In your case we can't redirect unless the peer of the
> target has a program installed?  :(

Right. I tried to avoid this case by converting xdp_frames to skb but
realized that should not be done.
https://patchwork.ozlabs.org/patch/903536/

> Perhaps it is time to reconsider what Saeed once asked for, a flag or
> attribute to enable being the destination of a XDP_REDIRECT.

Yes, something will be necessary. Jesper said Tariq had some ideas to
implement it.

> 
>> v2:
>> - Drop the part converting xdp_frame into skb when XDP is not enabled.
>> - Implement bulk interface of ndo_xdp_xmit.
>> - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
>>
>> Signed-off-by: Toshiaki Makita 
>> ---
>>  drivers/net/veth.c | 45 +
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 4be75c58bc6a..57187e955fea 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -17,6 +17,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
>>  return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
>>  }
>>  
>> +static void *veth_xdp_to_ptr(void *ptr)
>> +{
>> +return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
>> +}
>> +
>>  static void veth_ptr_free(void *ptr)
>>  {
>>  if (veth_is_xdp_frame(ptr))
>> @@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int 
>> headroom, int len,
>>  return skb;
>>  }
>>  
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> + struct xdp_frame **frames, u32 flags)
>> +{
>> +struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +struct net_device *rcv;
>> +int i, drops = 0;
>> +
>> +if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +return -EINVAL;
>> +
>> +rcv = rcu_dereference(priv->peer);
>> +if (unlikely(!rcv))
>> +return -ENXIO;
>> +
>> +rcv_priv = netdev_priv(rcv);
>> +/* xdp_ring is initialized on receive side? */
>> +if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +return -ENXIO;
>> +
>> +spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +for (i = 0; i < n; i++) {
>> +struct xdp_frame *frame = frames[i];
>> +void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> + __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  
> 
>> +xdp_return_frame_rx_napi(frame);
>> +drops++;
>> +}
>> +}
>> +spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +if (flags & XDP_XMIT_FLUSH)
>> +__veth_xdp_flush(rcv_priv);
>> +
>> +return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,
>> +.ndo_xdp_xmit   = veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 9:19, kbuild test robot wrote:
> Hi Toshiaki,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on bpf-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 
> master
> config: i386-randconfig-x001-201829 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>In file included from include/linux/kernel.h:10:0,
> from include/linux/list.h:9,
> from include/linux/timer.h:5,
> from include/linux/netdevice.h:28,
> from drivers//net/veth.c:11:
>drivers//net/veth.c: In function 'veth_xdp_xmit':
>>> drivers//net/veth.c:300:16: error: implicit declaration of function 
>>> 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
>   if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||

This is because this series depends on commit d8d7218ad842 ("xdp:
XDP_REDIRECT should check IFF_UP and MTU") which is currently in DaveM's
net-next tree, as I noted in the cover letter.

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-23 Thread Toshiaki Makita
On 2018/07/24 9:27, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> All oversized packets including GSO packets are dropped if XDP is
>> enabled on receiver side, so don't send such packets from peer.
>>
>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>> segment packets with XDP enabled. Also cap MTU accordingly.
>>
>> Signed-off-by: Toshiaki Makita 
> 
> Is there any precedence for fixing up features and MTU like this?  Most
> drivers just refuse to install the program if settings are incompatible.

I don't know of any precedent. I can refuse to install the program when the
features and MTU are not appropriate. Is that preferred?
Note that with the current implementation wanted_features are not touched, so
the features will be restored when the XDP program is removed. The MTU will
not be restored, though, as I do not remember the original MTU.


>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 78fa08cb6e24..f5b72e937d9d 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
>>  return iflink;
>>  }
>>  
>> +static netdev_features_t veth_fix_features(struct net_device *dev,
>> +   netdev_features_t features)
>> +{
>> +struct veth_priv *priv = netdev_priv(dev);
>> +struct net_device *peer;
>> +
>> +peer = rtnl_dereference(priv->peer);
>> +if (peer) {
>> +struct veth_priv *peer_priv = netdev_priv(peer);
>> +
>> +if (peer_priv->_xdp_prog)
>> +features &= ~NETIF_F_GSO_SOFTWARE;
>> +}
>> +
>> +return features;
>> +}
>> +
>>  static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>>  {
>>  struct veth_priv *peer_priv, *priv = netdev_priv(dev);
>> @@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct 
>> bpf_prog *prog,
>>  goto err;
>>  }
>>  }
>> +
>> +if (!old_prog) {
>> +peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>> +peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
>> +peer->hard_header_len -
>> +SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (peer->mtu > peer->max_mtu)
>> +dev_set_mtu(peer, peer->max_mtu);
>> +}
>>  }
>>  
>>  if (old_prog) {
>> -if (!prog && dev->flags & IFF_UP)
>> -veth_disable_xdp(dev);
>> +if (!prog) {
>> +if (dev->flags & IFF_UP)
>> +veth_disable_xdp(dev);
>> +
>> +if (peer) {
>> +peer->hw_features |= NETIF_F_GSO_SOFTWARE;
>> +peer->max_mtu = ETH_MAX_MTU;
>> +}
>> +}
>>  bpf_prog_put(old_prog);
>>  }
>>  
>> +if ((!!old_prog ^ !!prog) && peer)
>> +netdev_update_features(peer);
>> +
>>  return 0;
>>  err:
>>  priv->_xdp_prog = old_prog;
>> @@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  .ndo_poll_controller= veth_poll_controller,
>>  #endif
>>  .ndo_get_iflink = veth_get_iflink,
>> +.ndo_fix_features   = veth_fix_features,
>>  .ndo_features_check = passthru_features_check,
>>  .ndo_set_rx_headroom= veth_set_rx_headroom,
>>  .ndo_bpf= veth_xdp,

-- 
Toshiaki Makita



Re: [PATCH v3 bpf-next 2/8] veth: Add driver XDP

2018-07-23 Thread Toshiaki Makita
Hi Jakub,

Thanks for reviewing!

On 2018/07/24 9:23, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:02 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita 
>>
>> This is the basic implementation of veth driver XDP.
>>
>> Incoming packets are sent from the peer veth device in the form of skb,
>> so this is generally doing the same thing as generic XDP.
>>
>> This itself is not so useful, but a starting point to implement other
>> useful veth XDP features like TX and REDIRECT.
>>
>> This introduces NAPI when XDP is enabled, because XDP now heavily
>> relies on the NAPI context. Use ptr_ring to emulate NIC ring. Tx function
>> enqueues packets to the ring and peer NAPI handler drains the ring.
>>
>> Currently only one ring is allocated for each veth device, so it does
>> not scale on multiqueue env. This can be resolved by allocating rings
>> on the per-queue basis later.
>>
>> Note that NAPI is not used but netif_rx is used when XDP is not loaded,
>> so this does not change the default behaviour.
>>
>> v3:
>> - Fix race on closing the device.
>> - Add extack messages in ndo_bpf.
>>
>> v2:
>> - Squashed with the patch adding NAPI.
>> - Implement adjust_tail.
>> - Don't acquire consumer lock because it is guarded by NAPI.
>> - Make poll_controller noop since it is unnecessary.
>> - Register rxq_info on enabling XDP rather than on opening the device.
>>
>> Signed-off-by: Toshiaki Makita 
> 
>> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
>> +struct sk_buff *skb)
>> +{
>> +u32 pktlen, headroom, act, metalen;
>> +void *orig_data, *orig_data_end;
>> +int size, mac_len, delta, off;
>> +struct bpf_prog *xdp_prog;
>> +struct xdp_buff xdp;
>> +
>> +rcu_read_lock();
>> +xdp_prog = rcu_dereference(priv->xdp_prog);
>> +if (unlikely(!xdp_prog)) {
>> +rcu_read_unlock();
>> +goto out;
>> +}
>> +
>> +mac_len = skb->data - skb_mac_header(skb);
>> +pktlen = skb->len + mac_len;
>> +size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
>> +   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +if (size > PAGE_SIZE)
>> +goto drop;
>> +
>> +headroom = skb_headroom(skb) - mac_len;
>> +if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> +skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
>> +struct sk_buff *nskb;
>> +void *head, *start;
>> +struct page *page;
>> +int head_off;
>> +
>> +page = alloc_page(GFP_ATOMIC);
>> +if (!page)
>> +goto drop;
>> +
>> +head = page_address(page);
>> +start = head + VETH_XDP_HEADROOM;
>> +if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
>> +page_frag_free(head);
>> +goto drop;
>> +}
>> +
>> +nskb = veth_build_skb(head,
>> +  VETH_XDP_HEADROOM + mac_len, skb->len,
>> +  PAGE_SIZE);
>> +if (!nskb) {
>> +page_frag_free(head);
>> +goto drop;
>> +}
> 
>> +static int veth_enable_xdp(struct net_device *dev)
>> +{
>> +struct veth_priv *priv = netdev_priv(dev);
>> +int err;
>> +
>> +if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
>> +err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
>> +if (err < 0)
>> +return err;
>> +
>> +err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
>> + MEM_TYPE_PAGE_SHARED, NULL);
> 
> nit: doesn't matter much but looks like a mix of MEM_TYPE_PAGE_SHARED
>  and MEM_TYPE_PAGE_ORDER0

Actually I'm not sure when to use MEM_TYPE_PAGE_ORDER0. It seems a page
allocated by alloc_page() can be freed by page_frag_free(), which is more
lightweight than put_page(), isn't it?
virtio_net is doing it in a similar way.

-- 
Toshiaki Makita



Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge

2018-07-23 Thread D'Souza, Nelson
Hi David,

I copied and pasted the configs onto my device, but pings on test-vrf do not
work in my setup.
I'm essentially seeing the same issue as I reported before.

In this case, pings sent out on test-vrf (host ns) are received and replied to 
by the loopback interface (foo ns). Although the replies are seen at the 
test-vrf level, they are not locally delivered to the ping application.

Logs are as follows...

a) pings on test-vrf or br0 fail.

# ping -I test-vrf 172.16.2.2 -c1 -w1
PING 172.16.2.2 (172.16.2.2): 56 data bytes

--- 172.16.2.2 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

b) tcpdump in the foo namespace, shows icmp echos/replies on veth2

# ip netns exec foo tcpdump -i veth2 icmp -c 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth2, link-type EN10MB (Ethernet), capture size 262144 bytes
18:34:13.205210 IP 172.16.1.1 > 172.16.2.2: ICMP echo request, id 19513, seq 0, 
length 64
18:34:13.205253 IP 172.16.2.2 > 172.16.1.1: ICMP echo reply, id 19513, seq 0, 
length 64
2 packets captured
2 packets received by filter
0 packets dropped by kernel

c) tcpdump in the host namespace, shows icmp echos/replies on test-vrf, br0 and 
veth1:

# tcpdump -i test-vrf icmp -c 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on test-vrf, link-type EN10MB (Ethernet), capture size 262144 bytes
18:34:13.204061 IP 172.16.1.1 > 172.16.2.2: ICMP echo request, id 19513, seq 0, 
length 64
18:34:13.205278 IP 172.16.2.2 > 172.16.1.1: ICMP echo reply, id 19513, seq 0, 
length 64
2 packets captured
2 packets received by filter
0 packets dropped by kernel

Thanks,
Nelson

On 7/23/18, 3:00 PM, "David Ahern"  wrote:

On 7/20/18 1:03 PM, D'Souza, Nelson wrote:
> Setup is as follows:
> 
> ethUSB(ingress port) -> mgmtbr0 (bridge) -> mgmtvrf (vrf)



 [ test-vrf ]                 |  netns foo
      |                       |
 [ br0 ] 172.16.1.1           |
      |                       |
 [ veth1 ] =========|======== [ veth2 ]        lo
                    |         172.16.1.2       172.16.2.2


Copy and paste the following into your environment:

ip netns add foo
ip li add veth1 type veth peer name veth2
ip li set veth2 netns foo

ip -netns foo li set lo up
ip -netns foo li set veth2 up
ip -netns foo addr add 172.16.1.2/24 dev veth2


ip li add test-vrf type vrf table 123
ip li set test-vrf up
ip ro add vrf test-vrf unreachable default

ip li add  br0 type bridge
ip li set veth1 master br0
ip li set veth1 up
ip li set br0 up
ip addr add dev br0 172.16.1.1/24
ip li set br0 master test-vrf

ip -netns foo addr add 172.16.2.2/32 dev lo
ip ro add vrf test-vrf 172.16.2.2/32 via 172.16.1.2

Does ping work?
# ping -I test-vrf 172.16.2.2
ping: Warning: source address might be selected on device other than
test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.228 ms
64 bytes from 172.16.2.2: icmp_seq=2 ttl=64 time=0.263 ms

and:
# ping -I br0 172.16.2.2
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 br0: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.227 ms
64 bytes from 172.16.2.2: icmp_seq=2 ttl=64 time=0.223 ms
^C
--- 172.16.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.223/0.225/0.227/0.002 ms




Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-23 Thread Willem de Bruijn
On Mon, Jul 23, 2018 at 8:55 PM Stephen Hemminger
 wrote:
>
> On Mon, 23 Jul 2018 16:11:19 -0700
> Caleb Raitto  wrote:
>
> > From: Caleb Raitto 
> >
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> >
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> >
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> >
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> >
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn 
> > Acked-by: Jon Olson 
> > Signed-off-by: Caleb Raitto 
> > ---
>
> Not a fan of this.
> Module parameters are frowned on by the distributions because they
> never get well tested and they force the user to do magic things
> to enable features. It looks like you are using it to paper
> over a bug in this case.

This has actually been a catch-22 that this patch tries to break.

In microbenchmarks napi-tx was an improvement in most cases. We need
wider validation to make it the default, or even to start enabling it for
some users. But we cannot get the data, because understandably no one is
going to make it the default without more data.

Enabling the feature selectively for a safe rollout is the intent of the
(temporary) napi_tx param. But the requirement to have
has_affinity_set is proving a real obstruction.

Especially in cases where we enable the feature to do A:B comparisons,
and thus are monitoring performance metrics closely, we should be able
to override the kernel heuristic.


Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita 
> 
> We need some mechanism to disable napi_direct on calling
> xdp_return_frame_rx_napi() from some context.
> When veth gets support of XDP_REDIRECT, it will redirects packets which
> are redirected from other devices. On redirection veth will reuse
> xdp_mem_info of the redirection source device to make return_frame work.
> But in this case .ndo_xdp_xmit() called from veth redirection uses
> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
> not called directly from the rxq which owns the xdp_mem_info.
> 
> This approach introduces a flag in xdp_mem_info to indicate that
> napi_direct should be disabled even when _rx_napi variant is used.
> 
> Signed-off-by: Toshiaki Makita 

To be clear - you will modify the flags of the original source device if it
ever redirected a frame to a software device like veth?  Seems a bit
heavy-handed.  The xdp_return_frame_rx_napi() is only really used on
error paths, but still...  Also as you note the original NAPI can run
concurrently with your veth dest one, but also with NAPIs of other veth
devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
makes me worried.

Would you mind elaborating why not handle the RX completely in the NAPI
context of the original device?

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index fcb033f51d8c..1d1bc6553ff2 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -41,6 +41,9 @@ enum xdp_mem_type {
>   MEM_TYPE_MAX,
>  };
>  
> +/* XDP flags for xdp_mem_info */
> +#define XDP_MEM_RF_NO_DIRECT BIT(0)  /* don't use napi_direct */
> +
>  /* XDP flags for ndo_xdp_xmit */
>  #define XDP_XMIT_FLUSH   (1U << 0)   /* doorbell signal 
> consumer */
>  #define XDP_XMIT_FLAGS_MASK  XDP_XMIT_FLUSH
> @@ -48,6 +51,7 @@ enum xdp_mem_type {
>  struct xdp_mem_info {
>   u32 type; /* enum xdp_mem_type, but known size type */
>   u32 id;
> + u32 flags;
>  };
>  
>  struct page_pool;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 57285383ed00..1426c608fd75 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct 
> xdp_mem_info *mem, bool napi_direct,
>   /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
>   xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>   page = virt_to_head_page(data);
> - if (xa)
> + if (xa) {
> + napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
>   page_pool_put_page(xa->page_pool, page, napi_direct);
> - else
> + } else {
>   put_page(page);
> + }
>   rcu_read_unlock();
>   break;
>   case MEM_TYPE_PAGE_SHARED:



Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita 
> 
> This allows NIC's XDP to redirect packets to veth. The destination veth
> device enqueues redirected packets to the napi ring of its peer, then
> they are processed by XDP on its peer veth device.
> This can be thought as calling another XDP program by XDP program using
> REDIRECT, when the peer enables driver XDP.
> 
> Note that when the peer veth device does not set driver xdp, redirected
> packets will be dropped because the peer is not ready for NAPI.

Often we can't redirect to devices which don't have an XDP program
installed.  In your case we can't redirect unless the peer of the
target has a program installed?  :(

Perhaps it is time to reconsider what Saeed once asked for, a flag or
attribute to enable being the destination of a XDP_REDIRECT.

> v2:
> - Drop the part converting xdp_frame into skb when XDP is not enabled.
> - Implement bulk interface of ndo_xdp_xmit.
> - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
> 
> Signed-off-by: Toshiaki Makita 
> ---
>  drivers/net/veth.c | 45 +
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 4be75c58bc6a..57187e955fea 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
>   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
>  }
>  
> +static void *veth_xdp_to_ptr(void *ptr)
> +{
> + return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
> +}
> +
>  static void veth_ptr_free(void *ptr)
>  {
>   if (veth_is_xdp_frame(ptr))
> @@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int 
> headroom, int len,
>   return skb;
>  }
>  
> +static int veth_xdp_xmit(struct net_device *dev, int n,
> +  struct xdp_frame **frames, u32 flags)
> +{
> + struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> + struct net_device *rcv;
> + int i, drops = 0;
> +
> + if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
> + return -EINVAL;
> +
> + rcv = rcu_dereference(priv->peer);
> + if (unlikely(!rcv))
> + return -ENXIO;
> +
> + rcv_priv = netdev_priv(rcv);
> + /* xdp_ring is initialized on receive side? */
> + if (!rcu_access_pointer(rcv_priv->xdp_prog))
> + return -ENXIO;
> +
> + spin_lock(&rcv_priv->xdp_ring.producer_lock);
> + for (i = 0; i < n; i++) {
> + struct xdp_frame *frame = frames[i];
> + void *ptr = veth_xdp_to_ptr(frame);
> +
> + if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
> +  __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {

Would you mind sparing a few more words on how this is safe vs. the
.ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
guarantee the device doesn't go down right after that check, or is
already down, but netdev->flags are not atomic...  

> + xdp_return_frame_rx_napi(frame);
> + drops++;
> + }
> + }
> + spin_unlock(&rcv_priv->xdp_ring.producer_lock);
> +
> + if (flags & XDP_XMIT_FLUSH)
> + __veth_xdp_flush(rcv_priv);
> +
> + return n - drops;
> +}
> +
>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>   struct xdp_frame *frame)
>  {
> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>   .ndo_features_check = passthru_features_check,
>   .ndo_set_rx_headroom= veth_set_rx_headroom,
>   .ndo_bpf= veth_xdp,
> + .ndo_xdp_xmit   = veth_xdp_xmit,
>  };
>  
>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \



Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-23 Thread Stephen Hemminger
On Mon, 23 Jul 2018 16:11:19 -0700
Caleb Raitto  wrote:

> From: Caleb Raitto 
> 
> The driver disables tx napi if it's not certain that completions will
> be processed affine with tx service.
> 
> Its heuristic doesn't account for some scenarios where it is, such as
> when the queue pair count matches the core but not hyperthread count.
> 
> Allow userspace to override the heuristic. This is an alternative
> solution to that in the linked patch. That added more logic in the
> kernel for these cases, but the agreement was that this was better left
> to user control.
> 
> Do not expand the existing napi_tx variable to a ternary value,
> because doing so can break user applications that expect
> boolean ('Y'/'N') instead of integer output. Add a new param instead.
> 
> Link: https://patchwork.ozlabs.org/patch/725249/
> Acked-by: Willem de Bruijn 
> Acked-by: Jon Olson 
> Signed-off-by: Caleb Raitto 
> ---

Not a fan of this.
Module parameters are frowned on by the distributions because they
never get well tested and they force the user to do magic things
to enable features. It looks like you are using it to paper
over a bug in this case.


[PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-23 Thread Lawrence Brakmo
We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed even though it
is CWR marked, adding up to 40ms to the RPC latency.

This patch ensures that arriving CWR-marked data packets will be ACKed
immediately.

Packetdrill script to reproduce the problem:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect01] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
// Previously the ACK sequence below would be 4501, causing a long RTO
+0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
+0 > [ect01] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257

Modified based on comments by Neal Cardwell 

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 91dbb9afb950..2370fd79c5c5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -246,8 +246,15 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct tcp_sock *tp, const struct sk_buff *skb)
 {
-   if (tcp_hdr(skb)->cwr)
+   if (tcp_hdr(skb)->cwr) {
tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+
+   /* If the sender is telling us it has entered CWR, then its
+* cwnd may be very low (even just 1 packet), so we should ACK
+* immediately.
+*/
+   tcp_enter_quickack_mode((struct sock *)tp, 2);
+   }
 }
 
 static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
-- 
2.17.1



Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-23 Thread Paul Burton
Hi Hauke,

On Sat, Jul 21, 2018 at 09:13:57PM +0200, Hauke Mehrtens wrote:
> diff --git a/arch/mips/lantiq/xway/sysctrl.c b/arch/mips/lantiq/xway/sysctrl.c
> index e0af39b33e28..c704312ef7d5 100644
> --- a/arch/mips/lantiq/xway/sysctrl.c
> +++ b/arch/mips/lantiq/xway/sysctrl.c
> @@ -536,7 +536,7 @@ void __init ltq_soc_init(void)
>   clkdev_add_pmu(NULL, "ahb", 1, 0, PMU_AHBM | PMU_AHBS);
>  
>   clkdev_add_pmu("1da0.usif", "NULL", 1, 0, PMU_USIF);
> - clkdev_add_pmu("1e108000.eth", NULL, 0, 0,
> + clkdev_add_pmu("1e10b308.eth", NULL, 0, 0,
>   PMU_SWITCH | PMU_PPE_DPLUS | PMU_PPE_DPLUM |
>   PMU_PPE_EMA | PMU_PPE_TC | PMU_PPE_SLL01 |
>   PMU_PPE_QSB | PMU_PPE_TOP);

Is this intentional?

Why is it needed? Was the old address wrong? Does it change anything
functionally?

If it is needed it seems like a separate change - unless there's some
reason it's tied to adding this driver?

Should this really apply only to the lantiq,vr9 case or also to the
similar lantiq,grx390 & lantiq,ar10 paths?

Whatever the answers to these questions it would be good to include them
in the commit message.

Thanks,
Paul


Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread kbuild test robot
Hi Toshiaki,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-x010-201829 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/kernel.h:10:0,
from include/linux/list.h:9,
from include/linux/timer.h:5,
from include/linux/netdevice.h:28,
from drivers//net/veth.c:11:
   drivers//net/veth.c: In function 'veth_xdp_xmit':
   drivers//net/veth.c:300:16: error: implicit declaration of function 
'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> drivers//net/veth.c:300:3: note: in expansion of macro 'if'
  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
  ^~
   include/linux/compiler.h:48:24: note: in expansion of macro 
'__branch_check__'
#  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
   ^~~~
>> drivers//net/veth.c:300:7: note: in expansion of macro 'unlikely'
  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
  ^~~~
   cc1: some warnings being treated as errors

vim +/if +300 drivers//net/veth.c

   275  
   276  static int veth_xdp_xmit(struct net_device *dev, int n,
   277   struct xdp_frame **frames, u32 flags)
   278  {
   279  struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
   280  struct net_device *rcv;
   281  int i, drops = 0;
   282  
   283  if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
   284  return -EINVAL;
   285  
   286  rcv = rcu_dereference(priv->peer);
   287  if (unlikely(!rcv))
   288  return -ENXIO;
   289  
   290  rcv_priv = netdev_priv(rcv);
   291  /* xdp_ring is initialized on receive side? */
   292  if (!rcu_access_pointer(rcv_priv->xdp_prog))
   293  return -ENXIO;
   294  
   295  spin_lock(&rcv_priv->xdp_ring.producer_lock);
   296  for (i = 0; i < n; i++) {
   297  struct xdp_frame *frame = frames[i];
   298  void *ptr = veth_xdp_to_ptr(frame);
   299  
 > 300  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   301   __ptr_ring_produce(&rcv_priv->xdp_ring, 
ptr))) {
   302  xdp_return_frame_rx_napi(frame);
   303  drops++;
   304  }
   305  }
   306  spin_unlock(&rcv_priv->xdp_ring.producer_lock);
   307  
   308  if (flags & XDP_XMIT_FLUSH)
   309  __veth_xdp_flush(rcv_priv);
   310  
   311  return n - drops;
   312  }
   313  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita 
> 
> All oversized packets including GSO packets are dropped if XDP is
> enabled on receiver side, so don't send such packets from peer.
> 
> Drop TSO and SCTP fragmentation features so that veth devices themselves
> segment packets with XDP enabled. Also cap MTU accordingly.
> 
> Signed-off-by: Toshiaki Makita 

Is there any precedence for fixing up features and MTU like this?  Most
drivers just refuse to install the program if settings are incompatible.

> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 78fa08cb6e24..f5b72e937d9d 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
>   return iflink;
>  }
>  
> +static netdev_features_t veth_fix_features(struct net_device *dev,
> +netdev_features_t features)
> +{
> + struct veth_priv *priv = netdev_priv(dev);
> + struct net_device *peer;
> +
> + peer = rtnl_dereference(priv->peer);
> + if (peer) {
> + struct veth_priv *peer_priv = netdev_priv(peer);
> +
> + if (peer_priv->_xdp_prog)
> + features &= ~NETIF_F_GSO_SOFTWARE;
> + }
> +
> + return features;
> +}
> +
>  static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>  {
>   struct veth_priv *peer_priv, *priv = netdev_priv(dev);
> @@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct 
> bpf_prog *prog,
>   goto err;
>   }
>   }
> +
> + if (!old_prog) {
> + peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
> + peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
> + peer->hard_header_len -
> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + if (peer->mtu > peer->max_mtu)
> + dev_set_mtu(peer, peer->max_mtu);
> + }
>   }
>  
>   if (old_prog) {
> - if (!prog && dev->flags & IFF_UP)
> - veth_disable_xdp(dev);
> + if (!prog) {
> + if (dev->flags & IFF_UP)
> + veth_disable_xdp(dev);
> +
> + if (peer) {
> + peer->hw_features |= NETIF_F_GSO_SOFTWARE;
> + peer->max_mtu = ETH_MAX_MTU;
> + }
> + }
>   bpf_prog_put(old_prog);
>   }
>  
> + if ((!!old_prog ^ !!prog) && peer)
> + netdev_update_features(peer);
> +
>   return 0;
>  err:
>   priv->_xdp_prog = old_prog;
> @@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
>   .ndo_poll_controller= veth_poll_controller,
>  #endif
>   .ndo_get_iflink = veth_get_iflink,
> + .ndo_fix_features   = veth_fix_features,
>   .ndo_features_check = passthru_features_check,
>   .ndo_set_rx_headroom= veth_set_rx_headroom,
>   .ndo_bpf= veth_xdp,



Re: [PATCH v3 bpf-next 2/8] veth: Add driver XDP

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 00:13:02 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita 
> 
> This is the basic implementation of veth driver XDP.
> 
> Incoming packets are sent from the peer veth device in the form of skb,
> so this is generally doing the same thing as generic XDP.
> 
> This itself is not so useful, but a starting point to implement other
> useful veth XDP features like TX and REDIRECT.
> 
> This introduces NAPI when XDP is enabled, because XDP now heavily
> relies on the NAPI context. Use ptr_ring to emulate NIC ring. Tx function
> enqueues packets to the ring and peer NAPI handler drains the ring.
> 
> Currently only one ring is allocated for each veth device, so it does
> not scale on multiqueue env. This can be resolved by allocating rings
> on the per-queue basis later.
> 
> Note that NAPI is not used but netif_rx is used when XDP is not loaded,
> so this does not change the default behaviour.
> 
> v3:
> - Fix race on closing the device.
> - Add extack messages in ndo_bpf.
> 
> v2:
> - Squashed with the patch adding NAPI.
> - Implement adjust_tail.
> - Don't acquire consumer lock because it is guarded by NAPI.
> - Make poll_controller noop since it is unnecessary.
> - Register rxq_info on enabling XDP rather than on opening the device.
> 
> Signed-off-by: Toshiaki Makita 

> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
> + struct sk_buff *skb)
> +{
> + u32 pktlen, headroom, act, metalen;
> + void *orig_data, *orig_data_end;
> + int size, mac_len, delta, off;
> + struct bpf_prog *xdp_prog;
> + struct xdp_buff xdp;
> +
> + rcu_read_lock();
> + xdp_prog = rcu_dereference(priv->xdp_prog);
> + if (unlikely(!xdp_prog)) {
> + rcu_read_unlock();
> + goto out;
> + }
> +
> + mac_len = skb->data - skb_mac_header(skb);
> + pktlen = skb->len + mac_len;
> + size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
> +SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + if (size > PAGE_SIZE)
> + goto drop;
> +
> + headroom = skb_headroom(skb) - mac_len;
> + if (skb_shared(skb) || skb_head_is_locked(skb) ||
> + skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
> + struct sk_buff *nskb;
> + void *head, *start;
> + struct page *page;
> + int head_off;
> +
> + page = alloc_page(GFP_ATOMIC);
> + if (!page)
> + goto drop;
> +
> + head = page_address(page);
> + start = head + VETH_XDP_HEADROOM;
> + if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
> + page_frag_free(head);
> + goto drop;
> + }
> +
> + nskb = veth_build_skb(head,
> +   VETH_XDP_HEADROOM + mac_len, skb->len,
> +   PAGE_SIZE);
> + if (!nskb) {
> + page_frag_free(head);
> + goto drop;
> + }

> +static int veth_enable_xdp(struct net_device *dev)
> +{
> + struct veth_priv *priv = netdev_priv(dev);
> + int err;
> +
> + if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
> + err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
> + if (err < 0)
> + return err;
> +
> + err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
> +  MEM_TYPE_PAGE_SHARED, NULL);

nit: doesn't matter much but looks like a mix of MEM_TYPE_PAGE_SHARED
 and MEM_TYPE_PAGE_ORDER0


Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit

2018-07-23 Thread kbuild test robot
Hi Toshiaki,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-x001-201829 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/linux/kernel.h:10:0,
from include/linux/list.h:9,
from include/linux/timer.h:5,
from include/linux/netdevice.h:28,
from drivers//net/veth.c:11:
   drivers//net/veth.c: In function 'veth_xdp_xmit':
>> drivers//net/veth.c:300:16: error: implicit declaration of function 
>> 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   ^
   include/linux/compiler.h:77:42: note: in definition of macro 'unlikely'
# define unlikely(x) __builtin_expect(!!(x), 0)
 ^
   cc1: some warnings being treated as errors

vim +/xdp_ok_fwd_dev +300 drivers//net/veth.c

   275  
   276  static int veth_xdp_xmit(struct net_device *dev, int n,
   277   struct xdp_frame **frames, u32 flags)
   278  {
   279  struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
   280  struct net_device *rcv;
   281  int i, drops = 0;
   282  
   283  if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
   284  return -EINVAL;
   285  
   286  rcv = rcu_dereference(priv->peer);
   287  if (unlikely(!rcv))
   288  return -ENXIO;
   289  
   290  rcv_priv = netdev_priv(rcv);
   291  /* xdp_ring is initialized on receive side? */
   292  if (!rcu_access_pointer(rcv_priv->xdp_prog))
   293  return -ENXIO;
   294  
   295  spin_lock(&rcv_priv->xdp_ring.producer_lock);
   296  for (i = 0; i < n; i++) {
   297  struct xdp_frame *frame = frames[i];
   298  void *ptr = veth_xdp_to_ptr(frame);
   299  
 > 300  if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   301   __ptr_ring_produce(&rcv_priv->xdp_ring, 
ptr))) {
   302  xdp_return_frame_rx_napi(frame);
   303  drops++;
   304  }
   305  }
   306  spin_unlock(&rcv_priv->xdp_ring.producer_lock);
   307  
   308  if (flags & XDP_XMIT_FLUSH)
   309  __veth_xdp_flush(rcv_priv);
   310  
   311  return n - drops;
   312  }
   313  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 1/4] MIPS: lantiq: Do not enable IRQs in dma open

2018-07-23 Thread Paul Burton
Hi Hauke,

On Sat, Jul 21, 2018 at 09:13:55PM +0200, Hauke Mehrtens wrote:
> When a DMA channel is opened, the IRQ should not get activated
> automatically; this allows the driver to pull data out manually without
> the help of interrupts. This is needed for a workaround in the vrx200
> Ethernet driver.
> 
> Signed-off-by: Hauke Mehrtens 
> ---
>  arch/mips/lantiq/xway/dma.c| 1 -
>  drivers/net/ethernet/lantiq_etop.c | 1 +
>  2 files changed, 1 insertion(+), 1 deletion(-)

If you'd like this to go via the netdev tree to keep it with the rest of
the series:

Acked-by: Paul Burton 

Though I'd be happier if we didn't have DMA code seemingly used only by
an ethernet driver in arch/mips/ :)

Thanks,
Paul


[PATCH net-next] cbs: Add support for the graft function

2018-07-23 Thread Vinicius Costa Gomes
This allows installing a child qdisc under cbs. The main use case is to
install the ETF (Earliest TxTime First) qdisc under cbs, so there is
another level of control for time-sensitive traffic.
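The accounting rule behind the cbs_child_enqueue()/cbs_child_dequeue() helpers in this patch -- the parent mirrors the child's queue length only on a successful child operation -- can be sketched with toy types (hypothetical names, not the kernel Qdisc API):

```c
/*
 * Toy model of parent/child qdisc delegation: the parent's qlen must
 * track every successful child enqueue/dequeue, and must stay untouched
 * when the child refuses a packet, otherwise the parent's stats drift.
 */
#include <stddef.h>

struct toy_qdisc {
	int buf[16];
	int head, tail;
	int qlen;
};

static int toy_enqueue(struct toy_qdisc *q, int pkt)
{
	if (q->qlen == 16)
		return -1;
	q->buf[q->tail] = pkt;
	q->tail = (q->tail + 1) % 16;
	q->qlen++;
	return 0;
}

static int toy_dequeue(struct toy_qdisc *q, int *pkt)
{
	if (q->qlen == 0)
		return -1;
	*pkt = q->buf[q->head];
	q->head = (q->head + 1) % 16;
	q->qlen--;
	return 0;
}

/* parent enqueue: delegate, then update own counter only on success */
static int parent_enqueue(struct toy_qdisc *parent, struct toy_qdisc *child,
			  int pkt)
{
	if (toy_enqueue(child, pkt) != 0)
		return -1;	/* child refused: parent stats untouched */
	parent->qlen++;
	return 0;
}

static int parent_dequeue(struct toy_qdisc *parent, struct toy_qdisc *child,
			  int *pkt)
{
	if (toy_dequeue(child, pkt) != 0)
		return -1;
	parent->qlen--;
	return 0;
}
```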

Signed-off-by: Vinicius Costa Gomes 
---
 net/sched/sch_cbs.c | 134 +---
 1 file changed, 125 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
index cdd96b9a27bc..e26a24017faa 100644
--- a/net/sched/sch_cbs.c
+++ b/net/sched/sch_cbs.c
@@ -78,18 +78,42 @@ struct cbs_sched_data {
s64 sendslope; /* in bytes/s */
s64 idleslope; /* in bytes/s */
struct qdisc_watchdog watchdog;
-   int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch);
+   int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+  struct sk_buff **to_free);
struct sk_buff *(*dequeue)(struct Qdisc *sch);
+   struct Qdisc *qdisc;
 };
 
-static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch)
+static int cbs_child_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+struct Qdisc *child,
+struct sk_buff **to_free)
 {
-   return qdisc_enqueue_tail(skb, sch);
+   int err;
+
+   err = child->ops->enqueue(skb, child, to_free);
+   if (err != NET_XMIT_SUCCESS)
+   return err;
+
+   qdisc_qstats_backlog_inc(sch, skb);
+   sch->q.qlen++;
+
+   return NET_XMIT_SUCCESS;
 }
 
-static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch)
+static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch,
+  struct sk_buff **to_free)
 {
struct cbs_sched_data *q = qdisc_priv(sch);
+   struct Qdisc *qdisc = q->qdisc;
+
+   return cbs_child_enqueue(skb, sch, qdisc, to_free);
+}
+
+static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch,
+   struct sk_buff **to_free)
+{
+   struct cbs_sched_data *q = qdisc_priv(sch);
+   struct Qdisc *qdisc = q->qdisc;
 
if (sch->q.qlen == 0 && q->credits > 0) {
/* We need to stop accumulating credits when there's
@@ -99,7 +123,7 @@ static int cbs_enqueue_soft(struct sk_buff *skb, struct 
Qdisc *sch)
q->last = ktime_get_ns();
}
 
-   return qdisc_enqueue_tail(skb, sch);
+   return cbs_child_enqueue(skb, sch, qdisc, to_free);
 }
 
 static int cbs_enqueue(struct sk_buff *skb, struct Qdisc *sch,
@@ -107,7 +131,7 @@ static int cbs_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
 {
struct cbs_sched_data *q = qdisc_priv(sch);
 
-   return q->enqueue(skb, sch);
+   return q->enqueue(skb, sch, to_free);
 }
 
 /* timediff is in ns, slope is in bytes/s */
@@ -132,9 +156,25 @@ static s64 credits_from_len(unsigned int len, s64 slope, 
s64 port_rate)
return div64_s64(len * slope, port_rate);
 }
 
+static struct sk_buff *cbs_child_dequeue(struct Qdisc *sch, struct Qdisc 
*child)
+{
+   struct sk_buff *skb;
+
+   skb = child->ops->dequeue(child);
+   if (!skb)
+   return NULL;
+
+   qdisc_qstats_backlog_dec(sch, skb);
+   qdisc_bstats_update(sch, skb);
+   sch->q.qlen--;
+
+   return skb;
+}
+
 static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch)
 {
struct cbs_sched_data *q = qdisc_priv(sch);
+   struct Qdisc *qdisc = q->qdisc;
s64 now = ktime_get_ns();
struct sk_buff *skb;
s64 credits;
@@ -157,8 +197,7 @@ static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch)
return NULL;
}
}
-
-   skb = qdisc_dequeue_head(sch);
+   skb = cbs_child_dequeue(sch, qdisc);
if (!skb)
return NULL;
 
@@ -178,7 +217,10 @@ static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch)
 
 static struct sk_buff *cbs_dequeue_offload(struct Qdisc *sch)
 {
-   return qdisc_dequeue_head(sch);
+   struct cbs_sched_data *q = qdisc_priv(sch);
+   struct Qdisc *qdisc = q->qdisc;
+
+   return cbs_child_dequeue(sch, qdisc);
 }
 
 static struct sk_buff *cbs_dequeue(struct Qdisc *sch)
@@ -310,6 +352,13 @@ static int cbs_init(struct Qdisc *sch, struct nlattr *opt,
return -EINVAL;
}
 
+   q->qdisc = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+sch->handle, extack);
+   if (!q->qdisc)
+   return -ENOMEM;
+
+   qdisc_hash_add(q->qdisc, false);
+
q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
 
q->enqueue = cbs_enqueue_soft;
@@ -328,6 +377,9 @@ static void cbs_destroy(struct Qdisc *sch)
qdisc_watchdog_cancel(&q->watchdog);
 
cbs_disable_offload(dev, q);
+
+   if (q->qdisc)
+   qdisc_destroy(q->qdisc);
 }
 
 static int cbs_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -356,8 +408,72 @@ static int cbs_dump(struct Qdisc *sch, struct sk_buff *skb)
retur

[PATCH net v2] ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull

2018-07-23 Thread Willem de Bruijn
From: Willem de Bruijn 

Syzbot reported a read beyond the end of the skb head when returning
IPV6_ORIGDSTADDR:

  BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
  CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:113
kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
copy_to_user include/linux/uaccess.h:184 [inline]
put_cmsg+0x5ef/0x860 net/core/scm.c:242
ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
[..]

This logic and its ipv4 counterpart read the destination port from
the packet at skb_transport_offset(skb) + 4.

With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
packet that stores headers exactly up to skb_transport_offset(skb) in
the head and the remainder in a frag.

Call pskb_may_pull before accessing the pointer to ensure that it lies
in skb head.
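A toy model of the head/frag split (illustrative C, not kernel code; all names are made up) shows why the length check alone is insufficient and what pulling guarantees:

```c
/*
 * Toy model of the bug: an "skb" whose bytes are split between a linear
 * head and a paged frag.  Reading the 4 port bytes through a raw pointer
 * is only safe once they are in the head, which is what a pull
 * guarantees by copying from the frag when needed.  A pure length check
 * (offset + 4 <= len) does NOT guarantee linearity.
 */
#include <string.h>

struct toy_skb {
	unsigned char head[64];	/* linear area */
	unsigned char frag[64];	/* bytes beyond headlen live here */
	int headlen;		/* bytes currently linear in head */
	int len;		/* total bytes (head + frag) */
};

/* Ensure the first n bytes are linear; 1 on success, 0 on failure. */
static int toy_may_pull(struct toy_skb *skb, int n)
{
	if (n <= skb->headlen)
		return 1;	/* already linear */
	if (n > skb->len || n > (int)sizeof(skb->head))
		return 0;	/* not enough data at all */
	memcpy(skb->head + skb->headlen, skb->frag, n - skb->headlen);
	skb->headlen = n;
	return 1;
}

/* Safe variant of the cmsg code: pull before dereferencing. */
static int toy_read_dport(struct toy_skb *skb, int thoff,
			  unsigned short *port)
{
	int end = thoff + 4;

	if (!toy_may_pull(skb, end))
		return -1;
	/* ports live in the first 4 bytes of the transport header;
	 * the destination port is bytes 2..3 */
	memcpy(port, skb->head + thoff + 2, sizeof(*port));
	return 0;
}
```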

Link: 
http://lkml.kernel.org/r/CAF=yd-lejwzj5a1-baaj2oy_hkmgygv6rsj_woraynv-fna...@mail.gmail.com
Reported-by: syzbot+9adb4b567003cac78...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/ip_sockglue.c | 7 +--
 net/ipv6/datagram.c| 7 +--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 64c76dcf7386..c0fe5ad996f2 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -150,15 +150,18 @@ static void ip_cmsg_recv_dstaddr(struct msghdr *msg, 
struct sk_buff *skb)
 {
struct sockaddr_in sin;
const struct iphdr *iph = ip_hdr(skb);
-   __be16 *ports = (__be16 *)skb_transport_header(skb);
+   __be16 *ports;
+   int end;
 
-   if (skb_transport_offset(skb) + 4 > (int)skb->len)
+   end = skb_transport_offset(skb) + 4;
+   if (end > 0 && !pskb_may_pull(skb, end))
return;
 
/* All current transport protocols have the port numbers in the
 * first four bytes of the transport header and this function is
 * written with this assumption in mind.
 */
+   ports = (__be16 *)skb_transport_header(skb);
 
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = iph->daddr;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2ee08b6a86a4..1a1f876f8e28 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -700,13 +700,16 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, 
struct msghdr *msg,
}
if (np->rxopt.bits.rxorigdstaddr) {
struct sockaddr_in6 sin6;
-   __be16 *ports = (__be16 *) skb_transport_header(skb);
+   __be16 *ports;
+   int end;
 
-   if (skb_transport_offset(skb) + 4 <= (int)skb->len) {
+   end = skb_transport_offset(skb) + 4;
+   if (end <= 0 || pskb_may_pull(skb, end)) {
/* All current transport protocols have the port 
numbers in the
 * first four bytes of the transport header and this 
function is
 * written with this assumption in mind.
 */
+   ports = (__be16 *)skb_transport_header(skb);
 
sin6.sin6_family = AF_INET6;
sin6.sin6_addr = ipv6_hdr(skb)->daddr;
-- 
2.18.0.233.g985f88cf7e-goog



Re: [patch net-next v4 00/12] sched: introduce chain templates support with offloading to mlxsw

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 09:23:03 +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> For the TC clsact offload these days, some of HW drivers need
> to hold a magic ball. The reason is, with the first inserted rule inside
> HW they need to guess what fields will be used for the matching. If
> later on this guess proves to be wrong and user adds a filter with a
> different field to match, there's a problem. Mlxsw resolves it now with
> couple of patterns. Those try to cover as many match fields as possible.
> This approach is far from optimal, both performance-wise and scale-wise.
> Also, there is a combination of filters that in certain order won't
> succeed.
> 
> Most of the time, when a user inserts filters into a chain, he knows
> right away what the filters are going to look like - what type and
> options they will have. For example, he knows that he will only insert
> filters of type
> flower matching destination IP address. He can specify a template that
> would cover all the filters in the chain.
> 
> This patchset is providing the possibility to user to provide such
> template to kernel and propagate it all the way down to device
> drivers.
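Conceptually, a chain template reduces to a subset check on match fields: a filter is accepted into the chain only if every field it matches on is covered by the template, which is what lets a driver size the HW region up front. A minimal sketch with hypothetical field bits (not kernel flags):

```c
/* Hypothetical match-field bits -- illustration only. */
#define MATCH_DST_IP	(1u << 0)
#define MATCH_SRC_IP	(1u << 1)
#define MATCH_PROTO	(1u << 2)

/*
 * A chain template as a bitmask of allowed match fields: a new filter
 * is accepted only if its fields are a subset of the template's, so the
 * driver never has to guess what a later filter might match on.
 */
static int template_allows(unsigned int tmpl, unsigned int filter_fields)
{
	return (filter_fields & ~tmpl) == 0;
}
```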

LGTM, thanks for the changes!


Re: [PATCH bpf] bpf: btf: Ensure the member->offset is in the right order

2018-07-23 Thread Daniel Borkmann
On 07/23/2018 08:45 PM, Yonghong Song wrote:
> On 7/20/18 5:38 PM, Martin KaFai Lau wrote:
>> This patch ensures the member->offset of a struct
>> is in the correct order (i.e the later member's offset cannot
>> go backward).
>>
>> The current "pahole -J" BTF encoder does not generate something
>> like this.  However, checking this can ensure future encoder
>> will not violate this.
>>
>> Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)")
>> Signed-off-by: Martin KaFai Lau 
> Acked-by: Yonghong Song 

Applied to bpf, thanks!


[PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-23 Thread Caleb Raitto
From: Caleb Raitto 

The driver disables tx napi if it's not certain that completions will
be processed affine with tx service.

Its heuristic doesn't account for some scenarios where it is, such as
when the queue pair count matches the core count but not the
hyperthread count.

Allow userspace to override the heuristic. This is an alternative
solution to that in the linked patch. That added more logic in the
kernel for these cases, but the agreement was that this was better left
to user control.

Do not expand the existing napi_tx variable to a ternary value,
because doing so can break user applications that expect
boolean ('Y'/'N') instead of integer output. Add a new param instead.

Link: https://patchwork.ozlabs.org/patch/725249/
Acked-by: Willem de Bruijn 
Acked-by: Jon Olson 
Signed-off-by: Caleb Raitto 
---
 drivers/net/virtio_net.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2ff08bc103a9..d9aca4e90d6b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -39,10 +39,11 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true, napi_tx;
+static bool csum = true, gso = true, napi_tx, force_napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(force_napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -1201,7 +1202,7 @@ static void virtnet_napi_tx_enable(struct virtnet_info 
*vi,
/* Tx napi touches cachelines on the cpu handling tx interrupts. Only
 * enable the feature if this is likely affine with the transmit path.
 */
-   if (!vi->affinity_hint_set) {
+   if (!vi->affinity_hint_set && !force_napi_tx) {
napi->weight = 0;
return;
}
@@ -2646,7 +2647,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
   napi_weight);
netif_tx_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
- napi_tx ? napi_weight : 0);
+ (napi_tx || force_napi_tx) ? napi_weight : 0);
 
sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH net-next] tls: Fix improper revert in zerocopy_from_iter

2018-07-23 Thread Doron Roberts-Kedes
The current code is problematic because the iov_iter is reverted and
never advanced in the non-error case. This patch skips the revert in the
non-error case. It also fixes the amount by which the iov_iter is
reverted: currently it is reverted by size, which can be greater than
the amount by which the iter was actually advanced. Instead, mimic the
tx path, which reverts by the difference in the iterator's position
before and after zerocopy_from_iter.

Fixes: 4718799817c5 ("tls: Fix zerocopy_from_iter iov handling")
Signed-off-by: Doron Roberts-Kedes 
---
 net/tls/tls_sw.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 490f2bcc6313..2ea000baebf8 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -276,7 +276,7 @@ static int zerocopy_from_iter(struct sock *sk, struct 
iov_iter *from,
  int length, int *pages_used,
  unsigned int *size_used,
  struct scatterlist *to, int to_max_pages,
- bool charge, bool revert)
+ bool charge)
 {
struct page *pages[MAX_SKB_FRAGS];
 
@@ -327,8 +327,6 @@ static int zerocopy_from_iter(struct sock *sk, struct 
iov_iter *from,
 out:
*size_used = size;
*pages_used = num_elem;
-   if (revert)
-   iov_iter_revert(from, size);
 
return rc;
 }
@@ -431,7 +429,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
&ctx->sg_plaintext_size,
ctx->sg_plaintext_data,
ARRAY_SIZE(ctx->sg_plaintext_data),
-   true, false);
+   true);
if (ret)
goto fallback_to_reg_send;
 
@@ -811,6 +809,7 @@ int tls_sw_recvmsg(struct sock *sk,
likely(!(flags & MSG_PEEK)))  {
struct scatterlist sgin[MAX_SKB_FRAGS + 1];
int pages = 0;
+   int orig_chunk = chunk;
 
zc = true;
sg_init_table(sgin, MAX_SKB_FRAGS + 1);
@@ -820,9 +819,11 @@ int tls_sw_recvmsg(struct sock *sk,
err = zerocopy_from_iter(sk, &msg->msg_iter,
 to_copy, &pages,
 &chunk, &sgin[1],
-MAX_SKB_FRAGS, false, 
true);
-   if (err < 0)
+MAX_SKB_FRAGS, false);
+   if (err < 0) {
+   iov_iter_revert(&msg->msg_iter, chunk - 
orig_chunk);
goto fallback_to_reg_recv;
+   }
 
err = decrypt_skb(sk, skb, sgin);
for (; pages > 0; pages--)
-- 
2.17.1



[net-next V2 08/12] net/mlx5e: Remove redundant WARN when we cannot find neigh entry

2018-07-23 Thread Saeed Mahameed
From: Roi Dayan 

It is possible for the neigh entry not to exist if it was already
cleaned. When we bring down an interface the neigh entry gets deleted,
but it could be that our listener for the neigh event, which clears the
encap valid bit, has not run yet while the neigh update-last-used work
started first. In this scenario the encap entry has the valid bit set
but the neigh entry doesn't exist.

Signed-off-by: Roi Dayan 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0edf4751a8ba..335a08bc381d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1032,10 +1032,8 @@ void mlx5e_tc_update_neigh_used_value(struct 
mlx5e_neigh_hash_entry *nhe)
 * dst ip pair
 */
n = neigh_lookup(tbl, &m_neigh->dst_ip, m_neigh->dev);
-   if (!n) {
-   WARN(1, "The neighbour already freed\n");
+   if (!n)
return;
-   }
 
neigh_event_send(n, NULL);
neigh_release(n);
-- 
2.17.0



[net-next V2 11/12] net/mlx5e: Support offloading double vlan push/pop tc actions

2018-07-23 Thread Saeed Mahameed
From: Jianbo Liu 

As we can configure two push/pop actions in one flow table entry,
add support for offloading those double vlan actions in a rule to HW.

Signed-off-by: Jianbo Liu 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 46 ++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 21 ++---
 .../mellanox/mlx5/core/eswitch_offloads.c | 11 +++--
 3 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 35b3e135ae1d..e9888d6c1f7c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2583,24 +2583,48 @@ static int parse_tc_vlan_action(struct mlx5e_priv *priv,
struct mlx5_esw_flow_attr *attr,
u32 *action)
 {
+   u8 vlan_idx = attr->total_vlan;
+
+   if (vlan_idx >= MLX5_FS_VLAN_DEPTH)
+   return -EOPNOTSUPP;
+
if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
-   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+   if (vlan_idx) {
+   if (!mlx5_eswitch_vlan_actions_supported(priv->mdev,
+
MLX5_FS_VLAN_DEPTH))
+   return -EOPNOTSUPP;
+
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP_2;
+   } else {
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+   }
} else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
-   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
-   attr->vlan_vid[0] = tcf_vlan_push_vid(a);
-   if (mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
-   attr->vlan_prio[0] = tcf_vlan_push_prio(a);
-   attr->vlan_proto[0] = tcf_vlan_push_proto(a);
-   if (!attr->vlan_proto[0])
-   attr->vlan_proto[0] = htons(ETH_P_8021Q);
-   } else if (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
-  tcf_vlan_push_prio(a)) {
-   return -EOPNOTSUPP;
+   attr->vlan_vid[vlan_idx] = tcf_vlan_push_vid(a);
+   attr->vlan_prio[vlan_idx] = tcf_vlan_push_prio(a);
+   attr->vlan_proto[vlan_idx] = tcf_vlan_push_proto(a);
+   if (!attr->vlan_proto[vlan_idx])
+   attr->vlan_proto[vlan_idx] = htons(ETH_P_8021Q);
+
+   if (vlan_idx) {
+   if (!mlx5_eswitch_vlan_actions_supported(priv->mdev,
+
MLX5_FS_VLAN_DEPTH))
+   return -EOPNOTSUPP;
+
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH_2;
+   } else {
+   if (!mlx5_eswitch_vlan_actions_supported(priv->mdev, 1) 
&&
+   (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
+tcf_vlan_push_prio(a)))
+   return -EOPNOTSUPP;
+
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
}
} else { /* action is TCA_VLAN_ACT_MODIFY */
return -EOPNOTSUPP;
}
 
+   attr->total_vlan = vlan_idx + 1;
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index befa0011efee..c17bfcab517c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "lib/mpfs.h"
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -256,9 +257,10 @@ struct mlx5_esw_flow_attr {
int out_count;
 
int action;
-   __be16  vlan_proto[1];
-   u16 vlan_vid[1];
-   u8  vlan_prio[1];
+   __be16  vlan_proto[MLX5_FS_VLAN_DEPTH];
+   u16 vlan_vid[MLX5_FS_VLAN_DEPTH];
+   u8  vlan_prio[MLX5_FS_VLAN_DEPTH];
+   u8  total_vlan;
boolvlan_handled;
u32 encap_id;
u32 mod_hdr_id;
@@ -282,10 +284,17 @@ int mlx5_eswitch_del_vlan_action(struct mlx5_eswitch *esw,
 int __mlx5_eswitch_set_vport_vlan(struct mlx5_eswitch *esw,
  int vport, u16 vlan, u8 qos, u8 set_flags);
 
-static inline bool mlx5_eswitch_vlan_actions_supported(struct mlx5_core_dev 
*dev)
+static inline bool mlx5_eswitch_vlan_actions_supported(struct mlx5_core_dev 
*dev,
+  u8 vlan_depth)
 {
-   return MLX5_CAP_ESW_FLOWTABLE_FDB(dev, pop_vlan) &&
-  MLX5_CAP_ESW_FLOWTABLE_FDB(dev, push_vlan);
+   bool ret = MLX5_CAP_ESW_FLOWTABLE_FDB(dev, pop_vlan) &&
+  MLX5_CAP_ESW_

[net-next V2 10/12] net/mlx5e: Refactor tc vlan push/pop actions offloading

2018-07-23 Thread Saeed Mahameed
From: Jianbo Liu 

Extract the vlan actions offloading code into a new function, and also
extend the data structures for double vlan actions.

Signed-off-by: Jianbo Liu 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 51 ---
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  6 +--
 .../mellanox/mlx5/core/eswitch_offloads.c | 12 ++---
 3 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index dcb8c4993811..35b3e135ae1d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2578,6 +2578,32 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
return err;
 }
 
+static int parse_tc_vlan_action(struct mlx5e_priv *priv,
+   const struct tc_action *a,
+   struct mlx5_esw_flow_attr *attr,
+   u32 *action)
+{
+   if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+   } else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
+   *action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
+   attr->vlan_vid[0] = tcf_vlan_push_vid(a);
+   if (mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
+   attr->vlan_prio[0] = tcf_vlan_push_prio(a);
+   attr->vlan_proto[0] = tcf_vlan_push_proto(a);
+   if (!attr->vlan_proto[0])
+   attr->vlan_proto[0] = htons(ETH_P_8021Q);
+   } else if (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
+  tcf_vlan_push_prio(a)) {
+   return -EOPNOTSUPP;
+   }
+   } else { /* action is TCA_VLAN_ACT_MODIFY */
+   return -EOPNOTSUPP;
+   }
+
+   return 0;
+}
+
 static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
struct mlx5e_tc_flow_parse_attr *parse_attr,
struct mlx5e_tc_flow *flow)
@@ -2589,6 +2615,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
LIST_HEAD(actions);
bool encap = false;
u32 action = 0;
+   int err;
 
if (!tcf_exts_has_actions(exts))
return -EINVAL;
@@ -2605,8 +2632,6 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
}
 
if (is_tcf_pedit(a)) {
-   int err;
-
err = parse_tc_pedit_action(priv, a, 
MLX5_FLOW_NAMESPACE_FDB,
parse_attr);
if (err)
@@ -2673,23 +2698,11 @@ static int parse_tc_fdb_actions(struct mlx5e_priv 
*priv, struct tcf_exts *exts,
}
 
if (is_tcf_vlan(a)) {
-   if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
-   action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
-   } else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
-   action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
-   attr->vlan_vid = tcf_vlan_push_vid(a);
-   if 
(mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
-   attr->vlan_prio = tcf_vlan_push_prio(a);
-   attr->vlan_proto = 
tcf_vlan_push_proto(a);
-   if (!attr->vlan_proto)
-   attr->vlan_proto = 
htons(ETH_P_8021Q);
-   } else if (tcf_vlan_push_proto(a) != 
htons(ETH_P_8021Q) ||
-  tcf_vlan_push_prio(a)) {
-   return -EOPNOTSUPP;
-   }
-   } else { /* action is TCA_VLAN_ACT_MODIFY */
-   return -EOPNOTSUPP;
-   }
+   err = parse_tc_vlan_action(priv, a, attr, &action);
+
+   if (err)
+   return err;
+
attr->mirror_count = attr->out_count;
continue;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index b174da2884c5..befa0011efee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -256,9 +256,9 @@ struct mlx5_esw_flow_attr {
int out_count;
 
int action;
-   __be16  vlan_proto;
-   u16 vlan_vid;
-   u8  vlan_prio;
+   __be16  vlan_proto[1];
+   u16 vlan_vid[1];
+   u8   

[net-next V2 05/12] net/mlx5: FW tracer, parse traces and kernel tracing support

2018-07-23 Thread Saeed Mahameed
From: Feras Daoud 

For each message the driver should do the following:
1- Find the message string in the strings database
2- Count the number of params of each message
3- Wait for the param events and accumulate them
4- Calculate the event timestamp using the local event timestamp
and the first timestamp event following it.
5- Print message to trace log
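Step 2 above can be sketched as a simplified userspace model of the mlx5_tracer_get_num_of_params() helper in this patch (the in-place "%llx" rewrite works because "%x%x" is also exactly four characters):

```c
/*
 * Simplified model of counting trace params in a format string: each
 * 64-bit "%llx" placeholder arrives as two 32-bit trace params, so it
 * is rewritten in place to "%x%x" before counting '%' characters.
 */
#include <string.h>

static int toy_num_params(char *str)
{
	char *substr = strstr(str, "%llx");
	int n = 0;

	/* rewrite every "%llx" (4 chars) to "%x%x" (also 4 chars) */
	while (substr) {
		memcpy(substr, "%x%x", 4);
		substr = strstr(substr, "%llx");
	}

	/* every remaining '%' now stands for exactly one 32-bit param */
	for (substr = strchr(str, '%'); substr;
	     substr = strchr(substr + 1, '%'))
		n++;

	return n;
}
```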

Enable the tracing by:
echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable

Read traces by:
cat /sys/kernel/debug/tracing/trace

Signed-off-by: Feras Daoud 
Signed-off-by: Erez Shitrit 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 235 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  22 ++
 .../mlx5/core/diag/fw_tracer_tracepoint.h |  78 ++
 3 files changed, 333 insertions(+), 2 deletions(-)
 create mode 100644 
drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index bd887d1d3396..309842de272c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -29,8 +29,9 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  */
-
+#define CREATE_TRACE_POINTS
 #include "fw_tracer.h"
+#include "fw_tracer_tracepoint.h"
 
 static int mlx5_query_mtrc_caps(struct mlx5_fw_tracer *tracer)
 {
@@ -332,6 +333,109 @@ static void mlx5_fw_tracer_arm(struct mlx5_core_dev *dev)
mlx5_core_warn(dev, "FWTracer: Failed to arm tracer event 
%d\n", err);
 }
 
+static const char *VAL_PARM= "%llx";
+static const char *REPLACE_64_VAL_PARM = "%x%x";
+static const char *PARAM_CHAR  = "%";
+
+static int mlx5_tracer_message_hash(u32 message_id)
+{
+   return jhash_1word(message_id, 0) & (MESSAGE_HASH_SIZE - 1);
+}
+
+static struct tracer_string_format *mlx5_tracer_message_insert(struct 
mlx5_fw_tracer *tracer,
+  struct 
tracer_event *tracer_event)
+{
+   struct hlist_head *head =
+   
&tracer->hash[mlx5_tracer_message_hash(tracer_event->string_event.tmsn)];
+   struct tracer_string_format *cur_string;
+
+   cur_string = kzalloc(sizeof(*cur_string), GFP_KERNEL);
+   if (!cur_string)
+   return NULL;
+
+   hlist_add_head(&cur_string->hlist, head);
+
+   return cur_string;
+}
+
+static struct tracer_string_format *mlx5_tracer_get_string(struct 
mlx5_fw_tracer *tracer,
+  struct tracer_event 
*tracer_event)
+{
+   struct tracer_string_format *cur_string;
+   u32 str_ptr, offset;
+   int i;
+
+   str_ptr = tracer_event->string_event.string_param;
+
+   for (i = 0; i < tracer->str_db.num_string_db; i++) {
+   if (str_ptr > tracer->str_db.base_address_out[i] &&
+   str_ptr < tracer->str_db.base_address_out[i] +
+   tracer->str_db.size_out[i]) {
+   offset = str_ptr - tracer->str_db.base_address_out[i];
+   /* add it to the hash */
+   cur_string = mlx5_tracer_message_insert(tracer, 
tracer_event);
+   if (!cur_string)
+   return NULL;
+   cur_string->string = (char *)(tracer->str_db.buffer[i] +
+   offset);
+   return cur_string;
+   }
+   }
+
+   return NULL;
+}
+
+static void mlx5_tracer_clean_message(struct tracer_string_format *str_frmt)
+{
+   hlist_del(&str_frmt->hlist);
+   kfree(str_frmt);
+}
+
+static int mlx5_tracer_get_num_of_params(char *str)
+{
+   char *substr, *pstr = str;
+   int num_of_params = 0;
+
+   /* replace %llx with %x%x */
+   substr = strstr(pstr, VAL_PARM);
+   while (substr) {
+   memcpy(substr, REPLACE_64_VAL_PARM, 4);
+   pstr = substr;
+   substr = strstr(pstr, VAL_PARM);
+   }
+
+   /* count all the % characters */
+   substr = strstr(str, PARAM_CHAR);
+   while (substr) {
+   num_of_params += 1;
+   str = substr + 1;
+   substr = strstr(str, PARAM_CHAR);
+   }
+
+   return num_of_params;
+}
+
+static struct tracer_string_format *mlx5_tracer_message_find(struct hlist_head 
*head,
+u8 event_id, u32 
tmsn)
+{
+   struct tracer_string_format *message;
+
+   hlist_for_each_entry(message, head, hlist)
+   if (message->event_id == event_id && message->tmsn == tmsn)
+   return message;
+
+   return NULL;
+}
+
+static struct tracer_string_format *mlx5_tracer_message_get(struct 
mlx5_fw_tracer *tracer,
+   

[net-next V2 06/12] net/mlx5: FW tracer, Enable tracing

2018-07-23 Thread Saeed Mahameed
From: Feras Daoud 

Add the tracer file to the makefile and add the init
function to the load one flow.

Signed-off-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 18 --
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index d923f2f58608..55d5a5c2e9d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -6,7 +6,7 @@ mlx5_core-y :=  main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \
mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
-   diag/fs_tracepoint.o
+   diag/fs_tracepoint.o diag/fw_tracer.o
 
 mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 8d310e7d6743..0347f2dd5cee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -170,6 +170,6 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev);
 int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer);
-void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe) { return; }
+void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f9b950e1bd85..6ddbb70e95de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -62,6 +62,7 @@
 #include "accel/ipsec.h"
 #include "accel/tls.h"
 #include "lib/clock.h"
+#include "diag/fw_tracer.h"
 
 MODULE_AUTHOR("Eli Cohen ");
MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -990,6 +991,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
goto err_sriov_cleanup;
}
 
+   dev->tracer = mlx5_fw_tracer_create(dev);
+
return 0;
 
 err_sriov_cleanup:
@@ -1015,6 +1018,7 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
 
 static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 {
+   mlx5_fw_tracer_destroy(dev->tracer);
mlx5_fpga_cleanup(dev);
mlx5_sriov_cleanup(dev);
mlx5_eswitch_cleanup(dev->priv.eswitch);
@@ -1167,10 +1171,16 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
goto err_put_uars;
}
 
+   err = mlx5_fw_tracer_init(dev->tracer);
+   if (err) {
+   dev_err(&pdev->dev, "Failed to init FW tracer\n");
+   goto err_fw_tracer;
+   }
+
err = alloc_comp_eqs(dev);
if (err) {
dev_err(&pdev->dev, "Failed to alloc completion EQs\n");
-   goto err_stop_eqs;
+   goto err_comp_eqs;
}
 
err = mlx5_irq_set_affinity_hints(dev);
@@ -1252,7 +1262,10 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 err_affinity_hints:
free_comp_eqs(dev);
 
-err_stop_eqs:
+err_comp_eqs:
+   mlx5_fw_tracer_cleanup(dev->tracer);
+
+err_fw_tracer:
mlx5_stop_eqs(dev);
 
 err_put_uars:
@@ -1320,6 +1333,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
mlx5_fpga_device_stop(dev);
mlx5_irq_clear_affinity_hints(dev);
free_comp_eqs(dev);
+   mlx5_fw_tracer_cleanup(dev->tracer);
mlx5_stop_eqs(dev);
mlx5_put_uars_page(dev, priv->uar);
mlx5_free_irq_vectors(dev);
-- 
2.17.0



[net-next V2 09/12] net/mlx5e: Support offloading tc double vlan headers match

2018-07-23 Thread Saeed Mahameed
From: Jianbo Liu 

We can match on both outer and inner vlan tags; add support for
offloading such matches.
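
The TPID dispatch used in this patch, applied identically to the outer and
the second (inner) vlan header, can be sketched in plain C. The helper name
below is illustrative, not driver API; the TPID constants match the IEEE
values used by the kernel:

```c
#include <stdint.h>

#define ETH_P_8021AD 0x88A8 /* service VLAN tag, QinQ outer header */
#define ETH_P_8021Q  0x8100 /* customer VLAN tag */

/* 802.1ad selects the svlan match bits; any other TPID (typically
 * 802.1Q) falls back to the cvlan match bits. */
static int is_svlan(uint16_t tpid)
{
	return tpid == ETH_P_8021AD;
}
```

The driver applies this dispatch twice: on headers_c/headers_v for the
outer tag, and on the misc parameters for the second tag.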

Signed-off-by: Jianbo Liu 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 55 ++-
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 335a08bc381d..dcb8c4993811 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1235,6 +1235,10 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
   outer_headers);
void *headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
   outer_headers);
+   void *misc_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
+   misc_parameters);
+   void *misc_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
+   misc_parameters);
u16 addr_type = 0;
u8 ip_proto = 0;
 
@@ -1245,6 +1249,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
  BIT(FLOW_DISSECTOR_KEY_BASIC) |
  BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
  BIT(FLOW_DISSECTOR_KEY_VLAN) |
+ BIT(FLOW_DISSECTOR_KEY_CVLAN) |
  BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
  BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
  BIT(FLOW_DISSECTOR_KEY_PORTS) |
@@ -1325,9 +1330,18 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
skb_flow_dissector_target(f->dissector,
  FLOW_DISSECTOR_KEY_VLAN,
  f->mask);
-   if (mask->vlan_id || mask->vlan_priority) {
-   MLX5_SET(fte_match_set_lyr_2_4, headers_c, cvlan_tag, 1);
-   MLX5_SET(fte_match_set_lyr_2_4, headers_v, cvlan_tag, 1);
+   if (mask->vlan_id || mask->vlan_priority || mask->vlan_tpid) {
+   if (key->vlan_tpid == htons(ETH_P_8021AD)) {
+   MLX5_SET(fte_match_set_lyr_2_4, headers_c,
+svlan_tag, 1);
+   MLX5_SET(fte_match_set_lyr_2_4, headers_v,
+svlan_tag, 1);
+   } else {
+   MLX5_SET(fte_match_set_lyr_2_4, headers_c,
+cvlan_tag, 1);
+   MLX5_SET(fte_match_set_lyr_2_4, headers_v,
+cvlan_tag, 1);
+   }
 
MLX5_SET(fte_match_set_lyr_2_4, headers_c, first_vid, mask->vlan_id);
MLX5_SET(fte_match_set_lyr_2_4, headers_v, first_vid, key->vlan_id);
@@ -1339,6 +1353,41 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
}
}
 
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CVLAN)) {
+   struct flow_dissector_key_vlan *key =
+   skb_flow_dissector_target(f->dissector,
+ FLOW_DISSECTOR_KEY_CVLAN,
+ f->key);
+   struct flow_dissector_key_vlan *mask =
+   skb_flow_dissector_target(f->dissector,
+ FLOW_DISSECTOR_KEY_CVLAN,
+ f->mask);
+   if (mask->vlan_id || mask->vlan_priority || mask->vlan_tpid) {
+   if (key->vlan_tpid == htons(ETH_P_8021AD)) {
+   MLX5_SET(fte_match_set_misc, misc_c,
+outer_second_svlan_tag, 1);
+   MLX5_SET(fte_match_set_misc, misc_v,
+outer_second_svlan_tag, 1);
+   } else {
+   MLX5_SET(fte_match_set_misc, misc_c,
+outer_second_cvlan_tag, 1);
+   MLX5_SET(fte_match_set_misc, misc_v,
+outer_second_cvlan_tag, 1);
+   }
+
+   MLX5_SET(fte_match_set_misc, misc_c, outer_second_vid,
+mask->vlan_id);
+   MLX5_SET(fte_match_set_misc, misc_v, outer_second_vid,
+key->vlan_id);
+   MLX5_SET(fte_match_set_misc, misc_c, outer_second_prio,
+mask->vlan_priority);
+   MLX5_SET(fte_match_set_misc, misc_v, outer_second_prio,
+key->vlan_priority);
+
+   *match_level = MLX

[net-next V2 12/12] net/mlx5e: Use PARTIAL_GSO for UDP segmentation

2018-07-23 Thread Saeed Mahameed
From: Boris Pismenny 

This patch removes the splitting of UDP_GSO_L4 packets in the driver,
and exposes UDP_GSO_L4 as a PARTIAL_GSO feature. Thus, the network stack
is not responsible for splitting the packet into two.
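
The driver-side fixup that remains after this patch can be sketched as
follows. Under the assumption stated above — with PARTIAL_GSO each segment
carries gso_size bytes of UDP payload — only the UDP header length field
needs rewriting per segment. The struct and helper names are illustrative
stand-ins, not kernel API:

```c
#include <stdint.h>

struct udphdr_min {	/* minimal stand-in for the kernel's struct udphdr */
	uint16_t source;
	uint16_t dest;
	uint16_t len;	/* UDP length: header plus payload */
	uint16_t check;
};

/* Each segment's UDP length is the gso_size payload plus the 8-byte
 * UDP header, mirroring the mlx5e_udp_gso_handle_tx_skb() fixup. */
static uint16_t udp_seg_len(uint16_t gso_size)
{
	return (uint16_t)(gso_size + sizeof(struct udphdr_min));
}
```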

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h|  27 +++--
 .../mellanox/mlx5/core/en_accel/rxtx.c| 109 --
 .../mellanox/mlx5/core/en_accel/rxtx.h|  14 ---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   9 +-
 5 files changed, 23 insertions(+), 140 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 55d5a5c2e9d8..fa7fcca5dc78 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -14,8 +14,8 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o 
fpga/conn.o fpga/sdk.o \
fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
-   en_tx.o en_rx.o en_dim.o en_txrx.o en_accel/rxtx.o en_stats.o  \
-   vxlan.o en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
+   en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o  \
+   en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
 
 mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 39a5d13ba459..1dd225380a66 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -38,14 +38,22 @@
 #include 
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
-#include "en_accel/rxtx.h"
 #include "en.h"
 
-static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
-   struct mlx5e_txqsq *sq,
-   struct net_device *dev,
-   struct mlx5e_tx_wqe **wqe,
-   u16 *pi)
+static inline void
+mlx5e_udp_gso_handle_tx_skb(struct sk_buff *skb)
+{
+   int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+
+   udp_hdr(skb)->len = htons(payload_len);
+}
+
+static inline struct sk_buff *
+mlx5e_accel_handle_tx(struct sk_buff *skb,
+ struct mlx5e_txqsq *sq,
+ struct net_device *dev,
+ struct mlx5e_tx_wqe **wqe,
+ u16 *pi)
 {
 #ifdef CONFIG_MLX5_EN_TLS
if (test_bit(MLX5E_SQ_STATE_TLS, &sq->state)) {
@@ -63,11 +71,8 @@ static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
}
 #endif
 
-   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
-   skb = mlx5e_udp_gso_handle_tx_skb(dev, sq, skb, wqe, pi);
-   if (unlikely(!skb))
-   return NULL;
-   }
+   if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+   mlx5e_udp_gso_handle_tx_skb(skb);
 
return skb;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
deleted file mode 100644
index 7b7ec3998e84..
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
+++ /dev/null
@@ -1,109 +0,0 @@
-#include "en_accel/rxtx.h"
-
-static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
-  struct sk_buff *nskb,
-  int remaining)
-{
-   int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
-   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
-   int payload_len = remaining + sizeof(struct udphdr);
-   int k = 0, i, j;
-
-   skb_copy_bits(skb, 0, nskb->data, headlen);
-   nskb->dev = skb->dev;
-   skb_reset_mac_header(nskb);
-   skb_set_network_header(nskb, skb_network_offset(skb));
-   skb_set_transport_header(nskb, skb_transport_offset(skb));
-   skb_set_tail_pointer(nskb, headlen);
-
-   /* How many frags do we need? */
-   for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
-   bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
-   k++;
-   if (bytes_needed <= 0)
-   break;
-   }
-
-   /* Fill the first frag and split it if necessary */
-   j = skb_shinfo(skb)->nr_frags - k;
-   remaining_page_offset = -bytes_needed;
-   skb_fill_page_desc(nskb, 0,
- 

[net-next V2 04/12] net/mlx5: FW tracer, events handling

2018-07-23 Thread Saeed Mahameed
From: Feras Daoud 

The tracer has one event, event 0x26, with two subtypes:
- Subtype 0: Ownership change
- Subtype 1: Traces available

An ownership change occurs in the following cases:
1- The owner releases its ownership; in this case, an event will be
sent to inform others to reattempt to acquire ownership.
2- Ownership was taken by a higher priority tool; in this case
the previous owner should understand that it lost ownership, and go
through the teardown flow.

The second subtype indicates that there are traces in the trace buffer;
in this case, the driver polls the tracer buffer for new traces, parses
them and prepares the messages for printing.

The HW starts tracing from the first address in the tracer buffer.
The driver receives an event notifying it that a new trace block exists.
HW posts a timestamp event in the last 8B of every 256B block.
Comparing that timestamp to the last handled timestamp indicates
that this is a new trace block. Once the new timestamp is detected,
the entire block is considered valid.

Block validation and parsing should be done after copying the current
block to a different location, in order to avoid the block being
overwritten during processing.
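
The polling scheme described above can be sketched in plain C; the
constants and helper names below are illustrative, not driver API:

```c
#include <stdint.h>
#include <string.h>

#define TRACER_BLOCK_SIZE 256				/* bytes per trace block */
#define TRACES_PER_BLOCK  (TRACER_BLOCK_SIZE / 8)	/* 32 traces of 8B each */

/* HW posts a timestamp into the last 8B slot of each 256B block; the
 * block is complete once that slot differs from the last timestamp the
 * driver already handled. */
static int block_is_new(const uint64_t block[TRACES_PER_BLOCK],
			uint64_t last_timestamp)
{
	return block[TRACES_PER_BLOCK - 1] != last_timestamp;
}

/* Copy the block aside before validating and parsing it, so the HW
 * cannot overwrite it during processing. */
static void copy_block(uint64_t dst[TRACES_PER_BLOCK],
		       const uint64_t src[TRACES_PER_BLOCK])
{
	memcpy(dst, src, TRACER_BLOCK_SIZE);
}
```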

Signed-off-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 268 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  71 -
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  11 +
 include/linux/mlx5/device.h   |   7 +
 4 files changed, 347 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index d6cc27b0ff34..bd887d1d3396 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -318,25 +318,244 @@ static void mlx5_tracer_read_strings_db(struct work_struct *work)
return;
 }
 
-static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
+static void mlx5_fw_tracer_arm(struct mlx5_core_dev *dev)
 {
-   struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
-ownership_change_work);
-   struct mlx5_core_dev *dev = tracer->dev;
+   u32 out[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
+   u32 in[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
int err;
 
-   if (tracer->owner) {
-   mlx5_fw_tracer_ownership_release(tracer);
+   MLX5_SET(mtrc_ctrl, in, arm_event, 1);
+
+   err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+  MLX5_REG_MTRC_CTRL, 0, 1);
+   if (err)
+   mlx5_core_warn(dev, "FWTracer: Failed to arm tracer event %d\n", err);
+}
+
+static void poll_trace(struct mlx5_fw_tracer *tracer,
+  struct tracer_event *tracer_event, u64 *trace)
+{
+   u32 timestamp_low, timestamp_mid, timestamp_high, urts;
+
+   tracer_event->event_id = MLX5_GET(tracer_event, trace, event_id);
+   tracer_event->lost_event = MLX5_GET(tracer_event, trace, lost);
+
+   switch (tracer_event->event_id) {
+   case TRACER_EVENT_TYPE_TIMESTAMP:
+   tracer_event->type = TRACER_EVENT_TYPE_TIMESTAMP;
+   urts = MLX5_GET(tracer_timestamp_event, trace, urts);
+   if (tracer->trc_ver == 0)
+   tracer_event->timestamp_event.unreliable = !!(urts >> 2);
+   else
+   tracer_event->timestamp_event.unreliable = !!(urts & 1);
+
+   timestamp_low = MLX5_GET(tracer_timestamp_event,
+trace, timestamp7_0);
+   timestamp_mid = MLX5_GET(tracer_timestamp_event,
+trace, timestamp39_8);
+   timestamp_high = MLX5_GET(tracer_timestamp_event,
+ trace, timestamp52_40);
+
+   tracer_event->timestamp_event.timestamp =
+   ((u64)timestamp_high << 40) |
+   ((u64)timestamp_mid << 8) |
+   (u64)timestamp_low;
+   break;
+   default:
+   if (tracer_event->event_id >= tracer->str_db.first_string_trace ||
+   tracer_event->event_id <= tracer->str_db.first_string_trace +
+ tracer->str_db.num_string_trace) {
+   tracer_event->type = TRACER_EVENT_TYPE_STRING;
+   tracer_event->string_event.timestamp =
+   MLX5_GET(tracer_string_event, trace, timestamp);
+   tracer_event->string_event.string_param =
+   MLX5_GET(tracer_string_event, trace, 
string_param);
+   tracer_event->string_event.tmsn =
+   MLX5_GET(tracer_string_event, trace, tmsn);
+

[net-next V2 07/12] net/mlx5: FW tracer, Add debug prints

2018-07-23 Thread Saeed Mahameed
Signed-off-by: Saeed Mahameed 
---
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c| 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 309842de272c..d4ec93bde4de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -629,14 +629,14 @@ static void mlx5_fw_tracer_handle_traces(struct work_struct *work)
u64 block_timestamp, last_block_timestamp, tmp_trace_block[TRACES_PER_BLOCK];
u32 block_count, start_offset, prev_start_offset, prev_consumer_index;
u32 trace_event_size = MLX5_ST_SZ_BYTES(tracer_event);
+   struct mlx5_core_dev *dev = tracer->dev;
struct tracer_event tracer_event;
-   struct mlx5_core_dev *dev;
int i;
 
+   mlx5_core_dbg(dev, "FWTracer: Handle Trace event, owner=(%d)\n", tracer->owner);
if (!tracer->owner)
return;
 
-   dev = tracer->dev;
block_count = tracer->buff.size / TRACER_BLOCK_SIZE_BYTE;
start_offset = tracer->buff.consumer_index * TRACER_BLOCK_SIZE_BYTE;
 
@@ -762,6 +762,7 @@ static int mlx5_fw_tracer_start(struct mlx5_fw_tracer *tracer)
goto release_ownership;
}
 
+   mlx5_core_dbg(dev, "FWTracer: Ownership granted and active\n");
return 0;
 
 release_ownership:
@@ -774,6 +775,7 @@ static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
struct mlx5_fw_tracer *tracer =
container_of(work, struct mlx5_fw_tracer, ownership_change_work);
 
+   mlx5_core_dbg(tracer->dev, "FWTracer: ownership changed, current=(%d)\n", tracer->owner);
if (tracer->owner) {
tracer->owner = false;
tracer->buff.consumer_index = 0;
@@ -830,6 +832,8 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
goto free_log_buf;
}
 
+   mlx5_core_dbg(dev, "FWTracer: Tracer created\n");
+
return tracer;
 
 free_log_buf:
@@ -887,6 +891,9 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
if (IS_ERR_OR_NULL(tracer))
return;
 
+   mlx5_core_dbg(tracer->dev, "FWTracer: Cleanup, is owner ? (%d)\n",
+ tracer->owner);
+
cancel_work_sync(&tracer->ownership_change_work);
cancel_work_sync(&tracer->handle_traces_work);
 
@@ -903,6 +910,8 @@ void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
if (IS_ERR_OR_NULL(tracer))
return;
 
+   mlx5_core_dbg(tracer->dev, "FWTracer: Destroy\n");
+
cancel_work_sync(&tracer->read_fw_strings_work);
mlx5_fw_tracer_clean_ready_list(tracer);
mlx5_fw_tracer_clean_print_hash(tracer);
-- 
2.17.0



[net-next V2 02/12] net/mlx5: FW tracer, create trace buffer and copy strings database

2018-07-23 Thread Saeed Mahameed
From: Feras Daoud 

For each PF do the following:
1- Allocate memory for the tracer strings database and read the
strings from the FW to the SW. These strings will be used later for
parsing traces.
2- Allocate and dma map tracer buffers.

Traces that will be written into the buffer will be parsed as a group
of one or more traces, referred to as a trace message. The trace message
represents a C-like printf string.
The first trace of a message holds the pointer to the correct string in
the strings database. The following traces hold the variables of the
message.
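
The message-assembly rule above can be sketched in plain C: a message is
complete once the first (string-pointer) trace has been followed by one
parameter trace per '%' conversion in the format string fetched from the
strings database. count_params() below is an illustrative helper, not
driver API:

```c
#include <string.h>

/* Count '%' conversions in a format string. The driver first rewrites
 * "%llx" to "%x%x", since each 64-bit value arrives as two 32-bit
 * parameter traces. */
static int count_params(const char *fmt)
{
	int n = 0;
	const char *p = strchr(fmt, '%');

	while (p) {
		n++;
		p = strchr(p + 1, '%');
	}
	return n;
}
```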

Signed-off-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 209 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  18 ++
 2 files changed, 224 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 3ecbf06b4d71..35107b8f76df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -119,6 +119,163 @@ static void mlx5_fw_tracer_ownership_release(struct mlx5_fw_tracer *tracer)
tracer->owner = false;
 }
 
+static int mlx5_fw_tracer_create_log_buf(struct mlx5_fw_tracer *tracer)
+{
+   struct mlx5_core_dev *dev = tracer->dev;
+   struct device *ddev = &dev->pdev->dev;
+   dma_addr_t dma;
+   void *buff;
+   gfp_t gfp;
+   int err;
+
+   tracer->buff.size = TRACE_BUFFER_SIZE_BYTE;
+
+   gfp = GFP_KERNEL | __GFP_ZERO;
+   buff = (void *)__get_free_pages(gfp,
+   get_order(tracer->buff.size));
+   if (!buff) {
+   err = -ENOMEM;
+   mlx5_core_warn(dev, "FWTracer: Failed to allocate pages, %d\n", err);
+   return err;
+   }
+   tracer->buff.log_buf = buff;
+
+   dma = dma_map_single(ddev, buff, tracer->buff.size, DMA_FROM_DEVICE);
+   if (dma_mapping_error(ddev, dma)) {
+   mlx5_core_warn(dev, "FWTracer: Unable to map DMA: %d\n",
+  dma_mapping_error(ddev, dma));
+   err = -ENOMEM;
+   goto free_pages;
+   }
+   tracer->buff.dma = dma;
+
+   return 0;
+
+free_pages:
+   free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
+
+   return err;
+}
+
+static void mlx5_fw_tracer_destroy_log_buf(struct mlx5_fw_tracer *tracer)
+{
+   struct mlx5_core_dev *dev = tracer->dev;
+   struct device *ddev = &dev->pdev->dev;
+
+   if (!tracer->buff.log_buf)
+   return;
+
+   dma_unmap_single(ddev, tracer->buff.dma, tracer->buff.size, DMA_FROM_DEVICE);
+   free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
+}
+
+static void mlx5_fw_tracer_free_strings_db(struct mlx5_fw_tracer *tracer)
+{
+   u32 num_string_db = tracer->str_db.num_string_db;
+   int i;
+
+   for (i = 0; i < num_string_db; i++) {
+   kfree(tracer->str_db.buffer[i]);
+   tracer->str_db.buffer[i] = NULL;
+   }
+}
+
+static int mlx5_fw_tracer_allocate_strings_db(struct mlx5_fw_tracer *tracer)
+{
+   u32 *string_db_size_out = tracer->str_db.size_out;
+   u32 num_string_db = tracer->str_db.num_string_db;
+   int i;
+
+   for (i = 0; i < num_string_db; i++) {
+   tracer->str_db.buffer[i] = kzalloc(string_db_size_out[i], GFP_KERNEL);
+   if (!tracer->str_db.buffer[i])
+   goto free_strings_db;
+   }
+
+   return 0;
+
+free_strings_db:
+   mlx5_fw_tracer_free_strings_db(tracer);
+   return -ENOMEM;
+}
+
+static void mlx5_tracer_read_strings_db(struct work_struct *work)
+{
+   struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
+read_fw_strings_work);
+   u32 num_of_reads, num_string_db = tracer->str_db.num_string_db;
+   struct mlx5_core_dev *dev = tracer->dev;
+   u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+   u32 leftovers, offset;
+   int err = 0, i, j;
+   u32 *out, outlen;
+   void *out_value;
+
+   outlen = MLX5_ST_SZ_BYTES(mtrc_stdb) + STRINGS_DB_READ_SIZE_BYTES;
+   out = kzalloc(outlen, GFP_KERNEL);
+   if (!out) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   for (i = 0; i < num_string_db; i++) {
+   offset = 0;
+   MLX5_SET(mtrc_stdb, in, string_db_index, i);
+   num_of_reads = tracer->str_db.size_out[i] /
+   STRINGS_DB_READ_SIZE_BYTES;
+   leftovers = (tracer->str_db.size_out[i] %
+   STRINGS_DB_READ_SIZE_BYTES) /
+   STRINGS_DB_LEFTOVER_SIZE_BYTES;
+
+   MLX5_SET(mtrc_stdb, in, read_size, STRINGS_DB_READ_SIZE_BYTES);
+   for (j = 0; j <

[net-next V2 01/12] net/mlx5: FW tracer, implement tracer logic

2018-07-23 Thread Saeed Mahameed
From: Feras Daoud 

Implement FW tracer logic and registers access, initialization and
cleanup flows.

Initializing the tracer will be part of load one flow, as multiple
PFs will try to acquire ownership but only one will succeed and will
be the tracer owner.

Signed-off-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 196 ++
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  66 ++
 include/linux/mlx5/driver.h   |   3 +
 3 files changed, 265 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
new file mode 100644
index ..3ecbf06b4d71
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -0,0 +1,196 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "fw_tracer.h"
+
+static int mlx5_query_mtrc_caps(struct mlx5_fw_tracer *tracer)
+{
+   u32 *string_db_base_address_out = tracer->str_db.base_address_out;
+   u32 *string_db_size_out = tracer->str_db.size_out;
+   struct mlx5_core_dev *dev = tracer->dev;
+   u32 out[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+   u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+   void *mtrc_cap_sp;
+   int err, i;
+
+   err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+  MLX5_REG_MTRC_CAP, 0, 0);
+   if (err) {
+   mlx5_core_warn(dev, "FWTracer: Error reading tracer caps %d\n",
+  err);
+   return err;
+   }
+
+   if (!MLX5_GET(mtrc_cap, out, trace_to_memory)) {
+   mlx5_core_dbg(dev, "FWTracer: Device does not support logging traces to memory\n");
+   return -ENOTSUPP;
+   }
+
+   tracer->trc_ver = MLX5_GET(mtrc_cap, out, trc_ver);
+   tracer->str_db.first_string_trace =
+   MLX5_GET(mtrc_cap, out, first_string_trace);
+   tracer->str_db.num_string_trace =
+   MLX5_GET(mtrc_cap, out, num_string_trace);
+   tracer->str_db.num_string_db = MLX5_GET(mtrc_cap, out, num_string_db);
+   tracer->owner = !!MLX5_GET(mtrc_cap, out, trace_owner);
+
+   for (i = 0; i < tracer->str_db.num_string_db; i++) {
+   mtrc_cap_sp = MLX5_ADDR_OF(mtrc_cap, out, string_db_param[i]);
+   string_db_base_address_out[i] = MLX5_GET(mtrc_string_db_param,
+mtrc_cap_sp,
+string_db_base_address);
+   string_db_size_out[i] = MLX5_GET(mtrc_string_db_param,
+mtrc_cap_sp, string_db_size);
+   }
+
+   return err;
+}
+
+static int mlx5_set_mtrc_caps_trace_owner(struct mlx5_fw_tracer *tracer,
+ u32 *out, u32 out_size,
+ u8 trace_owner)
+{
+   struct mlx5_core_dev *dev = tracer->dev;
+   u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+
+   MLX5_SET(mtrc_cap, in, trace_owner, trace_owner);
+
+   return mlx5_core_access_reg(dev, in, sizeof(in), out, out_size,
+   MLX5_REG_MTRC_CAP, 0, 1);
+}
+
+static int mlx5_fw_tracer_ownership_acquire(struct mlx5_fw_tracer *tracer)
+{
+   struct mlx5_core_dev *dev = tracer->dev;
+   

[pull request][net-next V2 00/12] Mellanox, mlx5e updates 2018-07-18

2018-07-23 Thread Saeed Mahameed
Hi Dave,

This series includes updates for mlx5e net device driver, with a couple
of major features and some misc updates.

Please notice the mlx5-next merge patch at the beginning:
"Merge branch 'mlx5-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux"

For more information please see tag log below.

Please pull and let me know if there's any problem.

v1->v2:
- Dropped "Support PCIe buffer congestion handling via Devlink" patches until
the comments are addressed.

Thanks,
Saeed.

---

The following changes since commit 7854ac44fe86548f8a6c6001938a1a2593b255e4:

  Merge branch 'mlx5-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2018-07-23 
14:58:46 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-updates-2018-07-18-v2

for you to fetch changes up to 3f44899ef2ce0c9da49feb0d6f08098a08cb96ae:

  net/mlx5e: Use PARTIAL_GSO for UDP segmentation (2018-07-23 15:01:11 -0700)


mlx5e-updates-2018-07-18

This series includes update for mlx5e net device driver.

1) From Feras Daoud, added support for firmware log tracing,
first by introducing the firmware API needed for the task, and then,
for each PF, doing the following:
1- Allocate memory for the tracer strings database and read it from the FW
to the SW.
2- Allocate and dma map tracer buffers.

Traces that will be written into the buffer will be parsed as a group
of one or more traces, referred to as a trace message. The trace message
represents a C-like printf string.
Once a new trace is available, FW will generate an event indicating that
new trace/s are available, and the driver will parse them and dump them
using tracepoint event tracing.

Enable mlx5 fw tracing by:
echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable

Read traces by:
cat /sys/kernel/debug/tracing/trace

2) From Roi Dayan, Remove redundant WARN when we cannot find neigh entry

3) From Jianbo Liu, TC double vlan support
- Support offloading tc double vlan headers match
- Support offloading double vlan push/pop tc actions

4) From Boris, re-visit UDP GSO, remove the splitting of UDP_GSO_L4 packets
in the driver, and exposes UDP_GSO_L4 as a PARTIAL_GSO feature.


Boris Pismenny (1):
  net/mlx5e: Use PARTIAL_GSO for UDP segmentation

Feras Daoud (5):
  net/mlx5: FW tracer, implement tracer logic
  net/mlx5: FW tracer, create trace buffer and copy strings database
  net/mlx5: FW tracer, events handling
  net/mlx5: FW tracer, parse traces and kernel tracing support
  net/mlx5: FW tracer, Enable tracing

Jianbo Liu (3):
  net/mlx5e: Support offloading tc double vlan headers match
  net/mlx5e: Refactor tc vlan push/pop actions offloading
  net/mlx5e: Support offloading double vlan push/pop tc actions

Roi Dayan (1):
  net/mlx5e: Remove redundant WARN when we cannot find neigh entry

Saeed Mahameed (2):
  net/mlx5: FW tracer, register log buffer memory key
  net/mlx5: FW tracer, Add debug prints

 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 947 +
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.h   | 175 
 .../mellanox/mlx5/core/diag/fw_tracer_tracepoint.h |  78 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h |  27 +-
 .../ethernet/mellanox/mlx5/core/en_accel/rxtx.c| 109 ---
 .../ethernet/mellanox/mlx5/core/en_accel/rxtx.h|  14 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 134 ++-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  21 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  23 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  18 +-
 include/linux/mlx5/device.h|   7 +
 include/linux/mlx5/driver.h|   3 +
 15 files changed, 1399 insertions(+), 183 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
 create mode 100644 
drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h


[net-next V2 03/12] net/mlx5: FW tracer, register log buffer memory key

2018-07-23 Thread Saeed Mahameed
Create a memory key and protection domain for the tracer log buffer.

Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 64 ++-
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 35107b8f76df..d6cc27b0ff34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -169,6 +169,48 @@ static void mlx5_fw_tracer_destroy_log_buf(struct mlx5_fw_tracer *tracer)
 free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
 }
 
 
+static int mlx5_fw_tracer_create_mkey(struct mlx5_fw_tracer *tracer)
+{
+   struct mlx5_core_dev *dev = tracer->dev;
+   int err, inlen, i;
+   __be64 *mtt;
+   void *mkc;
+   u32 *in;
+
+   inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+   sizeof(*mtt) * round_up(TRACER_BUFFER_PAGE_NUM, 2);
+
+   in = kvzalloc(inlen, GFP_KERNEL);
+   if (!in)
+   return -ENOMEM;
+
+   MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+DIV_ROUND_UP(TRACER_BUFFER_PAGE_NUM, 2));
+   mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+   for (i = 0 ; i < TRACER_BUFFER_PAGE_NUM ; i++)
+   mtt[i] = cpu_to_be64(tracer->buff.dma + i * PAGE_SIZE);
+
+   mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+   MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+   MLX5_SET(mkc, mkc, lr, 1);
+   MLX5_SET(mkc, mkc, lw, 1);
+   MLX5_SET(mkc, mkc, pd, tracer->buff.pdn);
+   MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+   MLX5_SET(mkc, mkc, qpn, 0xff);
+   MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+   MLX5_SET(mkc, mkc, translations_octword_size,
+DIV_ROUND_UP(TRACER_BUFFER_PAGE_NUM, 2));
+   MLX5_SET64(mkc, mkc, start_addr, tracer->buff.dma);
+   MLX5_SET64(mkc, mkc, len, tracer->buff.size);
+   err = mlx5_core_create_mkey(dev, &tracer->buff.mkey, in, inlen);
+   if (err)
+   mlx5_core_warn(dev, "FWTracer: Failed to create mkey, %d\n", err);
+
+   kvfree(in);
+
+   return err;
+}
+
 static void mlx5_fw_tracer_free_strings_db(struct mlx5_fw_tracer *tracer)
 {
u32 num_string_db = tracer->str_db.num_string_db;
@@ -363,13 +405,26 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
if (!tracer->str_db.loaded)
queue_work(tracer->work_queue, &tracer->read_fw_strings_work);
 
-   err = mlx5_fw_tracer_ownership_acquire(tracer);
+   err = mlx5_core_alloc_pd(dev, &tracer->buff.pdn);
if (err) {
-   mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
-   return 0; /* return 0 since ownership can be acquired on a later FW event */
+   mlx5_core_warn(dev, "FWTracer: Failed to allocate PD %d\n", err);
+   return err;
}
 
+   err = mlx5_fw_tracer_create_mkey(tracer);
+   if (err) {
+   mlx5_core_warn(dev, "FWTracer: Failed to create mkey %d\n", err);
+   goto err_dealloc_pd;
+   }
+
+   err = mlx5_fw_tracer_ownership_acquire(tracer);
+   if (err) /* Don't fail since ownership can be acquired on a later FW event */
+   mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+
return 0;
+err_dealloc_pd:
+   mlx5_core_dealloc_pd(dev, tracer->buff.pdn);
+   return err;
 }
 
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
@@ -381,6 +436,9 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 
if (tracer->owner)
mlx5_fw_tracer_ownership_release(tracer);
+
+   mlx5_core_destroy_mkey(tracer->dev, &tracer->buff.mkey);
+   mlx5_core_dealloc_pd(tracer->dev, tracer->buff.pdn);
 }
 
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
-- 
2.17.0



Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge

2018-07-23 Thread David Ahern
On 7/20/18 1:03 PM, D'Souza, Nelson wrote:
> Setup is as follows:
> 
> ethUSB(ingress port) -> mgmtbr0 (bridge) -> mgmtvrf (vrf)



                        |  netns foo
 [ test-vrf ]           |
      |                 |
   [ br0 ] 172.16.1.1   |
      |                 |
   [ veth1 ] ===========|=== [ veth2 ]        lo
                        |    172.16.1.2       172.16.2.2
                        |


Copy and paste the following into your environment:

ip netns add foo
ip li add veth1 type veth peer name veth2
ip li set veth2 netns foo

ip -netns foo li set lo up
ip -netns foo li set veth2 up
ip -netns foo addr add 172.16.1.2/24 dev veth2


ip li add test-vrf type vrf table 123
ip li set test-vrf up
ip ro add vrf test-vrf unreachable default

ip li add  br0 type bridge
ip li set veth1 master br0
ip li set veth1 up
ip li set br0 up
ip addr add dev br0 172.16.1.1/24
ip li set br0 master test-vrf

ip -netns foo addr add 172.16.2.2/32 dev lo
ip ro add vrf test-vrf 172.16.2.2/32 via 172.16.1.2

Does ping work?
# ping -I test-vrf 172.16.2.2
ping: Warning: source address might be selected on device other than
test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.228 ms
64 bytes from 172.16.2.2: icmp_seq=2 ttl=64 time=0.263 ms

and:
# ping -I br0 172.16.2.2
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 br0: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.227 ms
64 bytes from 172.16.2.2: icmp_seq=2 ttl=64 time=0.223 ms
^C
--- 172.16.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.223/0.225/0.227/0.002 ms


Re: [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18

2018-07-23 Thread Saeed Mahameed
On Wed, 2018-07-18 at 18:00 -0700, Saeed Mahameed wrote:
> Hi dave,
> 
> This series includes updates for mlx5e net device driver, with a
> couple
> of major features and some misc updates.
> 
> Please notice the mlx5-next merge patch at the beginning:
> "Merge branch 'mlx5-next' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux"
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there's any problem.
> 

I will re-post v2 without the "Support PCIe buffer congestion handling
via Devlink" patches until Eran sorts out the review comments.

Thanks,
Saeed.


> Thanks,
> Saeed.
> 
> --- 
> 
> The following changes since commit
> 681d5d071c8bd5533a14244c0d55d1c0e30aa989:
> 
>   Merge branch 'mlx5-next' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2018-
> 07-18 15:53:31 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
> tags/mlx5e-updates-2018-07-18
> 
> for you to fetch changes up to
> a0ba57c09676689eb35f13d48990c9674c9baad4:
> 
>   net/mlx5e: Use PARTIAL_GSO for UDP segmentation (2018-07-18
> 17:26:28 -0700)
> 
> 
> mlx5e-updates-2018-07-18
> 
> This series includes update for mlx5e net device driver.
> 
> 1) From Feras Daoud, Added the support for firmware log tracing,
> first by introducing the firmware API needed for the task and then
> For each PF do the following:
> 1- Allocate memory for the tracer strings database and read it
> from the FW to the SW.
> 2- Allocate and dma map tracer buffers.
> 
> Traces that will be written into the buffer will be parsed as a
> group
> of one or more traces, referred to as trace message. The trace
> message
> represents a C-like printf string.
> Once a new trace is available  FW will generate an event indicates
> new trace/s are
> available and the driver will parse them and dump them using
> tracepoints
> event tracing
> 
> Enable mlx5 fw tracing by:
> echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable
> 
> Read traces by:
> cat /sys/kernel/debug/tracing/trace
> 
> 2) From Eran Ben Elisha, Support PCIe buffer congestion handling
> via Devlink, using the new devlink device parameters API, added the
> new
> parameters:
>  - Congestion action
> HW mechanism in the PCIe buffer which monitors the amount
> of
> consumed PCIe buffer per host.  This mechanism supports
> the
> following actions in case of threshold overflow:
> - Disabled - NOP (Default)
> - Drop
> - Mark - Mark CE bit in the CQE of received packet
> - Congestion mode
> - Aggressive - Aggressive static trigger threshold
> (Default)
> - Dynamic - Dynamically change the trigger threshold
> 
> 3) From Natali, Set ECN for received packets using CQE indication.
> Using Eran's congestion settings a user can enable ECN marking, on
> such case
> driver must update ECN CE IP fields when requested by firmware
> (congestion is sensed).
> 
> 4) From Roi Dayan, Remove redundant WARN when we cannot find neigh
> entry
> 
> 5) From Jianbo Liu, TC double vlan support
> - Support offloading tc double vlan headers match
> - Support offloading double vlan push/pop tc actions
> 
> 6) From Boris, re-visit UDP GSO, remove the splitting of UDP_GSO_L4
> packets
> in the driver, and exposes UDP_GSO_L4 as a PARTIAL_GSO feature.
> 
> 
> Boris Pismenny (1):
>   net/mlx5e: Use PARTIAL_GSO for UDP segmentation
> 
> Eran Ben Elisha (3):
>   net/mlx5: Move all devlink related functions calls to devlink.c
>   net/mlx5: Add MPEGC register configuration functionality
>   net/mlx5: Support PCIe buffer congestion handling via Devlink
> 
> Feras Daoud (5):
>   net/mlx5: FW tracer, implement tracer logic
>   net/mlx5: FW tracer, create trace buffer and copy strings
> database
>   net/mlx5: FW tracer, events handling
>   net/mlx5: FW tracer, parse traces and kernel tracing support
>   net/mlx5: FW tracer, Enable tracing
> 
> Jianbo Liu (3):
>   net/mlx5e: Support offloading tc double vlan headers match
>   net/mlx5e: Refactor tc vlan push/pop actions offloading
>   net/mlx5e: Support offloading double vlan push/pop tc actions
> 
> Natali Shechtman (1):
>   net/mlx5e: Set ECN for received packets using CQE indication
> 
> Roi Dayan (1):
>   net/mlx5e: Remove redundant WARN when we cannot find neigh
> entry
> 
> Saeed Mahameed (2):
>   net/mlx5: FW tracer, register log buffer memory key
>   net/mlx5: FW tracer, Add debug prints
> 
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.c  | 267 ++
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  41 +
>  .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 947
> ++

Re: [pull request][net 0/8] Mellanox, mlx5 fixes 2018-07-18

2018-07-23 Thread Saeed Mahameed
On Sat, 2018-07-21 at 10:20 -0700, David Miller wrote:
> From: Saeed Mahameed 
> Date: Wed, 18 Jul 2018 18:26:04 -0700
> 
> > The following series provides fixes to mlx5 core and net device
> > driver.
> > 
> > Please pull and let me know if there's any problem.
> 
> Pulled, thanks Saeed.
> 
> Based upon the thread with Or, it would be useful to do some auditing
> and make sure all tunnels set skb->encapsulation.
> 

Thanks Dave, Eran's patch relies on our driver setting
skb->encapsulation, since the rps rule is injected directly from the device
rx path once netif_receive_skb_internal is called via the gro stack way
before forwarding the skb to the tunnel netdev.

> > For -stable v4.7
> > net/mlx5e: Don't allow aRFS for encapsulated packets
> > net/mlx5e: Fix quota counting in aRFS expire flow
> > 
> > For -stable v4.15
> > net/mlx5e: Only allow offloading decap egress (egdev) flows
> > net/mlx5e: Refine ets validation function
> > net/mlx5: Adjust clock overflow work period
> > 
> > For -stable v4.17
> > net/mlx5: E-Switch, UBSAN fix undefined behavior in
> > mlx5_eswitch_mode
> 
> Queued up.
> 
> Thanks.

Re: [PATCH net-next 3/4] net/tc: introduce TC_ACT_MIRRED.

2018-07-23 Thread Cong Wang
On Fri, Jul 20, 2018 at 2:54 AM Paolo Abeni  wrote:
>
> Hi,
>
> Jiri, Cong, thank you for the feedback. Please allow me to give a
> > single reply to both of you, as you raised similar concerns.
>
> On Thu, 2018-07-19 at 11:07 -0700, Cong Wang wrote:
> > On Thu, Jul 19, 2018 at 6:03 AM Paolo Abeni  wrote:
> > >
> > > This is similar TC_ACT_REDIRECT, but with a slightly different
> > > semantic:
> > > - on ingress the mirred skbs are passed to the target device
> > > network stack without any additional check nor scrubbing.
> > > - the rcu-protected stats provided via the tcf_result struct
> > >   are updated on error conditions.
> >
> > At least its name sucks, it means to skip the skb_clone(),
> > that is avoid a copy, but you still call it MIRRED...
> >
> > MIRRED means MIRror and REDirect.
>
> I was not satisfied with the name, too, but I also wanted to collect
> some feedback, as the different time zones are not helping here.
>
> Would TC_ACT_REINJECT be a better choice? (renaming skb_tc_redirect as
> skb_tc_reinject, too). Do you have some better name?


Any name not implying a copy is better. I don't worry about it, please
see below.


> > Also, I don't understand why this new TC_ACT code needs
> > to be visible to user-space, whether to clone or not is purely
> > internal.
>
> Note this is what already happens with TC_ACT_REDIRECT: currently the
> user space uses it freely, even if only {cls,act}_bpf can return such
> value in a meaningful way, and only from the ingress and the egress
> hooks.
>

Yes, my question is why do we give user such a freedom?

In other words, what do you want users to choose here? To scrub or not
to scrub? To clone or not to clone?

From my understanding of your whole patchset, your goal is to get rid
of clone, and users definitely don't care about clone or not clone for
redirections, this is why I insist it doesn't need to be visible to user.

If your goal is not just skipping clone, but also, let's say, scrub or not
scrub, then it should be visible to users. However, I don't see why
users care about scrub or not, they have to understand what scrub
is at least, it is a purely kernel-internal behavior.


> I think we can add a clear separation between the values accessible
> from user-space, and the ones used interanally by the kernel, with
> something like the code below (basically unknown actions are explicitly
> mapped to TC_ACT_UNSPEC), WDYT?
>
> Note: as TC_ACT_REDIRECT is already part of the uAPI, it will remain
> accessible from user-space, so patch 1/4 would be still needed.

I think that is doable too, but we should understand whether we
need to do it or not at first.

Thanks.


Re: [PATCH v4 net-next 0/8] lan743x: Add features to lan743x driver

2018-07-23 Thread David Miller
From: Bryan Whitehead 
Date: Mon, 23 Jul 2018 16:16:25 -0400

> This patch series adds extra features to the lan743x driver.

Series applied, thank you.


Re: [PATCH mlx5-next v2 2/8] net/mlx5: Add support for flow table destination number

2018-07-23 Thread Saeed Mahameed
On Mon, 2018-07-23 at 15:25 +0300, Leon Romanovsky wrote:
> From: Yishai Hadas 
> 
> Add support to set a destination from a flow table number.
> This functionality will be used in downstream patches from this
> series by the DEVX stuff.
> 
> Signed-off-by: Yishai Hadas 
> Signed-off-by: Leon Romanovsky 
> 

Acked-by: Saeed Mahameed 

> ---
>  .../mellanox/mlx5/core/diag/fs_tracepoint.c|  3 +++
>  drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 24
> ++
>  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  4 +++-
>  include/linux/mlx5/fs.h|  1 +
>  include/linux/mlx5/mlx5_ifc.h  |  1 +
>  5 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git
> a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
> b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
> index b3820a34e773..0f11fff32a9b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
> @@ -240,6 +240,9 @@ const char *parse_fs_dst(struct trace_seq *p,
>   case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE:
>   trace_seq_printf(p, "ft=%p\n", dst->ft);
>   break;
> + case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE_NUM:
> + trace_seq_printf(p, "ft_num=%u\n", dst->ft_num);
> + break;
>   case MLX5_FLOW_DESTINATION_TYPE_TIR:
>   trace_seq_printf(p, "tir=%u\n", dst->tir_num);
>   break;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> index 5a00deff5457..910d25f84f2f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> @@ -362,18 +362,20 @@ static int mlx5_cmd_set_fte(struct
> mlx5_core_dev *dev,
>   int list_size = 0;
>  
>   list_for_each_entry(dst, &fte->node.children,
> node.list) {
> - unsigned int id;
> + unsigned int id, type = dst->dest_attr.type;
>  
> - if (dst->dest_attr.type == MLX5_FLOW_DESTINATION_TYPE_COUNTER)
> + if (type == MLX5_FLOW_DESTINATION_TYPE_COUNTER)
>   continue;
>  
> - MLX5_SET(dest_format_struct, in_dests, destination_type,
> -  dst->dest_attr.type);
> - if (dst->dest_attr.type ==
> - MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE) {
> + switch (type) {
> + case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE_NUM:
> + id = dst->dest_attr.ft_num;
> + type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
> + break;
> + case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE:
>   id = dst->dest_attr.ft->id;
> - } else if (dst->dest_attr.type == MLX5_FLOW_DESTINATION_TYPE_VPORT) {
> + break;
> + case MLX5_FLOW_DESTINATION_TYPE_VPORT:
>   id = dst->dest_attr.vport.num;
> MLX5_SET(dest_format_struct, in_dests,
>destination_eswitch_owner_vhca_id_valid,
> @@ -381,9 +383,13 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
>   MLX5_SET(dest_format_struct, in_dests,
>destination_eswitch_owner_vhca_id,
>dst->dest_attr.vport.vhca_id);
> - } else {
> + break;
> + default:
>   id = dst->dest_attr.tir_num;
>   }
> +
> + MLX5_SET(dest_format_struct, in_dests, destination_type,
> +  type);
>   MLX5_SET(dest_format_struct, in_dests, destination_id, id);
>   in_dests += MLX5_ST_SZ_BYTES(dest_format_struct);
>   list_size++;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> index eba113cf1117..69aa298a0b1c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> @@ -1356,7 +1356,9 @@ static bool mlx5_flow_dests_cmp(struct
> mlx5_flow_destination *d1,
>   (d1->type == MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE &&
>d1->ft == d2->ft) ||
>   (d1->type == MLX5_FLOW_DESTINATION_TYPE_TIR &&
> -  d1->tir_num == d2->tir_num))
> +  d1->tir_num == d2->tir_num) ||
> + (d1->type ==
> MLX5_FLOW_DESTINATION_TYPE

Re: [PATCH mlx5-next v2 1/8] net/mlx5: Add forward compatible support for the FTE match data

2018-07-23 Thread Saeed Mahameed
On Mon, 2018-07-23 at 15:25 +0300, Leon Romanovsky wrote:
> From: Yishai Hadas 
> 
> Use the PRM size including the reserved when working with the FTE
> match data.
> 
> This comes to support forward compatibility for cases that current
> reserved data will be exposed by the firmware by an application that
> uses the DEVX API without changing the kernel.
> 
> Also drop some driver checks around the match criteria leaving the
> work
> for firmware to enable forward compatibility for future bits there.
> 
> Signed-off-by: Yishai Hadas 
> Signed-off-by: Leon Romanovsky 

Acked-by: Saeed Mahameed 

> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 77 +--
> 
>  1 file changed, 1 insertion(+), 76 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> index 49a75d31185e..eba113cf1117 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> @@ -309,89 +309,17 @@ static struct fs_prio *find_prio(struct
> mlx5_flow_namespace *ns,
>   return NULL;
>  }
>  
> -static bool check_last_reserved(const u32 *match_criteria)
> -{
> - char *match_criteria_reserved =
> - MLX5_ADDR_OF(fte_match_param, match_criteria, MLX5_FTE_MATCH_PARAM_RESERVED);
> -
> - return  !match_criteria_reserved[0] &&
> - !memcmp(match_criteria_reserved, match_criteria_reserved + 1,
> - MLX5_FLD_SZ_BYTES(fte_match_param,
> -   MLX5_FTE_MATCH_PARAM_RESERVED) - 1);
> -}
> -
> -static bool check_valid_mask(u8 match_criteria_enable, const u32 *match_criteria)
> -{
> - if (match_criteria_enable & ~(
> - (1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_OUTER_HEADERS)   |
> - (1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_MISC_PARAMETERS) |
> - (1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_INNER_HEADERS) |
> - (1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_MISC_PARAMETERS_2)))
> - return false;
> -
> - if (!(match_criteria_enable &
> -   1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_OUTER_HEADERS)) {
> - char *fg_type_mask = MLX5_ADDR_OF(fte_match_param,
> -   match_criteria, outer_headers);
> -
> - if (fg_type_mask[0] ||
> - memcmp(fg_type_mask, fg_type_mask + 1,
> -MLX5_ST_SZ_BYTES(fte_match_set_lyr_2_4) - 1))
> - return false;
> - }
> -
> - if (!(match_criteria_enable &
> -   1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_MISC_PARAMETERS)) {
> - char *fg_type_mask = MLX5_ADDR_OF(fte_match_param,
> -   match_criteria, misc_parameters);
> -
> - if (fg_type_mask[0] ||
> - memcmp(fg_type_mask, fg_type_mask + 1,
> -MLX5_ST_SZ_BYTES(fte_match_set_misc) - 1))
> - return false;
> - }
> -
> - if (!(match_criteria_enable &
> -   1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_INNER_HEADERS)) {
> - char *fg_type_mask = MLX5_ADDR_OF(fte_match_param,
> -   match_criteria, inner_headers);
> -
> - if (fg_type_mask[0] ||
> - memcmp(fg_type_mask, fg_type_mask + 1,
> -MLX5_ST_SZ_BYTES(fte_match_set_lyr_2_4) - 1))
> - return false;
> - }
> -
> - if (!(match_criteria_enable &
> -   1 << MLX5_CREATE_FLOW_GROUP_IN_MATCH_CRITERIA_ENABLE_MISC_PARAMETERS_2)) {
> - char *fg_type_mask = MLX5_ADDR_OF(fte_match_param,
> -   match_criteria, misc_parameters_2);
> -
> - if (fg_type_mask[0] ||
> - memcmp(fg_type_mask, fg_type_mask + 1,
> -MLX5_ST_SZ_BYTES(fte_match_set_misc2) - 1))
> - return false;
> - }
> -
> - return check_last_reserved(match_criteria);
> -}
> -
>  static bool check_valid_spec(const struct mlx5_flow_spec *spec)
>  {
>   int i;
>  
> - if (!check_valid_mask(spec->match_criteria_enable, spec->match_criteria)) {
> - pr_warn("mlx5_core: Match criteria given mismatches match_criteria_enable\n");
> - return false;
> - }
> -
>   for (i = 0; i < MLX5_ST_SZ_DW_MATCH_PARAM; i++)
>   if (spec->match_value[i] & ~spec->match_criteria[i]) {
>   pr_warn("mlx5_core: match_value differs from match_criteria\n");
>   return false;
>   }
>  
> - return check_last_reserved(spec->match_value);
> + return true;
>  }
>  
>  static struct mlx5_flow_root_namespace *find_root(struct fs_node

Re: [PATCH net-next] net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc

2018-07-23 Thread David Miller
From: Cong Wang 
Date: Mon, 23 Jul 2018 13:37:22 -0700

> On Sun, Jul 22, 2018 at 12:29 AM Tariq Toukan  wrote:
>>
>>
>>
>> On 19/07/2018 8:21 PM, Cong Wang wrote:
>> > On Thu, Jul 19, 2018 at 7:50 AM Tariq Toukan  wrote:
>> >> --- a/net/core/dev_ioctl.c
>> >> +++ b/net/core/dev_ioctl.c
>> >> @@ -282,14 +282,7 @@ static int dev_ifsioc(struct net *net, struct ifreq 
>> >> *ifr, unsigned int cmd)
>> >>  return dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
>> >>
>> >>  case SIOCSIFTXQLEN:
>> >> -   if (ifr->ifr_qlen < 0)
>> >> -   return -EINVAL;
>> >
>> > Are you sure we can remove this if check too?
>> >
>> > The other one is safe to remove.
>> >
>>
>> Hmm, let's see:
>> dev_change_tx_queue_len gets unsigned long new_len, any negative value
>> passed is interpreted as a very large number, then we test:
>> if (new_len != (unsigned int)new_len)
>>
>> This test returns true if range of unsigned long is larger than range of
>> unsigned int. AFAIK these ranges are Arch dependent and there is no
>> guarantee this holds.
> 
> I am not sure either, you probably have to give it a test.
> And at least, explain it in changelog if you still want to remove it.

On 64-bit we will fail with -ERANGE.  The 32-bit int ifr_qlen will be sign
extended to 64-bits when it is passed into dev_change_tx_queue_len(). And
then for negative values this test triggers:

if (new_len != (unsigned int)new_len)
return -ERANGE;

because the sign-extended 64-bit value differs from its zero-extended
32-bit truncation:

if (0xffffffffWHATEVER != 0x00000000WHATEVER)

On 32-bit the signed value will be accepted, changing behavior.

I think, therefore, that the < 0 check should be retained.

Thank you.


Re: [PATCH iproute2] devlink: CTRL_ATTR_FAMILY_ID is a u16

2018-07-23 Thread Stephen Hemminger
On Fri, 20 Jul 2018 09:35:26 -0700
dsah...@kernel.org wrote:

> From: David Ahern 
> 
> CTRL_ATTR_FAMILY_ID is a u16, not a u32. Update devlink accordingly.
> 
> Fixes: a3c4b484a1edd ("add devlink tool")
> Signed-off-by: David Ahern 

Applied


Re: [PATCH v5 net-next] net/sched: add skbprio scheduler

2018-07-23 Thread Cong Wang
On Mon, Jul 23, 2018 at 7:07 AM Nishanth Devarajan  wrote:
>
> net/sched: add skbprio scheduler
>
> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> according to their skb->priority field. Under congestion, already-enqueued 
> lower
> priority packets will be dropped to make space available for higher priority
> packets. Skbprio was conceived as a solution for denial-of-service defenses 
> that
> need to route packets with different priorities as a means to overcome DoS
> attacks.
>
> v5
> *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
> default sch->limit to 64.

Overall, it looks much better now!

Acked-by: Cong Wang 

Thanks for the update!
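For readers wanting to try the qdisc discussed above, a minimal configuration sketch (the device name is hypothetical, and the syntax assumes the matching iproute2 support for skbprio):

```shell
# Attach skbprio as the root qdisc with the v5 default limit of 64;
# under congestion, queued low-priority skbs are dropped first.
tc qdisc add dev eth0 root handle 1: skbprio limit 64

# Inspect queue statistics
tc -s qdisc show dev eth0
```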


Re: [PATCH net-next] net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc

2018-07-23 Thread Cong Wang
On Sun, Jul 22, 2018 at 12:29 AM Tariq Toukan  wrote:
>
>
>
> On 19/07/2018 8:21 PM, Cong Wang wrote:
> > On Thu, Jul 19, 2018 at 7:50 AM Tariq Toukan  wrote:
> >> --- a/net/core/dev_ioctl.c
> >> +++ b/net/core/dev_ioctl.c
> >> @@ -282,14 +282,7 @@ static int dev_ifsioc(struct net *net, struct ifreq 
> >> *ifr, unsigned int cmd)
> >>  return dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
> >>
> >>  case SIOCSIFTXQLEN:
> >> -   if (ifr->ifr_qlen < 0)
> >> -   return -EINVAL;
> >
> > Are you sure we can remove this if check too?
> >
> > The other one is safe to remove.
> >
>
> Hmm, let's see:
> dev_change_tx_queue_len gets unsigned long new_len, any negative value
> passed is interpreted as a very large number, then we test:
> if (new_len != (unsigned int)new_len)
>
> This test returns true if range of unsigned long is larger than range of
> unsigned int. AFAIK these ranges are Arch dependent and there is no
> guarantee this holds.
>

I am not sure either, you probably have to give it a test.
And at least, explain it in changelog if you still want to remove it.

Thanks.


[PATCH net] sock: fix sg page frag coalescing in sk_alloc_sg

2018-07-23 Thread Daniel Borkmann
Current sg coalescing logic in sk_alloc_sg() (latter is used by tls and
sockmap) is not quite correct in that we do fetch the previous sg entry,
however the subsequent check whether the refilled page frag from the
socket is still the same as from the last entry with prior offset and
length matching the start of the current buffer always compares the
first sg list entry instead of the prior one.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Daniel Borkmann 
Acked-by: Dave Watson 
---
 net/core/sock.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 9e8f655..bc2d7a3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2277,9 +2277,9 @@ int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
pfrag->offset += use;
 
sge = sg + sg_curr - 1;
-   if (sg_curr > first_coalesce && sg_page(sg) == pfrag->page &&
-   sg->offset + sg->length == orig_offset) {
-   sg->length += use;
+   if (sg_curr > first_coalesce && sg_page(sge) == pfrag->page &&
+   sge->offset + sge->length == orig_offset) {
+   sge->length += use;
} else {
sge = sg + sg_curr;
sg_unmark_end(sge);
-- 
2.9.5



[PATCH v4 net-next 1/8] lan743x: Add support for ethtool get_drvinfo

2018-07-23 Thread Bryan Whitehead
Implement ethtool get_drvinfo

Signed-off-by: Bryan Whitehead 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/microchip/Makefile  |  2 +-
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 21 +
 drivers/net/ethernet/microchip/lan743x_ethtool.h | 11 +++
 drivers/net/ethernet/microchip/lan743x_main.c|  2 ++
 4 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/microchip/lan743x_ethtool.c
 create mode 100644 drivers/net/ethernet/microchip/lan743x_ethtool.h

diff --git a/drivers/net/ethernet/microchip/Makefile b/drivers/net/ethernet/microchip/Makefile
index 2e982cc..43f47cb 100644
--- a/drivers/net/ethernet/microchip/Makefile
+++ b/drivers/net/ethernet/microchip/Makefile
@@ -6,4 +6,4 @@ obj-$(CONFIG_ENC28J60) += enc28j60.o
 obj-$(CONFIG_ENCX24J600) += encx24j600.o encx24j600-regmap.o
 obj-$(CONFIG_LAN743X) += lan743x.o
 
-lan743x-objs := lan743x_main.o
+lan743x-objs := lan743x_main.o lan743x_ethtool.o
diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
new file mode 100644
index 000..0e20758
--- /dev/null
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/* Copyright (C) 2018 Microchip Technology Inc. */
+
+#include 
+#include "lan743x_main.h"
+#include "lan743x_ethtool.h"
+#include 
+
+static void lan743x_ethtool_get_drvinfo(struct net_device *netdev,
+   struct ethtool_drvinfo *info)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver));
+   strlcpy(info->bus_info,
+   pci_name(adapter->pdev), sizeof(info->bus_info));
+}
+
+const struct ethtool_ops lan743x_ethtool_ops = {
+   .get_drvinfo = lan743x_ethtool_get_drvinfo,
+};
diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.h b/drivers/net/ethernet/microchip/lan743x_ethtool.h
new file mode 100644
index 000..d0d11a7
--- /dev/null
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/* Copyright (C) 2018 Microchip Technology Inc. */
+
+#ifndef _LAN743X_ETHTOOL_H
+#define _LAN743X_ETHTOOL_H
+
+#include "linux/ethtool.h"
+
+extern const struct ethtool_ops lan743x_ethtool_ops;
+
+#endif /* _LAN743X_ETHTOOL_H */
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index e1747a4..ade3b04 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include "lan743x_main.h"
+#include "lan743x_ethtool.h"
 
 static void lan743x_pci_cleanup(struct lan743x_adapter *adapter)
 {
@@ -2689,6 +2690,7 @@ static int lan743x_pcidev_probe(struct pci_dev *pdev,
goto cleanup_hardware;
 
adapter->netdev->netdev_ops = &lan743x_netdev_ops;
+   adapter->netdev->ethtool_ops = &lan743x_ethtool_ops;
adapter->netdev->features = NETIF_F_SG | NETIF_F_TSO | NETIF_F_HW_CSUM;
adapter->netdev->hw_features = adapter->netdev->features;
 
-- 
2.7.4



[PATCH v4 net-next 4/8] lan743x: Add support for ethtool message level

2018-07-23 Thread Bryan Whitehead
Implement ethtool message level

Signed-off-by: Bryan Whitehead 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index 9ed9711..bab1344 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -17,6 +17,21 @@ static void lan743x_ethtool_get_drvinfo(struct net_device *netdev,
pci_name(adapter->pdev), sizeof(info->bus_info));
 }
 
+static u32 lan743x_ethtool_get_msglevel(struct net_device *netdev)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   return adapter->msg_enable;
+}
+
+static void lan743x_ethtool_set_msglevel(struct net_device *netdev,
+u32 msglevel)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   adapter->msg_enable = msglevel;
+}
+
 static const char lan743x_set0_hw_cnt_strings[][ETH_GSTRING_LEN] = {
"RX FCS Errors",
"RX Alignment Errors",
@@ -196,6 +211,8 @@ static int lan743x_ethtool_get_sset_count(struct net_device *netdev, int sset)
 
 const struct ethtool_ops lan743x_ethtool_ops = {
.get_drvinfo = lan743x_ethtool_get_drvinfo,
+   .get_msglevel = lan743x_ethtool_get_msglevel,
+   .set_msglevel = lan743x_ethtool_set_msglevel,
.get_link = ethtool_op_get_link,
 
.get_strings = lan743x_ethtool_get_strings,
-- 
2.7.4



[PATCH v4 net-next 0/8] lan743x: Add features to lan743x driver

2018-07-23 Thread Bryan Whitehead
This patch series adds extra features to the lan743x driver.

Updates for v4:
Patch 6/8 - Modified get/set_wol to use super set of
MAC and PHY driver support.
Patch 7/8 - In set_eee, return the return value from phy_ethtool_set_eee.

Updates for v3:
Removed patch 9 from this series, regarding PTP support
Patch 6/8 - Add call to phy_ethtool_get_wol to lan743x_ethtool_get_wol
Patch 7/8 - Add call to phy_ethtool_set_eee on (!eee->eee_enabled)

Updates for v2:
Patch 3/9 - Used ARRAY_SIZE macro in lan743x_ethtool_get_ethtool_stats.
Patch 5/9 - Used MAX_EEPROM_SIZE in lan743x_ethtool_set_eeprom.
Patch 6/9 - Removed unnecessary read of PMT_CTL.
Used CRC algorithm from lib.
Removed PHY interrupt settings from lan743x_pm_suspend
Change "#if CONFIG_PM" to "#ifdef CONFIG_PM"

Bryan Whitehead (8):
  lan743x: Add support for ethtool get_drvinfo
  lan743x: Add support for ethtool link settings
  lan743x: Add support for ethtool statistics
  lan743x: Add support for ethtool message level
  lan743x: Add support for ethtool eeprom access
  lan743x: Add power management support
  lan743x: Add EEE support
  lan743x: Add RSS support

 drivers/net/ethernet/microchip/Makefile  |   2 +-
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 696 +++
 drivers/net/ethernet/microchip/lan743x_ethtool.h |  11 +
 drivers/net/ethernet/microchip/lan743x_main.c| 204 ++-
 drivers/net/ethernet/microchip/lan743x_main.h| 133 +
 5 files changed, 1042 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/microchip/lan743x_ethtool.c
 create mode 100644 drivers/net/ethernet/microchip/lan743x_ethtool.h

-- 
2.7.4



[PATCH v4 net-next 3/8] lan743x: Add support for ethtool statistics

2018-07-23 Thread Bryan Whitehead
Implement ethtool statistics

Signed-off-by: Bryan Whitehead 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 180 +++
 drivers/net/ethernet/microchip/lan743x_main.c|   6 +-
 drivers/net/ethernet/microchip/lan743x_main.h|  31 
 3 files changed, 214 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index 5c4582c..9ed9711 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -17,10 +17,190 @@ static void lan743x_ethtool_get_drvinfo(struct net_device *netdev,
pci_name(adapter->pdev), sizeof(info->bus_info));
 }
 
+static const char lan743x_set0_hw_cnt_strings[][ETH_GSTRING_LEN] = {
+   "RX FCS Errors",
+   "RX Alignment Errors",
+   "Rx Fragment Errors",
+   "RX Jabber Errors",
+   "RX Undersize Frame Errors",
+   "RX Oversize Frame Errors",
+   "RX Dropped Frames",
+   "RX Unicast Byte Count",
+   "RX Broadcast Byte Count",
+   "RX Multicast Byte Count",
+   "RX Unicast Frames",
+   "RX Broadcast Frames",
+   "RX Multicast Frames",
+   "RX Pause Frames",
+   "RX 64 Byte Frames",
+   "RX 65 - 127 Byte Frames",
+   "RX 128 - 255 Byte Frames",
+   "RX 256 - 511 Bytes Frames",
+   "RX 512 - 1023 Byte Frames",
+   "RX 1024 - 1518 Byte Frames",
+   "RX Greater 1518 Byte Frames",
+};
+
+static const char lan743x_set1_sw_cnt_strings[][ETH_GSTRING_LEN] = {
+   "RX Queue 0 Frames",
+   "RX Queue 1 Frames",
+   "RX Queue 2 Frames",
+   "RX Queue 3 Frames",
+};
+
+static const char lan743x_set2_hw_cnt_strings[][ETH_GSTRING_LEN] = {
+   "RX Total Frames",
+   "EEE RX LPI Transitions",
+   "EEE RX LPI Time",
+   "RX Counter Rollover Status",
+   "TX FCS Errors",
+   "TX Excess Deferral Errors",
+   "TX Carrier Errors",
+   "TX Bad Byte Count",
+   "TX Single Collisions",
+   "TX Multiple Collisions",
+   "TX Excessive Collision",
+   "TX Late Collisions",
+   "TX Unicast Byte Count",
+   "TX Broadcast Byte Count",
+   "TX Multicast Byte Count",
+   "TX Unicast Frames",
+   "TX Broadcast Frames",
+   "TX Multicast Frames",
+   "TX Pause Frames",
+   "TX 64 Byte Frames",
+   "TX 65 - 127 Byte Frames",
+   "TX 128 - 255 Byte Frames",
+   "TX 256 - 511 Bytes Frames",
+   "TX 512 - 1023 Byte Frames",
+   "TX 1024 - 1518 Byte Frames",
+   "TX Greater 1518 Byte Frames",
+   "TX Total Frames",
+   "EEE TX LPI Transitions",
+   "EEE TX LPI Time",
+   "TX Counter Rollover Status",
+};
+
+static const u32 lan743x_set0_hw_cnt_addr[] = {
+   STAT_RX_FCS_ERRORS,
+   STAT_RX_ALIGNMENT_ERRORS,
+   STAT_RX_FRAGMENT_ERRORS,
+   STAT_RX_JABBER_ERRORS,
+   STAT_RX_UNDERSIZE_FRAME_ERRORS,
+   STAT_RX_OVERSIZE_FRAME_ERRORS,
+   STAT_RX_DROPPED_FRAMES,
+   STAT_RX_UNICAST_BYTE_COUNT,
+   STAT_RX_BROADCAST_BYTE_COUNT,
+   STAT_RX_MULTICAST_BYTE_COUNT,
+   STAT_RX_UNICAST_FRAMES,
+   STAT_RX_BROADCAST_FRAMES,
+   STAT_RX_MULTICAST_FRAMES,
+   STAT_RX_PAUSE_FRAMES,
+   STAT_RX_64_BYTE_FRAMES,
+   STAT_RX_65_127_BYTE_FRAMES,
+   STAT_RX_128_255_BYTE_FRAMES,
+   STAT_RX_256_511_BYTES_FRAMES,
+   STAT_RX_512_1023_BYTE_FRAMES,
+   STAT_RX_1024_1518_BYTE_FRAMES,
+   STAT_RX_GREATER_1518_BYTE_FRAMES,
+};
+
+static const u32 lan743x_set2_hw_cnt_addr[] = {
+   STAT_RX_TOTAL_FRAMES,
+   STAT_EEE_RX_LPI_TRANSITIONS,
+   STAT_EEE_RX_LPI_TIME,
+   STAT_RX_COUNTER_ROLLOVER_STATUS,
+   STAT_TX_FCS_ERRORS,
+   STAT_TX_EXCESS_DEFERRAL_ERRORS,
+   STAT_TX_CARRIER_ERRORS,
+   STAT_TX_BAD_BYTE_COUNT,
+   STAT_TX_SINGLE_COLLISIONS,
+   STAT_TX_MULTIPLE_COLLISIONS,
+   STAT_TX_EXCESSIVE_COLLISION,
+   STAT_TX_LATE_COLLISIONS,
+   STAT_TX_UNICAST_BYTE_COUNT,
+   STAT_TX_BROADCAST_BYTE_COUNT,
+   STAT_TX_MULTICAST_BYTE_COUNT,
+   STAT_TX_UNICAST_FRAMES,
+   STAT_TX_BROADCAST_FRAMES,
+   STAT_TX_MULTICAST_FRAMES,
+   STAT_TX_PAUSE_FRAMES,
+   STAT_TX_64_BYTE_FRAMES,
+   STAT_TX_65_127_BYTE_FRAMES,
+   STAT_TX_128_255_BYTE_FRAMES,
+   STAT_TX_256_511_BYTES_FRAMES,
+   STAT_TX_512_1023_BYTE_FRAMES,
+   STAT_TX_1024_1518_BYTE_FRAMES,
+   STAT_TX_GREATER_1518_BYTE_FRAMES,
+   STAT_TX_TOTAL_FRAMES,
+   STAT_EEE_TX_LPI_TRANSITIONS,
+   STAT_EEE_TX_LPI_TIME,
+   STAT_TX_COUNTER_ROLLOVER_STATUS
+};
+
+static void lan743x_ethtool_get_strings(struct net_device *netdev,
+   u32 stringset, u8 *data)
+{
+   switch (stringset) {
+   case ETH_SS_STATS:
+   memcpy(data, lan743x_set0_hw_cnt_strings,
+  sizeof(lan743x_set0_hw_cnt_strings));
+   mem

[PATCH v4 net-next 6/8] lan743x: Add power management support

2018-07-23 Thread Bryan Whitehead
Implement power management
Supports suspend, resume, and Wake on LAN

Signed-off-by: Bryan Whitehead 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c |  47 ++
 drivers/net/ethernet/microchip/lan743x_main.c| 176 +++
 drivers/net/ethernet/microchip/lan743x_main.h|  47 ++
 3 files changed, 270 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index f9ad237..56b45aa 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -415,6 +415,49 @@ static int lan743x_ethtool_get_sset_count(struct net_device *netdev, int sset)
}
 }
 
+#ifdef CONFIG_PM
+static void lan743x_ethtool_get_wol(struct net_device *netdev,
+   struct ethtool_wolinfo *wol)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   wol->supported = 0;
+   wol->wolopts = 0;
+   phy_ethtool_get_wol(netdev->phydev, wol);
+
+   wol->supported |= WAKE_BCAST | WAKE_UCAST | WAKE_MCAST |
+   WAKE_MAGIC | WAKE_PHY | WAKE_ARP;
+
+   wol->wolopts |= adapter->wolopts;
+}
+
+static int lan743x_ethtool_set_wol(struct net_device *netdev,
+  struct ethtool_wolinfo *wol)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   adapter->wolopts = 0;
+   if (wol->wolopts & WAKE_UCAST)
+   adapter->wolopts |= WAKE_UCAST;
+   if (wol->wolopts & WAKE_MCAST)
+   adapter->wolopts |= WAKE_MCAST;
+   if (wol->wolopts & WAKE_BCAST)
+   adapter->wolopts |= WAKE_BCAST;
+   if (wol->wolopts & WAKE_MAGIC)
+   adapter->wolopts |= WAKE_MAGIC;
+   if (wol->wolopts & WAKE_PHY)
+   adapter->wolopts |= WAKE_PHY;
+   if (wol->wolopts & WAKE_ARP)
+   adapter->wolopts |= WAKE_ARP;
+
+   device_set_wakeup_enable(&adapter->pdev->dev, (bool)wol->wolopts);
+
+   phy_ethtool_set_wol(netdev->phydev, wol);
+
+   return 0;
+}
+#endif /* CONFIG_PM */
+
 const struct ethtool_ops lan743x_ethtool_ops = {
.get_drvinfo = lan743x_ethtool_get_drvinfo,
.get_msglevel = lan743x_ethtool_get_msglevel,
@@ -429,4 +472,8 @@ const struct ethtool_ops lan743x_ethtool_ops = {
.get_sset_count = lan743x_ethtool_get_sset_count,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
+#ifdef CONFIG_PM
+   .get_wol = lan743x_ethtool_get_wol,
+   .set_wol = lan743x_ethtool_set_wol,
+#endif
 };
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index 1e2f8c6..30178f8 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "lan743x_main.h"
 #include "lan743x_ethtool.h"
 
@@ -2749,10 +2750,182 @@ static void lan743x_pcidev_shutdown(struct pci_dev *pdev)
lan743x_netdev_close(netdev);
rtnl_unlock();
 
+#ifdef CONFIG_PM
+   pci_save_state(pdev);
+#endif
+
/* clean up lan743x portion */
lan743x_hardware_cleanup(adapter);
 }
 
+#ifdef CONFIG_PM
+static u16 lan743x_pm_wakeframe_crc16(const u8 *buf, int len)
+{
+   return bitrev16(crc16(0x, buf, len));
+}
+
+static void lan743x_pm_set_wol(struct lan743x_adapter *adapter)
+{
+   const u8 ipv4_multicast[3] = { 0x01, 0x00, 0x5E };
+   const u8 ipv6_multicast[3] = { 0x33, 0x33 };
+   const u8 arp_type[2] = { 0x08, 0x06 };
+   int mask_index;
+   u32 pmtctl;
+   u32 wucsr;
+   u32 macrx;
+   u16 crc;
+
+   for (mask_index = 0; mask_index < MAC_NUM_OF_WUF_CFG; mask_index++)
+   lan743x_csr_write(adapter, MAC_WUF_CFG(mask_index), 0);
+
+   /* clear wake settings */
+   pmtctl = lan743x_csr_read(adapter, PMT_CTL);
+   pmtctl |= PMT_CTL_WUPS_MASK_;
+   pmtctl &= ~(PMT_CTL_GPIO_WAKEUP_EN_ | PMT_CTL_EEE_WAKEUP_EN_ |
+   PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_ |
+   PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_ | PMT_CTL_ETH_PHY_WAKE_EN_);
+
+   macrx = lan743x_csr_read(adapter, MAC_RX);
+
+   wucsr = 0;
+   mask_index = 0;
+
+   pmtctl |= PMT_CTL_ETH_PHY_D3_COLD_OVR_ | PMT_CTL_ETH_PHY_D3_OVR_;
+
+   if (adapter->wolopts & WAKE_PHY) {
+   pmtctl |= PMT_CTL_ETH_PHY_EDPD_PLL_CTL_;
+   pmtctl |= PMT_CTL_ETH_PHY_WAKE_EN_;
+   }
+   if (adapter->wolopts & WAKE_MAGIC) {
+   wucsr |= MAC_WUCSR_MPEN_;
+   macrx |= MAC_RX_RXEN_;
+   pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_;
+   }
+   if (adapter->wolopts & WAKE_UCAST) {
+   wucsr |= MAC_WUCSR_RFE_WAKE_EN_ | MAC_WUCSR_PFDA_EN_;
+   macrx |= MAC_RX_RXEN_;
+   pmtctl |= PMT_CTL_WOL_

[PATCH v4 net-next 2/8] lan743x: Add support for ethtool link settings

2018-07-23 Thread Bryan Whitehead
Use default link setting functions

Signed-off-by: Bryan Whitehead 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index 0e20758..5c4582c 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -5,6 +5,7 @@
 #include "lan743x_main.h"
 #include "lan743x_ethtool.h"
 #include 
+#include 
 
 static void lan743x_ethtool_get_drvinfo(struct net_device *netdev,
struct ethtool_drvinfo *info)
@@ -18,4 +19,8 @@ static void lan743x_ethtool_get_drvinfo(struct net_device *netdev,
 
 const struct ethtool_ops lan743x_ethtool_ops = {
.get_drvinfo = lan743x_ethtool_get_drvinfo,
+   .get_link = ethtool_op_get_link,
+
+   .get_link_ksettings = phy_ethtool_get_link_ksettings,
+   .set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
-- 
2.7.4



[PATCH v4 net-next 8/8] lan743x: Add RSS support

2018-07-23 Thread Bryan Whitehead
Implement RSS support

Signed-off-by: Bryan Whitehead 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 132 +++
 drivers/net/ethernet/microchip/lan743x_main.c|  20 
 drivers/net/ethernet/microchip/lan743x_main.h|  19 
 3 files changed, 171 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index 86134d4..c25b3e9 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -415,6 +415,133 @@ static int lan743x_ethtool_get_sset_count(struct net_device *netdev, int sset)
}
 }
 
+static int lan743x_ethtool_get_rxnfc(struct net_device *netdev,
+struct ethtool_rxnfc *rxnfc,
+u32 *rule_locs)
+{
+   switch (rxnfc->cmd) {
+   case ETHTOOL_GRXFH:
+   rxnfc->data = 0;
+   switch (rxnfc->flow_type) {
+   case TCP_V4_FLOW:case UDP_V4_FLOW:
+   case TCP_V6_FLOW:case UDP_V6_FLOW:
+   rxnfc->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
+   /* fall through */
+   case IPV4_FLOW: case IPV6_FLOW:
+   rxnfc->data |= RXH_IP_SRC | RXH_IP_DST;
+   return 0;
+   }
+   break;
+   case ETHTOOL_GRXRINGS:
+   rxnfc->data = LAN743X_USED_RX_CHANNELS;
+   return 0;
+   }
+   return -EOPNOTSUPP;
+}
+
+static u32 lan743x_ethtool_get_rxfh_key_size(struct net_device *netdev)
+{
+   return 40;
+}
+
+static u32 lan743x_ethtool_get_rxfh_indir_size(struct net_device *netdev)
+{
+   return 128;
+}
+
+static int lan743x_ethtool_get_rxfh(struct net_device *netdev,
+   u32 *indir, u8 *key, u8 *hfunc)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   if (indir) {
+   int dw_index;
+   int byte_index = 0;
+
+   for (dw_index = 0; dw_index < 32; dw_index++) {
+   u32 four_entries =
+   lan743x_csr_read(adapter, RFE_INDX(dw_index));
+
+   byte_index = dw_index << 2;
+   indir[byte_index + 0] =
+   ((four_entries >> 0) & 0x00FF);
+   indir[byte_index + 1] =
+   ((four_entries >> 8) & 0x00FF);
+   indir[byte_index + 2] =
+   ((four_entries >> 16) & 0x00FF);
+   indir[byte_index + 3] =
+   ((four_entries >> 24) & 0x00FF);
+   }
+   }
+   if (key) {
+   int dword_index;
+   int byte_index = 0;
+
+   for (dword_index = 0; dword_index < 10; dword_index++) {
+   u32 four_entries =
+   lan743x_csr_read(adapter,
+RFE_HASH_KEY(dword_index));
+
+   byte_index = dword_index << 2;
+   key[byte_index + 0] =
+   ((four_entries >> 0) & 0x00FF);
+   key[byte_index + 1] =
+   ((four_entries >> 8) & 0x00FF);
+   key[byte_index + 2] =
+   ((four_entries >> 16) & 0x00FF);
+   key[byte_index + 3] =
+   ((four_entries >> 24) & 0x00FF);
+   }
+   }
+   if (hfunc)
+   (*hfunc) = ETH_RSS_HASH_TOP;
+   return 0;
+}
+
+static int lan743x_ethtool_set_rxfh(struct net_device *netdev,
+   const u32 *indir, const u8 *key,
+   const u8 hfunc)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+
+   if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+   return -EOPNOTSUPP;
+
+   if (indir) {
+   u32 indir_value = 0;
+   int dword_index = 0;
+   int byte_index = 0;
+
+   for (dword_index = 0; dword_index < 32; dword_index++) {
+   byte_index = dword_index << 2;
+   indir_value =
+   (((indir[byte_index + 0] & 0x00FF) << 0) |
+   ((indir[byte_index + 1] & 0x00FF) << 8) |
+   ((indir[byte_index + 2] & 0x00FF) << 16) |
+   ((indir[byte_index + 3] & 0x00FF) << 24));
+   lan743x_csr_write(adapter, RFE_INDX(dword_index),
+ indir_value);
+   }
+   }
+   if (key) {
+   int dword_index = 0;
+   int byte_index = 0;
+   u32 key_va

[PATCH v4 net-next 5/8] lan743x: Add support for ethtool eeprom access

2018-07-23 Thread Bryan Whitehead
Implement ethtool eeprom access
Also provides access to OTP (One Time Programming)

Signed-off-by: Bryan Whitehead 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 209 +++
 drivers/net/ethernet/microchip/lan743x_main.h|  33 
 2 files changed, 242 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index bab1344..f9ad237 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -7,6 +7,178 @@
 #include 
 #include 
 
+/* eeprom */
+#define LAN743X_EEPROM_MAGIC   (0x74A5)
+#define LAN743X_OTP_MAGIC  (0x74F3)
+#define EEPROM_INDICATOR_1 (0xA5)
+#define EEPROM_INDICATOR_2 (0xAA)
+#define EEPROM_MAC_OFFSET  (0x01)
+#define MAX_EEPROM_SIZE512
+#define OTP_INDICATOR_1(0xF3)
+#define OTP_INDICATOR_2(0xF7)
+
+static int lan743x_otp_write(struct lan743x_adapter *adapter, u32 offset,
+u32 length, u8 *data)
+{
+   unsigned long timeout;
+   u32 buf;
+   int i;
+
+   buf = lan743x_csr_read(adapter, OTP_PWR_DN);
+
+   if (buf & OTP_PWR_DN_PWRDN_N_) {
+   /* clear it and wait to be cleared */
+   lan743x_csr_write(adapter, OTP_PWR_DN, 0);
+
+   timeout = jiffies + HZ;
+   do {
+   udelay(1);
+   buf = lan743x_csr_read(adapter, OTP_PWR_DN);
+   if (time_after(jiffies, timeout)) {
+   netif_warn(adapter, drv, adapter->netdev,
+  "timeout on OTP_PWR_DN 
completion\n");
+   return -EIO;
+   }
+   } while (buf & OTP_PWR_DN_PWRDN_N_);
+   }
+
+   /* set to BYTE program mode */
+   lan743x_csr_write(adapter, OTP_PRGM_MODE, OTP_PRGM_MODE_BYTE_);
+
+   for (i = 0; i < length; i++) {
+   lan743x_csr_write(adapter, OTP_ADDR1,
+ ((offset + i) >> 8) &
+ OTP_ADDR1_15_11_MASK_);
+   lan743x_csr_write(adapter, OTP_ADDR2,
+ ((offset + i) &
+ OTP_ADDR2_10_3_MASK_));
+   lan743x_csr_write(adapter, OTP_PRGM_DATA, data[i]);
+   lan743x_csr_write(adapter, OTP_TST_CMD, OTP_TST_CMD_PRGVRFY_);
+   lan743x_csr_write(adapter, OTP_CMD_GO, OTP_CMD_GO_GO_);
+
+   timeout = jiffies + HZ;
+   do {
+   udelay(1);
+   buf = lan743x_csr_read(adapter, OTP_STATUS);
+   if (time_after(jiffies, timeout)) {
+   netif_warn(adapter, drv, adapter->netdev,
+  "Timeout on OTP_STATUS 
completion\n");
+   return -EIO;
+   }
+   } while (buf & OTP_STATUS_BUSY_);
+   }
+
+   return 0;
+}
+
+static int lan743x_eeprom_wait(struct lan743x_adapter *adapter)
+{
+   unsigned long start_time = jiffies;
+   u32 val;
+
+   do {
+   val = lan743x_csr_read(adapter, E2P_CMD);
+
+   if (!(val & E2P_CMD_EPC_BUSY_) ||
+   (val & E2P_CMD_EPC_TIMEOUT_))
+   break;
+   usleep_range(40, 100);
+   } while (!time_after(jiffies, start_time + HZ));
+
+   if (val & (E2P_CMD_EPC_TIMEOUT_ | E2P_CMD_EPC_BUSY_)) {
+   netif_warn(adapter, drv, adapter->netdev,
+  "EEPROM read operation timeout\n");
+   return -EIO;
+   }
+
+   return 0;
+}
+
+static int lan743x_eeprom_confirm_not_busy(struct lan743x_adapter *adapter)
+{
+   unsigned long start_time = jiffies;
+   u32 val;
+
+   do {
+   val = lan743x_csr_read(adapter, E2P_CMD);
+
+   if (!(val & E2P_CMD_EPC_BUSY_))
+   return 0;
+
+   usleep_range(40, 100);
+   } while (!time_after(jiffies, start_time + HZ));
+
+   netif_warn(adapter, drv, adapter->netdev, "EEPROM is busy\n");
+   return -EIO;
+}
+
+static int lan743x_eeprom_read(struct lan743x_adapter *adapter,
+  u32 offset, u32 length, u8 *data)
+{
+   int retval;
+   u32 val;
+   int i;
+
+   retval = lan743x_eeprom_confirm_not_busy(adapter);
+   if (retval)
+   return retval;
+
+   for (i = 0; i < length; i++) {
+   val = E2P_CMD_EPC_BUSY_ | E2P_CMD_EPC_CMD_READ_;
+   val |= (offset & E2P_CMD_EPC_ADDR_MASK_);
+   lan743x_csr_write(adapter, E2P_CMD, val);
+
+   retval = lan743x_eeprom_wait(adapter);
+

[PATCH v4 net-next 7/8] lan743x: Add EEE support

2018-07-23 Thread Bryan Whitehead
Implement EEE support

Signed-off-by: Bryan Whitehead 
---
 drivers/net/ethernet/microchip/lan743x_ethtool.c | 85 
 drivers/net/ethernet/microchip/lan743x_main.h|  3 +
 2 files changed, 88 insertions(+)

diff --git a/drivers/net/ethernet/microchip/lan743x_ethtool.c b/drivers/net/ethernet/microchip/lan743x_ethtool.c
index 56b45aa..86134d4 100644
--- a/drivers/net/ethernet/microchip/lan743x_ethtool.c
+++ b/drivers/net/ethernet/microchip/lan743x_ethtool.c
@@ -415,6 +415,89 @@ static int lan743x_ethtool_get_sset_count(struct net_device *netdev, int sset)
}
 }
 
+static int lan743x_ethtool_get_eee(struct net_device *netdev,
+  struct ethtool_eee *eee)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+   struct phy_device *phydev = netdev->phydev;
+   u32 buf;
+   int ret;
+
+   if (!phydev)
+   return -EIO;
+   if (!phydev->drv) {
+   netif_err(adapter, drv, adapter->netdev,
+ "Missing PHY Driver\n");
+   return -EIO;
+   }
+
+   ret = phy_ethtool_get_eee(phydev, eee);
+   if (ret < 0)
+   return ret;
+
+   buf = lan743x_csr_read(adapter, MAC_CR);
+   if (buf & MAC_CR_EEE_EN_) {
+   eee->eee_enabled = true;
+   eee->eee_active = !!(eee->advertised & eee->lp_advertised);
+   eee->tx_lpi_enabled = true;
+   /* EEE_TX_LPI_REQ_DLY & tx_lpi_timer are same uSec unit */
+   buf = lan743x_csr_read(adapter, MAC_EEE_TX_LPI_REQ_DLY_CNT);
+   eee->tx_lpi_timer = buf;
+   } else {
+   eee->eee_enabled = false;
+   eee->eee_active = false;
+   eee->tx_lpi_enabled = false;
+   eee->tx_lpi_timer = 0;
+   }
+
+   return 0;
+}
+
+static int lan743x_ethtool_set_eee(struct net_device *netdev,
+  struct ethtool_eee *eee)
+{
+   struct lan743x_adapter *adapter = netdev_priv(netdev);
+   struct phy_device *phydev = NULL;
+   u32 buf = 0;
+   int ret = 0;
+
+   if (!netdev)
+   return -EINVAL;
+   adapter = netdev_priv(netdev);
+   if (!adapter)
+   return -EINVAL;
+   phydev = netdev->phydev;
+   if (!phydev)
+   return -EIO;
+   if (!phydev->drv) {
+   netif_err(adapter, drv, adapter->netdev,
+ "Missing PHY Driver\n");
+   return -EIO;
+   }
+
+   if (eee->eee_enabled) {
+   ret = phy_init_eee(phydev, 0);
+   if (ret) {
+   netif_err(adapter, drv, adapter->netdev,
+ "EEE initialization failed\n");
+   return ret;
+   }
+
+   buf = (u32)eee->tx_lpi_timer;
+   lan743x_csr_write(adapter, MAC_EEE_TX_LPI_REQ_DLY_CNT, buf);
+
+   buf = lan743x_csr_read(adapter, MAC_CR);
+   buf |= MAC_CR_EEE_EN_;
+   lan743x_csr_write(adapter, MAC_CR, buf);
+   } else {
+   buf = lan743x_csr_read(adapter, MAC_CR);
+   buf &= ~MAC_CR_EEE_EN_;
+   lan743x_csr_write(adapter, MAC_CR, buf);
+   }
+
+   return phy_ethtool_set_eee(phydev, eee);
+}
+
 #ifdef CONFIG_PM
 static void lan743x_ethtool_get_wol(struct net_device *netdev,
struct ethtool_wolinfo *wol)
@@ -470,6 +553,8 @@ const struct ethtool_ops lan743x_ethtool_ops = {
.get_strings = lan743x_ethtool_get_strings,
.get_ethtool_stats = lan743x_ethtool_get_ethtool_stats,
.get_sset_count = lan743x_ethtool_get_sset_count,
+   .get_eee = lan743x_ethtool_get_eee,
+   .set_eee = lan743x_ethtool_set_eee,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
 #ifdef CONFIG_PM
diff --git a/drivers/net/ethernet/microchip/lan743x_main.h b/drivers/net/ethernet/microchip/lan743x_main.h
index 72b9beb..93cb60a 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.h
+++ b/drivers/net/ethernet/microchip/lan743x_main.h
@@ -82,6 +82,7 @@
((value << 0) & FCT_FLOW_CTL_ON_THRESHOLD_)
 
 #define MAC_CR (0x100)
+#define MAC_CR_EEE_EN_ BIT(17)
 #define MAC_CR_ADD_BIT(12)
 #define MAC_CR_ASD_BIT(11)
 #define MAC_CR_CNTR_RST_   BIT(5)
@@ -117,6 +118,8 @@
 
 #define MAC_MII_DATA   (0x124)
 
+#define MAC_EEE_TX_LPI_REQ_DLY_CNT (0x130)
+
 #define MAC_WUCSR  (0x140)
 #define MAC_WUCSR_RFE_WAKE_EN_ BIT(14)
 #define MAC_WUCSR_PFDA_EN_ BIT(3)
-- 
2.7.4



Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-23 Thread Jakub Kicinski
On Mon, 23 Jul 2018 11:39:36 +0200, Björn Töpel wrote:
> Den fre 20 juli 2018 kl 22:08 skrev Jakub Kicinski:
> > On Fri, 20 Jul 2018 10:18:21 -0700, Martin KaFai Lau wrote:  
> > > On Sat, Jul 21, 2018 at 01:04:45AM +0900, Taehee Yoo wrote:  
> > > > rhashtable_lookup() can return NULL. so that NULL pointer
> > > > check routine should be added.
> > > >
> > > > Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> > > > Signed-off-by: Taehee Yoo 
> > > > ---
> > > >  net/core/xdp.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > index 9d1f220..1c12bc7 100644
> > > > --- a/net/core/xdp.c
> > > > +++ b/net/core/xdp.c
> > > > @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> > > > rcu_read_lock();
> > > > /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> > > > xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > > > -   xa->zc_alloc->free(xa->zc_alloc, handle);
> > > > +   if (xa)
> > > > +   xa->zc_alloc->free(xa->zc_alloc, handle);  
> > > hmm... It is not clear to me that the "!xa" case doesn't have to be handled?  
> >
> > Actually I have a more fundamental question about this interface I've
> > been meaning to ask.
> >
> > IIUC free() can happen on any CPU at any time, when whatever device,
> > socket or CPU this got redirected to completed the TX.  IOW there may
> > be multiple producers.  Drivers would need to create spin lock a'la the
> > a9744f7ca200 ("xsk: fix potential race in SKB TX completion code") fix?
> >  
> 
> Jakub, apologies for the slow response. I'm still in
> "holiday/hammock&beer mode", but will be back in a week. :-P

Ah, sorry to interrupt! :)

> The idea with the xdp_return_* functions is that an xdp_buff and
> xdp_frame can have custom allocation schemes. The difference between
> struct xdp_buff and struct xdp_frame is lifetime. The xdp_buff
> lifetime is within the napi context, whereas xdp_frame can have a
> lifetime longer/outside the napi context. E.g. for a XDP_REDIRECT
> scenario an xdp_buff is converted to a xdp_frame. The conversion is
> done in include/net/xdp.h:convert_to_xdp_frame.
> 
> Currently, the zero-copy MEM_TYPE_ZERO_COPY memtype can *only* be used
> for xdp_buff, meaning that the lifetime is constrained to a napi
> context. Further, given an xdp_buff with memtype MEM_TYPE_ZERO_COPY,
> doing XDP_REDIRECT to a target that is *not* an AF_XDP socket would
> mean converting the xdp_buff to an xdp_frame. The xdp_frame can then
> be free'd on any CPU.
> 
> Note that the xsk_rcv* functions are always called from a napi
> context, and therefore use the xdp_return_buff calls.
> 
> To answer your question -- no, this fix is *not* needed, because the
> xdp_buff is napi-constrained, and the xdp_buff will only be free'd on one
> CPU.

Oh, thanks, I missed the check in convert_to_xdp_frame(), so the only
frames which can come back via the free path are out of the error path
in __xsk_rcv_zc()?

That path looks a little surprising too, isn't the expectation that if
xdp_do_redirect() returns an error the driver retains the ownership of
the buffer? 
 
static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len);

if (err) {
xdp_return_buff(xdp);
xs->rx_dropped++;
}

return err;
}

This seems to call xdp_return_buff() *and* return an error.

> > We need some form of internal kernel circulation which would be MPSC.
> > I'm currently hacking up the XSK code to tell me whether the frame was
> > consumed by the correct XSK, and always clone the frame otherwise
> > (claiming to be the "traditional" MEM_TYPE_PAGE_ORDER0).
> >
> > I feel like I'm missing something about the code.  Is redirect of
> > ZC/UMEM frame outside the xsk not possible and the only returns we will
> > see are from net/xdp/xsk.c?  That would work, but I don't see such a
> > check.  Help would be appreciated.
> >  
> 
> Right now, this is the case (refer to the TODO in
> convert_to_xdp_frame), i.e. you cannot redirect an ZC/UMEM allocated
> xdp_buff to a target that is not an xsk. This must, obviously, change
> so that an xdp_buff (of MEM_TYPE_ZERO_COPY) can be converted to an
> xdp_frame. The xdp_frame must be able to be free'd from multiple CPUs,
> so here a more sophisticated allocation scheme is required.
> 
> > Also the fact that XSK bufs can't be freed, only completed, adds to the
> > pain of implementing AF_XDP, we'd certainly need some form of "give
> > back the frame, but I may need it later" SPSC mechanism, otherwise
> > driver writers will have tough time.  Unless, again, I'm missing
> > something about the code :)
> >  
> 
> Yup, moving the recycling scheme from driver to "generic" is a good
> idea! I nee

[PATCH v2 net-next] net: phy: add helper phy_polling_mode

2018-07-23 Thread Heiner Kallweit
Add a helper for checking whether polling is used to detect PHY status
changes.

Signed-off-by: Heiner Kallweit 
---
v2:
- merge both patches
---
 drivers/net/phy/phy.c |  8 
 include/linux/phy.h   | 10 ++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 914fe8e6..7ade22a7 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -519,7 +519,7 @@ static int phy_start_aneg_priv(struct phy_device *phydev, bool sync)
 * negotiation may already be done and aneg interrupt may not be
 * generated.
 */
-   if (phydev->irq != PHY_POLL && phydev->state == PHY_AN) {
+   if (!phy_polling_mode(phydev) && phydev->state == PHY_AN) {
err = phy_aneg_done(phydev);
if (err > 0) {
trigger = true;
@@ -977,7 +977,7 @@ void phy_state_machine(struct work_struct *work)
needs_aneg = true;
break;
case PHY_NOLINK:
-   if (phydev->irq != PHY_POLL)
+   if (!phy_polling_mode(phydev))
break;
 
err = phy_read_status(phydev);
@@ -1018,7 +1018,7 @@ void phy_state_machine(struct work_struct *work)
/* Only register a CHANGE if we are polling and link changed
 * since latest checking.
 */
-   if (phydev->irq == PHY_POLL) {
+   if (phy_polling_mode(phydev)) {
old_link = phydev->link;
err = phy_read_status(phydev);
if (err)
@@ -1117,7 +1117,7 @@ void phy_state_machine(struct work_struct *work)
 * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
 * between states from phy_mac_interrupt()
 */
-   if (phydev->irq == PHY_POLL)
+   if (phy_polling_mode(phydev))
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
   PHY_STATE_TIME * HZ);
 }
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 075c2f77..cd6f637c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -824,6 +824,16 @@ static inline bool phy_interrupt_is_valid(struct phy_device *phydev)
return phydev->irq != PHY_POLL && phydev->irq != PHY_IGNORE_INTERRUPT;
 }
 
+/**
+ * phy_polling_mode - Convenience function for testing whether polling is
+ * used to detect PHY status changes
+ * @phydev: the phy_device struct
+ */
+static inline bool phy_polling_mode(struct phy_device *phydev)
+{
+   return phydev->irq == PHY_POLL;
+}
+
 /**
  * phy_is_internal - Convenience function for testing if a PHY is internal
  * @phydev: the phy_device struct
-- 
2.18.0



Re: [PATCH net-next 1/2] net: phy: add helper phy_polling_mode

2018-07-23 Thread Heiner Kallweit
On 22.07.2018 20:11, David Miller wrote:
> 
> I think you can combine these two patches into one.
> 
> Thank you.
> 
Sure, will provide a v2.


Re: [PATCH net 0/5] tcp: more robust ooo handling

2018-07-23 Thread David Miller
From: Eric Dumazet 
Date: Mon, 23 Jul 2018 09:28:16 -0700

> Juha-Matti Tilli reported that malicious peers could inject tiny
> packets in out_of_order_queue, forcing very expensive calls
> to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
> every incoming packet.
> 
> With tcp_rmem[2] default of 6MB, the ooo queue could
> contain ~7000 nodes.
> 
> This patch series makes sure we cut cpu cycles enough to
> render the attack not critical.
> 
> We might in the future go further, like disconnecting
> or black-holing proven malicious flows.

Sucky...

It took me a while to understand the sums_tiny logic, every
time I read that function I forget that we reset all of the
state and restart the loop after a coalesce inside the loop.

Series applied, and queued up for -stable.

Thanks!


Re: [PATCH bpf] bpf: btf: Ensure the member->offset is in the right order

2018-07-23 Thread Yonghong Song
On 7/20/18 5:38 PM, Martin KaFai Lau wrote:

This patch ensures the member->offset of a struct
is in the correct order (i.e. a later member's offset cannot
go backward).

The current "pahole -J" BTF encoder does not generate something
like this.  However, checking this ensures a future encoder
will not violate it.

Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau 

Acked-by: Yonghong Song 


Re: [PATCH v2 bpf 2/3] bpf: Replace [u]int32_t and [u]int64_t in libbpf

2018-07-23 Thread Martin KaFai Lau
On Mon, Jul 23, 2018 at 11:04:34AM -0700, Yonghong Song wrote:
> 
> 
> On 7/21/18 11:20 AM, Martin KaFai Lau wrote:
> > This patch replaces [u]int32_t and [u]int64_t usage with
> > __[su]32 and __[su]64.  The same change goes for [u]int16_t
> > and [u]int8_t.
> > 
> > Fixes: 8a138aed4a80 ("bpf: btf: Add BTF support to libbpf")
> > Signed-off-by: Martin KaFai Lau 
> > ---
> >   tools/lib/bpf/btf.c| 28 +---
> >   tools/lib/bpf/btf.h|  8 
> >   tools/lib/bpf/libbpf.c | 12 ++--
> >   tools/lib/bpf/libbpf.h |  4 ++--
> >   4 files changed, 25 insertions(+), 27 deletions(-)
> > 
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index 8c54a4b6f187..ce77b5b57912 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -2,7 +2,6 @@
> >   /* Copyright (c) 2018 Facebook */
> >   #include 
> > -#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -27,13 +26,13 @@ struct btf {
> > struct btf_type **types;
> > const char *strings;
> > void *nohdr_data;
> > -   uint32_t nr_types;
> > -   uint32_t types_size;
> > -   uint32_t data_size;
> > +   __u32 nr_types;
> > +   __u32 types_size;
> > +   __u32 data_size;
> > int fd;
> >   };
> > -static const char *btf_name_by_offset(const struct btf *btf, uint32_t 
> > offset)
> > +static const char *btf_name_by_offset(const struct btf *btf, __u32 offset)
> >   {
> > if (offset < btf->hdr->str_len)
> > return &btf->strings[offset];
> > @@ -151,7 +150,7 @@ static int btf_parse_type_sec(struct btf *btf, 
> > btf_print_fn_t err_log)
> > while (next_type < end_type) {
> > struct btf_type *t = next_type;
> > -   uint16_t vlen = BTF_INFO_VLEN(t->info);
> > +   __u16 vlen = BTF_INFO_VLEN(t->info);
> > int err;
> > next_type += sizeof(*t);
> > @@ -191,7 +190,7 @@ static int btf_parse_type_sec(struct btf *btf, 
> > btf_print_fn_t err_log)
> >   }
> >   static const struct btf_type *btf_type_by_id(const struct btf *btf,
> > -uint32_t type_id)
> > +__u32 type_id)
> >   {
> > if (type_id > btf->nr_types)
> > return NULL;
> > @@ -226,12 +225,12 @@ static int64_t btf_type_size(const struct btf_type *t)
> 
> Missing this one:
>static int64_t btf_type_size(const struct btf_type *t)
> 
> There are a couple of instances of using u32 instead of __u32, better to use
> __u32 everywhere in the same file:
> u32 expand_by, new_size;
> u32 meta_left;
Thanks for pointing them out.  Will make the changes.

> 
> 
> >   #define MAX_RESOLVE_DEPTH 32
> > -int64_t btf__resolve_size(const struct btf *btf, uint32_t type_id)
> > +__s64 btf__resolve_size(const struct btf *btf, __u32 type_id)
> >   {
> > const struct btf_array *array;
> > const struct btf_type *t;
> > -   uint32_t nelems = 1;
> > -   int64_t size = -1;
> > +   __u32 nelems = 1;
> > +   __s64 size = -1;
> > int i;
> > t = btf_type_by_id(btf, type_id);
> > @@ -271,9 +270,9 @@ int64_t btf__resolve_size(const struct btf *btf, 
> > uint32_t type_id)
> > return nelems * size;
> >   }
> > -int32_t btf__find_by_name(const struct btf *btf, const char *type_name)
> > +__s32 btf__find_by_name(const struct btf *btf, const char *type_name)
> >   {
> > -   uint32_t i;
> > +   __u32 i;
> > if (!strcmp(type_name, "void"))
> > return 0;
> > @@ -302,10 +301,9 @@ void btf__free(struct btf *btf)
> > free(btf);
> >   }
> > -struct btf *btf__new(uint8_t *data, uint32_t size,
> > -btf_print_fn_t err_log)
> > +struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
> >   {
> > -   uint32_t log_buf_size = 0;
> > +   __u32 log_buf_size = 0;
> > char *log_buf = NULL;
> > struct btf *btf;
> > int err;
> > diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
> > index 74bb344035bb..ed3a84370ccc 100644
> > --- a/tools/lib/bpf/btf.h
> > +++ b/tools/lib/bpf/btf.h
> > @@ -4,7 +4,7 @@
> >   #ifndef __BPF_BTF_H
> >   #define __BPF_BTF_H
> > -#include 
> > +#include 
> >   #define BTF_ELF_SEC ".BTF"
> > @@ -14,9 +14,9 @@ typedef int (*btf_print_fn_t)(const char *, ...)
> > __attribute__((format(printf, 1, 2)));
> >   void btf__free(struct btf *btf);
> > -struct btf *btf__new(uint8_t *data, uint32_t size, btf_print_fn_t err_log);
> > -int32_t btf__find_by_name(const struct btf *btf, const char *type_name);
> > -int64_t btf__resolve_size(const struct btf *btf, uint32_t type_id);
> > +struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log);
> > +__s32 btf__find_by_name(const struct btf *btf, const char *type_name);
> > +__s64 btf__resolve_size(const struct btf *btf, __u32 type_id);
> >   int btf__fd(const struct btf *btf);
> >   #endif
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index a1e96b5de5ff..6deb4fe4fffe 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/

Re: [PATCH v2 bpf 3/3] bpf: Introduce BPF_ANNOTATE_KV_PAIR

2018-07-23 Thread Martin KaFai Lau
On Mon, Jul 23, 2018 at 11:31:43AM -0700, Yonghong Song wrote:
> 
> 
> On 7/21/18 11:20 AM, Martin KaFai Lau wrote:
> > This patch introduces BPF_ANNOTATE_KV_PAIR to signal the
> > bpf loader about the btf key_type and value_type of a bpf map.
> > Please refer to the changes in test_btf_haskv.c for its usage.
> > Both iproute2 and libbpf loader will then have the same
> > convention to find out the map's btf_key_type_id and
> > btf_value_type_id from a map's name.
> > 
> > Fixes: 8a138aed4a80 ("bpf: btf: Add BTF support to libbpf")
> > Suggested-by: Daniel Borkmann 
> > Signed-off-by: Martin KaFai Lau 
> > ---
> >   tools/lib/bpf/btf.c  |  7 +-
> >   tools/lib/bpf/btf.h  |  2 +
> >   tools/lib/bpf/libbpf.c   | 75 +++-
> >   tools/testing/selftests/bpf/bpf_helpers.h|  9 +++
> >   tools/testing/selftests/bpf/test_btf_haskv.c |  7 +-
> >   5 files changed, 56 insertions(+), 44 deletions(-)
> > 
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index ce77b5b57912..321a99e648ed 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -189,8 +189,7 @@ static int btf_parse_type_sec(struct btf *btf, 
> > btf_print_fn_t err_log)
> > return 0;
> >   }
> > -static const struct btf_type *btf_type_by_id(const struct btf *btf,
> > -__u32 type_id)
> > +const struct btf_type *btf__type_by_id(const struct btf *btf, __u32 
> > type_id)
> >   {
> > if (type_id > btf->nr_types)
> > return NULL;
> > @@ -233,7 +232,7 @@ __s64 btf__resolve_size(const struct btf *btf, __u32 
> > type_id)
> > __s64 size = -1;
> > int i;
> > -   t = btf_type_by_id(btf, type_id);
> > +   t = btf__type_by_id(btf, type_id);
> > for (i = 0; i < MAX_RESOLVE_DEPTH && !btf_type_is_void_or_null(t);
> >  i++) {
> > size = btf_type_size(t);
> > @@ -258,7 +257,7 @@ __s64 btf__resolve_size(const struct btf *btf, __u32 
> > type_id)
> > return -EINVAL;
> > }
> > -   t = btf_type_by_id(btf, type_id);
> > +   t = btf__type_by_id(btf, type_id);
> > }
> > if (size < 0)
> > diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
> > index ed3a84370ccc..e2a09a155f84 100644
> > --- a/tools/lib/bpf/btf.h
> > +++ b/tools/lib/bpf/btf.h
> > @@ -9,6 +9,7 @@
> >   #define BTF_ELF_SEC ".BTF"
> >   struct btf;
> > +struct btf_type;
> >   typedef int (*btf_print_fn_t)(const char *, ...)
> > __attribute__((format(printf, 1, 2)));
> > @@ -16,6 +17,7 @@ typedef int (*btf_print_fn_t)(const char *, ...)
> >   void btf__free(struct btf *btf);
> >   struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log);
> >   __s32 btf__find_by_name(const struct btf *btf, const char *type_name);
> > +const struct btf_type *btf__type_by_id(const struct btf *btf, __u32 id);
> >   __s64 btf__resolve_size(const struct btf *btf, __u32 type_id);
> >   int btf__fd(const struct btf *btf);
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 6deb4fe4fffe..d881d370616c 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -36,6 +36,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -1014,68 +1015,72 @@ bpf_program__collect_reloc(struct bpf_program 
> > *prog, GElf_Shdr *shdr,
> >   static int bpf_map_find_btf_info(struct bpf_map *map, const struct btf 
> > *btf)
> >   {
> > +   const struct btf_type *container_type;
> > +   const struct btf_member *key, *value;
> > struct bpf_map_def *def = &map->def;
> > const size_t max_name = 256;
> > +   char container_name[max_name];
> > __s64 key_size, value_size;
> > -   __s32 key_id, value_id;
> > -   char name[max_name];
> > +   __s32 container_id;
> > -   /* Find key type by name from BTF */
> > -   if (snprintf(name, max_name, "%s_key", map->name) == max_name) {
> > -   pr_warning("map:%s length of BTF key_type:%s_key is too long\n",
> > +   if (snprintf(container_name, max_name, "btf_map_%s", map->name) ==
> > +   max_name) {
> > +   pr_warning("map:%s length of 'btf_map_%s' is too long\n",
> >map->name, map->name);
> > return -EINVAL;
> > }
> > -   key_id = btf__find_by_name(btf, name);
> > -   if (key_id < 0) {
> > -   pr_debug("map:%s key_type:%s cannot be found in BTF\n",
> > -map->name, name);
> > -   return key_id;
> > +   container_id = btf__find_by_name(btf, container_name);
> > +   if (container_id < 0) {
> > +   pr_debug("map:%s container_name:%s cannot be found in BTF. 
> > Missing BPF_ANNOTATE_KV_PAIR?\n",
> > +map->name, container_name);
> > +   return container_id;
> > }
> > -   key_size = btf__resolve_size(btf, key_id);
> > -   if (key_size < 0) {
> > -   pr_warning("map:%s key_type:%s cannot 

Re: [PATCH net] ip: hash fragments consistently

2018-07-23 Thread David Miller
From: Paolo Abeni 
Date: Mon, 23 Jul 2018 16:50:48 +0200

> The skb hash for locally generated ip[v6] fragments belonging
> to the same datagram can vary in several circumstances:
> * for connected UDP[v6] sockets, the first fragment get its hash
>   via set_owner_w()/skb_set_hash_from_sk()
> * for unconnected IPv6 UDPv6 sockets, the first fragment can get
>   its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
>   auto_flowlabel is enabled
> 
> For the following frags the hash is usually computed via
> skb_get_hash().
> The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
> scenario the egress tx queue can be selected on a per packet basis
> via the skb hash.
> It may also fool flow-oriented schedulers to place fragments belonging
> to the same datagram in different flows.
> 
> Fix the issue by copying the skb hash from the head frag into
> the others at fragmentation time.
> 
> Before this commit:
> perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 
> skb->sw_hash:b1@1/8"
> netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
> perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
> perf script
> probe:dev_queue_xmit: (8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
> probe:dev_queue_xmit: (8c6b1b20) hash=0 l4_hash=0 sw_hash=0
> 
> After this commit:
> probe:dev_queue_xmit: (8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
> probe:dev_queue_xmit: (8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
> 
> Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on 
> xmit")
> Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in 
> ip6_make_flowlabel")
> Signed-off-by: Paolo Abeni 

Good catch!

Applied and queued up for -stable, thanks!

