from:"Eric Dumazet"

Re: [PATCH] drivers: net: wireless: ath: ath9k: dfs: remove VLA usage

2018-03-09 Thread Eric Dumazet




On 03/09/2018 05:23 AM, Andreas Christoforou wrote:

The kernel would like to have all stack VLA usage removed.

This is the correct patch.

Signed-off-by: Andreas Christoforou 


This is a lazy changelog really.

'This is the correct patch' has no technical value.

What is VLA  ? Sure, _now_ its pretty clear since we have floods of 
these patches, but...


In one or two years, people reading it will not understand the logic of 
this patch.


I had to look at the source to understand what was going on, and that is 
not a good sign for a trivial patch like that.

Re: [PATCH] wil6210: Replace five seq_puts() calls by seq_putc()

2017-05-09 Thread Eric Dumazet

On Tue, 2017-05-09 at 09:50 +0200, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 8 May 2017 22:22:04 +0200
> 
> Five single characters (line breaks) should be put into a sequence.
> Thus use the corresponding function "seq_putc".
> 
> This issue was detected by using the Coccinelle software.

There is no _issue_ at all here, only a matter of taste.

printf("\n")  or putchar('\n')  in some slow path is really not that
interesting.

Re: [PATCH v2] brcmfmac: Make skb header writable before use

2017-04-24 Thread Eric Dumazet

On Mon, 2017-04-24 at 14:03 +0100, James Hughes wrote:
> The driver was making changes to the skb_header without
> ensuring it was writable (i.e. uncloned).
> This patch also removes some boiler plate header size
> checking/adjustment code as that is also handled by the
> skb_cow_header function used to make header writable.
> 
> This patch depends on
> brcmfmac: Ensure pointer correctly set if skb data location changes
> 
> Signed-off-by: James Hughes 
> ---
> Changes in v2
>   Makes the _cow_ call at the entry point of the skb in to the
>   stack, means only needs to be done once, and error handling
>   is easier.
>   Split a separate minor bug fix off to a separate patch (which
>   this patch depends on)

Best way to handle dependencies is to send a patch series.

1/2 brcmfmac: Ensure pointer correctly set if skb data location changes
2/2 brcmfmac: Make skb header writable before use

Re: [PATCH] p54: add null pointer check before releasing socket buffer

2017-04-10 Thread Eric Dumazet

On Mon, Apr 10, 2017 at 2:22 PM, Christian Lamparter
 wrote:

> Well, the patch could be as simple as this:
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7869ae3837ca..44f7d5a1c67c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2450,6 +2450,9 @@ void __dev_kfree_skb_irq(struct sk_buff *skb, enum 
> skb_free_reason reason)
>  {
> unsigned long flags;
>
> +   if (!skb)
> +   return;
> +
> if (likely(atomic_read(>users) == 1)) {
> smp_rmb();
> atomic_set(>users, 0);
> ---
>
> The question is: would David or Eric support the change. Any comments,
> what's the prefered solution? Just patch __dev_kfree_skb_irq to make
> it consistent with *kfree*, or patch the driver? I'm fine either way,
> but I would prefere patching __dev_kfree_skb_irq.

This is fine, same check happens in consume_skb()

Re: [Make-wifi-fast] [PATCH v3] mac80211: Dynamically set CoDel parameters per station

2017-04-06 Thread Eric Dumazet

On Thu, 2017-04-06 at 11:38 +0200, Toke Høiland-Jørgensen wrote:

> +
> + if (thr && thr < STA_SLOW_THRESHOLD * sta->local->num_sta) {
> + sta->cparams.target = MS2TIME(50);
> + sta->cparams.interval = MS2TIME(300);
> + sta->cparams.ecn = false;
> + } else {
> + sta->cparams.target = MS2TIME(20);
> + sta->cparams.interval = MS2TIME(100);
> + sta->cparams.ecn = true;
> + }
> +}

Why ECN is flipped on/off like that ? ECN really should be an admin
choice.

Also, this change in parameters looks suspect to me, adding a bimodal
behavior. I would consult Kathleen and Van on this possibility.

Re: IPv6-UDP 0x0000 checksum

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 07:24 -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 15:49 +0100, Johannes Berg wrote:
> 
> > Unfortunately, I haven't been able to actually test this yet. I also
> > didn't find the code that would drop frames with CSUM 0 either, so I'm
> > thinking - for now - that if all the csum handling is skipped, dropping
> > 0 csum frames would also be, and then we'd accept a frame we should
> > actually have dropped.
> > 
> > I'll go test this I guess :)
> > 
> > Any pointers to where 0 csum frames are dropped?
> 
> Probably in udp6_csum_init()

vi +804 net/ipv6/udp.c

if (!uh->check && !udp_sk(sk)->no_check6_rx) {
udp6_csum_zero_error(skb);
goto csum_error;
}

Re: IPv6-UDP 0x0000 checksum

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 15:49 +0100, Johannes Berg wrote:

> Unfortunately, I haven't been able to actually test this yet. I also
> didn't find the code that would drop frames with CSUM 0 either, so I'm
> thinking - for now - that if all the csum handling is skipped, dropping
> 0 csum frames would also be, and then we'd accept a frame we should
> actually have dropped.
> 
> I'll go test this I guess :)
> 
> Any pointers to where 0 csum frames are dropped?

Probably in udp6_csum_init()

Re: IPv6-UDP 0x0000 checksum

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 14:49 +0100, Johannes Berg wrote:

> Oops, sorry - receive. We can only indicate "CHECKSUM_UNNECESSARY",
> nothing more advanced right now, but right now we'd indicate that if
> the packet had 0x in the checksum field, but should've had 0x.
> 
> On TX I believe we actually do in HW exactly what your patch just did.

Can you describe the visible effects of this problem ?

Is that because of a conversion we might do later to CHECKSUM_COMPLETE ?

Re: IPv6-UDP 0x0000 checksum

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 14:27 +0100, Johannes Berg wrote:
> Hi,
> 
> It looks like right now we may have a hardware bug and accept 0x as
> valid, when the outcome of the calculation is 0x.
> 
> What do you think we should do about this?
> 
> We could ignore the issue entirely, since 0 wasn't ever supposed to be
> sent anyway - but then we don't drop frames that we should drop. I
> didn't manage to find the code in the IPv6/UDP stack that even does
> that, but I assume it's there somewhere.
> 
> Alternatively, we could parse the packet to find the checksum inside,
> and if it's 0 then don't report CHECKSUM_UNNECESSARY, but that seems
> rather expensive/difficult due to the IPv6 variable header and all
> that. If we wanted to go this route, are there any helper functions for
> this?
> 
> Unfortunately, in the current devices, we neither have a complete
> indication that the packet was even UDP-IPv6, nor do we have the raw
> csum or anything like that. I think they're adding that to the next
> hardware spin, but we probably need to address this issue now.
> 
> johannes

Hi Johannes

I am afraid information is missing.

Is this a xmit or rcv problem ?

I recently fixed an issue, could this be this ?

commit 4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4
Author: Eric Dumazet <eduma...@google.com>
Date:   Sat Oct 29 11:02:36 2016 -0700

net: mangle zero checksum in skb_checksum_help()

Sending zero checksum is ok for TCP, but not for UDP.

UDPv6 receiver should by default drop a frame with a 0 checksum,
and UDPv4 would not verify the checksum and might accept a corrupted
packet.

Simply replace such checksum by 0x, regardless of transport.

This error was caught on SIT tunnels, but seems generic.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Cc: Maciej Żenczykowski <m...@google.com>
Cc: Willem de Bruijn <will...@google.com>
Acked-by: Maciej Żenczykowski <m...@google.com>
Signed-off-by: David S. Miller <da...@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 
820bac239738eb021354ac95ca5bbdff1840cb8e..eaad4c28069ff523ac784bf2dffd0acff82341a0
 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2484,7 +2484,7 @@ int skb_checksum_help(struct sk_buff *skb)
goto out;
}
 
-   *(__sum16 *)(skb->data + offset) = csum_fold(csum);
+   *(__sum16 *)(skb->data + offset) = csum_fold(csum) ?: CSUM_MANGLED_0;
 out_set_summed:
skb->ip_summed = CHECKSUM_NONE;
 out:

Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto

2016-08-16 Thread Eric Dumazet


Do you have tcpdumps of

1) sample with crypto

2) sample without crypto.

Looks like some TCP Small queue interaction with skb->truesize, if GSO
is involved, or encapsulation adding overhead.


On Tue, 2016-08-16 at 22:41 +0200, Toke Høiland-Jørgensen wrote:
> So Dave and I have been spending the last couple of days trying to
> narrow down why there's a performance regression in some cases on ath9k
> with the softq-FQ patches. Felix first noticed this regression, and LEDE
> currently carries a patch [1] to disable the FQ portion of the softq
> patches to avoid it.
> 
> While we have been able to narrow it down a little bit, no solution has
> been forthcoming, so this is an attempt to describe the bug in the hope
> that someone else will have an idea about what could be causing it.
> 
> What we're seeing is the following (when the access point is running
> ath9k with the softq patches):
> 
> When running two or more flows to a station, their combined throughput
> will be roughly 20-30% lower than the throughput of a single flow to the
> same station. This happens:
> 
> - for both TCP and UDP traffic.
> - independent of the base rate (i.e. signal quality).
> - but only with crypto enabled (WPA2 CCMP in this case).
> 
> However, the regression completely disappears if either of the
> following is true:
> 
> - no crypto is enabled.
> - the FQ part of mac80211 is disabled (as in [1]).
> 
> We have been able to reproduce this behaviour on two different ath9k
> hardware chips and two different architectures.
> 
> The cause of the regression seems to be that the aggregates are smaller
> when there are two flows than when there is only one. Adding debug
> statements to the aggregate forming code indicates that this is because
> no more packets are available when the aggregates are built (i.e.
> ieee80211_tx_dequeue() returns NULL).
> 
> We have not been able to determine why the queues run empty when this
> combination of circumstances arise. Since we easily get upwards of 120
> Mbps of TCP throughput without crypto but with full FQ, it's clearly not
> the hashing overhead in itself that does it (and the hashing also
> happens with just one flow, so the overhead is still there). And the
> crypto itself should be offloaded to hardware (shouldn't it? we do see a
> marked drop in overall throughput from just enabling crypto), so how
> would the queueing (say, mixing of packets from different flows)
> influence that?
> 
> Does anyone have any ideas? We are stumped...
> 
> -Toke
> 
> [1] 
> https://git.lede-project.org/?p=lede/nbd/staging.git;a=blob;f=package/kernel/mac80211/patches/220-fq_disable_hack.patch;h=7f420beea56335d5043de6fd71b5febae3e9bd79;hb=HEAD
> ___
> Make-wifi-fast mailing list
> make-wifi-f...@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/make-wifi-fast

Re: [RFC 5/7] net: Add allocation flag to rtnl_unicast()

2016-07-07 Thread Eric Dumazet

On Fri, 2016-07-08 at 12:15 +0900, Masashi Honma wrote:
=
> Thanks for comment.
> 
> I have selected GFP flags based on existing code.
> 
> I have selected GFP_ATOMIC in inet6_netconf_get_devconf() because
> skb was allocated with GFP_ATOMIC.

Point is : we should remove GFP_ATOMIC uses as much as we can.

Everytime we see one of them, we should think why it was added
and if this is really needed.

inet6_netconf_get_devconf() is a perfect example of one careless
GFP_ATOMIC usage

https://patchwork.ozlabs.org/patch/646291/






--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 5/7] net: Add allocation flag to rtnl_unicast()

2016-07-07 Thread Eric Dumazet

On Wed, 2016-07-06 at 09:28 +0900, Masashi Honma wrote:
> Signed-off-by: Masashi Honma 
> ---


> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index a1f6b7b..2b0b994 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -628,7 +628,7 @@ static int inet6_netconf_get_devconf(struct sk_buff 
> *in_skb,
>   kfree_skb(skb);
>   goto errout;
>   }
> - err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
> + err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid, GFP_ATOMIC);
>  errout:
>   return err;
>  }
> @@ -4824,7 +4824,7 @@ static int inet6_rtm_getaddr(struct sk_buff *in_skb, 
> struct nlmsghdr *nlh)
>   kfree_skb(skb);
>   goto errout_ifa;
>   }
> - err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
> + err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid, GFP_KERNEL);
>  errout_ifa:
>   in6_ifa_put(ifa);
>  errout:


Managing to mix GFP_ATOMIC and GFP_KERNEL almost randomly as you did in
this patch is definitely not good.

Further more, RTNL is a mutex, held in control path, designed to allow
schedules and waiting for memory under pressure.

We do not want to encourage GFP_ATOMIC usage in control path.

Your patch series gives the wrong signal to developers.

I will send a patch against net/ipv4/devinet.c so that we remove
GFP_ATOMIC usage there.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Codel] [PATCHv3 2/5] mac80211: implement fair queueing per txq

2016-04-18 Thread Eric Dumazet

On Mon, 2016-04-18 at 07:16 +0200, Michal Kazior wrote:

> 
> I guess .h file can give the compiler an opportunity for more
> optimizations. With .c you would need LTO which I'm not sure if it's
> available everywhere.
> 

This makes little sense really. Otherwise everything would be in .h
files.

include/net/codel.h is an include file because both codel and fq_codel
use a common template for codel_dequeue() in fast path.

But net/mac80211/fq.h is included once, so should be a .c

Certainly all the code in control plan is not fast path and does not
deserve being duplicated.

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call

2016-03-29 Thread Eric Dumazet

On Tue, 2016-03-29 at 17:27 +0800, Wei-Ning Huang wrote:
> Adding some chromium devs to the thread.
> 
> In, http://lxr.free-electrons.com/source/mm/page_alloc.c#L3152
> 
> The default mm retry allocation when 'order <=
> PAGE_ALLOC_COSTLY_ORDER' of gfp_mask contains __GFP_REPEAT.
> PAGE_ALLOC_COSTLY_ORDER is defined to be 3. On systems with page size
> = 4K, this means memory compaction and retry is only done when the
> size of allocation is <= 32K
> In mwifiex, the allocation size is 64K.



>  When we have system with
> memory fragmentation and allocation failed, there will be no retry.
> This is why we need to add __GFP_REPEAT here to allow the system to
> perform memory compaction and retry allocation.
> 
> Maybe Amit@marvell can comment on if this is a good fix on this issue.
> I'm also aware that marvell is the progress of implementing
> scatter/gatter for mwifiex, which can also fix the issue.

Before SG is implemented, you really need to copy incoming frames into
smallest chunks (to get lowest skb->truesize) and leave the 64KB
allocated stuff forever in the driver.

__GFP_REPEAT wont really solve the issue.

It seems the problem comes from the fact that the drivers calls
dev_kfree_skb_any() after calling mwifiex_deaggr_sdio_pkt(), instead of
recycling this very precious 64KB skb once memory gets fragmented.

Another problem is that mwifiex_deaggr_sdio_pkt() uses
mwifiex_alloc_dma_align_buf() with GFP_KERNEL | GFP_DMA

Really GFP_DMA makes no sense here, since the skb is going to be
processed by the stack, which has no such requirement.

Please use normal skb allocations there.

diff --git a/drivers/net/wireless/marvell/mwifiex/sdio.c 
b/drivers/net/wireless/marvell/mwifiex/sdio.c
index b2c839a..8404db5 100644
--- a/drivers/net/wireless/marvell/mwifiex/sdio.c
+++ b/drivers/net/wireless/marvell/mwifiex/sdio.c
@@ -1123,8 +1123,8 @@ static void mwifiex_deaggr_sdio_pkt(struct 
mwifiex_adapter *adapter,
__func__, pkt_len, blk_size);
break;
}
-   skb_deaggr = mwifiex_alloc_dma_align_buf(pkt_len,
-GFP_KERNEL | GFP_DMA);
+   skb_deaggr = __netdev_alloc_skb_ip_align(NULL, pkt_len,
+GFP_KERNEL);
if (!skb_deaggr)
break;
skb_put(skb_deaggr, pkt_len);




--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] codel: add forgotten inline to functions in header file

2016-02-11 Thread Eric Dumazet

On Thu, 2016-02-11 at 15:05 +, Grumbach, Emmanuel wrote:


> Yeah :) codel_should_drop seemed very long indeed... I wanted to use the
> codel_get_time and associated utils (_before, _after) in iwlwifi.
> They're better than jiffies... So maybe I can just copy that code to
> iwlwifi.


You certainly can submit a patch adding the inline, but not on all
functions present in this file ;)

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 net] nfc: use GFP_USER for user-controlled kmalloc

2016-01-29 Thread Eric Dumazet

On Fri, 2016-01-29 at 11:24 -0800, Cong Wang wrote:
> These two functions are called in sendmsg path, and the
> 'len' is passed from user-space, so we should not allow
> malicious users to OOM kernel on purpose.
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Cc: Lauro Ramos Venancio <lauro.venan...@openbossa.org>
> Cc: Aloisio Almeida Jr <aloisio.alme...@openbossa.org>
> Cc: Samuel Ortiz <sa...@linux.intel.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---

Note that the issue is not OOM the kernel (as the allocation is
attempted even after your patch), but having a way to
spill stack traces in the syslog.

Acked-by: Eric Dumazet <eduma...@google.com>

Thanks!



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RESEND net] nfc: use GFP_USER for user-controlled kmalloc

2016-01-27 Thread Eric Dumazet

On Wed, 2016-01-27 at 11:50 -0800, Cong Wang wrote:

> Hmm? I think nfc_llcp_send_ui_frame() needs to do some fragmention
> with this temporary memory, or you are saying msg_iter has some
> API available to seek the pointer? Even if so, it doesn't look like
> suitable for -stable.
> 

memcpy_from_msg(msg_data, msg, len) will overwrite the msg_data with len
bytes, or return an error.

So prior msg_data content does not matter.

kzalloc() before a memset() or memcpy() sounds defensive programming,
kmalloc() is a bit faster.

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RESEND net] nfc: use GFP_USER for user-controlled kmalloc

2016-01-26 Thread Eric Dumazet

On Tue, 2016-01-26 at 15:12 -0800, Cong Wang wrote:
> On Tue, Jan 26, 2016 at 2:55 PM, Julian Calaby  
> wrote:
> > Hi Cong,
> >
> > On Wed, Jan 27, 2016 at 4:53 AM, Cong Wang  wrote:
> >
> > A commit message would be nice. A brief rundown of how this is called
> > from userspace would be nice (I'm talking a single sentence here, e.g.
> > "this is allocated when submitting a nfc packet") and what issue
> > __GFP_NOWARN is fixing. (I'm guessing log spam due to allocation
> > failures.)
> >
> 
> I thought it is obvious. ;) Keep in mind that $subject is one part of commit
> message too, so there is a commit message although very short.
> 
> I will add it.


BTW, kzalloc() is useless here, since it is followed by

if (memcpy_from_msg(msg_data, msg, len)) {

Also, this file seems to have two spots with the same problem,
in nfc_llcp_send_ui_frame() & nfc_llcp_send_i_frame()



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v2] ath6kl: Use vmalloc to allocate ar->fw for api1 method

2015-12-02 Thread Eric Dumazet

On Tue, 2015-12-01 at 22:18 -0600, Brent Taylor wrote:
> Since commit 8437754c8335 ("ath6kl: Use vmalloc instead of kmalloc for
> fw") ar->fw is expected to be pointing to memory allocated by vmalloc.
> If the api1 method (via ath6kl_fetch_fw_api1) is used to allocate memory
> for ar->fw, then kmemdup is used.  This patch checks if the firmware being
> loaded is the 'fw' image, then use vmalloc, otherwise use kmalloc.
> 
> Signed-off-by: Brent Taylor 
> ---
> v2: Fix commit message and code formatting (use tab instaed of spaces)
> 
>  drivers/net/wireless/ath/ath6kl/init.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/ath/ath6kl/init.c 
> b/drivers/net/wireless/ath/ath6kl/init.c
> index 6ae0734..4f16bd8 100644
> --- a/drivers/net/wireless/ath/ath6kl/init.c
> +++ b/drivers/net/wireless/ath/ath6kl/init.c
> @@ -673,10 +673,15 @@ static int ath6kl_get_fw(struct ath6kl *ar, const char 
> *filename,
>   return ret;
>  
>   *fw_len = fw_entry->size;
> - *fw = kmemdup(fw_entry->data, fw_entry->size, GFP_KERNEL);
> + if (>fw == fw)
> + *fw = vmalloc(fw_entry->size);
> + else
> + *fw = kmalloc(fw_entry->size, GFP_KERNEL);
>  
>   if (*fw == NULL)
>   ret = -ENOMEM;
> + else
> + memcpy(*fw, fw_entry->data, fw_entry->size);
>  
>   release_firmware(fw_entry);
>  

This looks very odd.

Why not using kvfree() in ath6kl_core_cleanup() ?

If you switch to vmalloc() here because the kmemdup() was potentially
failing, then the changelog should say it !

Using vmalloc() instead of kmalloc() should be driven by the allocation
size, not the legacy code doing the freeing.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] net: tso: add support for IPv6

2015-10-26 Thread Eric Dumazet

On Mon, 2015-10-26 at 10:31 +0200, Emmanuel Grumbach wrote:
> Adding IPv6 for the TSO helper API is trivial:
> * Don't play with the id (which doesn't exist in IPv6)
> * Correctly update the payload_len (don't include the
>   length of the IP header itself)
> 
> Signed-off-by: Emmanuel Grumbach <emmanuel.grumb...@intel.com>
> ---
> v3: use vlan_get_protocol and call it once in tso_start
> store the result in tso_t

Acked-by: Eric Dumazet <eduma...@google.com>

Next step, adding encapsulation support ;)


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v3 2/2] iwlwifi: mvm: send large SKBs to the transport

2015-10-22 Thread Eric Dumazet

On Wed, 2015-10-21 at 21:34 +0300, Emmanuel Grumbach wrote:
> +
> + if (skb->protocol == htons(ETH_P_IP)) {
> + ip_hdr(tmp)->id = ip_hdr(skb)->id;

Too late, you already called consume_skb(skb).
So this is a potential use after free.

> + be16_add_cpu(_hdr(tmp)->id, i * num_subframes);
> + }
> +


I would use 

base_id = ip_hdr(skb)->id; // before the consume_skb(skb)

ip_hdr(tmp)->id = htons(base_id + i * num_subframes);


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v3 1/2] iwlwifi: pcie: allow to build an A-MSDU using TSO core

2015-10-21 Thread Eric Dumazet

On Thu, 2015-10-22 at 00:14 +, Grumbach, Emmanuel wrote:

> 
> Well. I guess I should at least check, but even with very small MSS, our
> device supports up to 20 pointers for the same 802.11 packet: 2 are for
> metadata. So basically, so leaves me only 18 pointers. for each MSS I
> need at least 2 (one for the headers and one for the payload), so I will
> have at most 9 of these for one packet, even with a tiny MSS.
> 

I did not see in your patch where you made the checks about 18 segs in a
TSO packet ?

> I agree that all this should be added to the code in a comment.
> Speaking of which...
> int tso_count_descs(struct sk_buff *skb)
> {
> /* The Marvell Way */
> return skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> }
> 
> What if there is some payload in the header?
> To me it sounds safer to return:
> 
> skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags + 1;
> 
> or maybe to test if there is some payload in the header and then add 1?
> If there is payload in the header, it should be considered as another
> frag, shouldn't it?

Minimal count is gso_segs (one per MSS)

Then you have to add extra for the cases we have a mss spanning a frag
in skb.

Thats a max of (skb_shinfo(skb)->nr_frags - 1) + (data_in_head() ? 1 :
0);

So I believe formula would be correct.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v3 1/2] iwlwifi: pcie: allow to build an A-MSDU using TSO core

2015-10-21 Thread Eric Dumazet

On Wed, 2015-10-21 at 21:34 +0300, Emmanuel Grumbach wrote:
> When the op_mode sends an skb whose payload is bigger than
> MSS, PCIe will create an A-MSDU out of it. PCIe assumes
> that the skb that is coming from the op_mode can fit in one
> A-MSDU. It is the op_mode's responsibility to make sure
> that this guarantee holds.
> 
> Additional headers need to be built for the subframes.
> The TSO core code takes care of the IP / TCP headers and
> the driver takes care of the 802.11 subframe headers.
> 
> These headers are stored on a per-cpu page that is re-used
> for all the packets handled on that same CPU. Each skb
> holds a reference to that page and releases the page when
> it is reclaimed. When the page gets full, it is released
> and a new one is allocated.
> 
> Since any SKB that doesn't go through the fast-xmit path
> of mac80211 will be segmented, we can assume here that the
> packet is not WEP / TKIP and has a proper SNAP header.
> 
> Signed-off-by: Emmanuel Grumbach 
> ---
>  drivers/net/wireless/iwlwifi/iwl-devtrace-data.h |  16 ++
>  drivers/net/wireless/iwlwifi/iwl-trans.h |   6 +-
>  drivers/net/wireless/iwlwifi/pcie/internal.h |   7 +
>  drivers/net/wireless/iwlwifi/pcie/trans.c|  20 +-
>  drivers/net/wireless/iwlwifi/pcie/tx.c   | 286 
> ++-
>  5 files changed, 329 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/iwlwifi/iwl-devtrace-data.h 
> b/drivers/net/wireless/iwlwifi/iwl-devtrace-data.h
> index 71a78ce..59d9edf 100644
> --- a/drivers/net/wireless/iwlwifi/iwl-devtrace-data.h
> +++ b/drivers/net/wireless/iwlwifi/iwl-devtrace-data.h
> @@ -51,6 +51,22 @@ TRACE_EVENT(iwlwifi_dev_tx_data,
>   TP_printk("[%s] TX frame data", __get_str(dev))
>  );
>  
> +TRACE_EVENT(iwlwifi_dev_tx_tso_chunk,
> + TP_PROTO(const struct device *dev,
> +  u8 *data_src, size_t data_len),
> + TP_ARGS(dev, data_src, data_len),
> + TP_STRUCT__entry(
> + DEV_ENTRY
> +
> + __dynamic_array(u8, data, data_len)
> + ),
> + TP_fast_assign(
> + DEV_ASSIGN;
> + memcpy(__get_dynamic_array(data), data_src, data_len);
> + ),
> + TP_printk("[%s] TX frame data", __get_str(dev))
> +);
> +
>  TRACE_EVENT(iwlwifi_dev_rx_data,
>   TP_PROTO(const struct device *dev,
>const struct iwl_trans *trans,
> diff --git a/drivers/net/wireless/iwlwifi/iwl-trans.h 
> b/drivers/net/wireless/iwlwifi/iwl-trans.h
> index 0ceff69..6919243 100644
> --- a/drivers/net/wireless/iwlwifi/iwl-trans.h
> +++ b/drivers/net/wireless/iwlwifi/iwl-trans.h
> @@ -379,7 +379,11 @@ static inline void iwl_free_rxb(struct iwl_rx_cmd_buffer 
> *r)
>  }
>  
>  #define MAX_NO_RECLAIM_CMDS  6
> -
> +/*
> + * The first entry in driver_data array in ieee80211_tx_info
> + * that can be used by the transport.
> + */
> +#define IWL_FIRST_DRIVER_DATA 2
>  #define IWL_MASK(lo, hi) ((1 << (hi)) | ((1 << (hi)) - (1 << (lo
>  
>  /*
> diff --git a/drivers/net/wireless/iwlwifi/pcie/internal.h 
> b/drivers/net/wireless/iwlwifi/pcie/internal.h
> index be168d1..7da5643 100644
> --- a/drivers/net/wireless/iwlwifi/pcie/internal.h
> +++ b/drivers/net/wireless/iwlwifi/pcie/internal.h
> @@ -295,6 +295,11 @@ iwl_pcie_get_scratchbuf_dma(struct iwl_txq *txq, int idx)
>  sizeof(struct iwl_pcie_txq_scratch_buf) * idx;
>  }
>  
> +struct iwl_tso_hdr_page {
> + struct page *page;
> + u8 *pos;
> +};
> +
>  /**
>   * struct iwl_trans_pcie - PCIe transport specific data
>   * @rxq: all the RX queue data
> @@ -332,6 +337,8 @@ struct iwl_trans_pcie {
>   struct net_device napi_dev;
>   struct napi_struct napi;
>  
> + struct __percpu iwl_tso_hdr_page *tso_hdr_page;
> +
>   /* INT ICT Table */
>   __le32 *ict_tbl;
>   dma_addr_t ict_tbl_dma;
> diff --git a/drivers/net/wireless/iwlwifi/pcie/trans.c 
> b/drivers/net/wireless/iwlwifi/pcie/trans.c
> index a275318..5bd678b 100644
> --- a/drivers/net/wireless/iwlwifi/pcie/trans.c
> +++ b/drivers/net/wireless/iwlwifi/pcie/trans.c
> @@ -1601,6 +1601,7 @@ static void iwl_trans_pcie_configure(struct iwl_trans 
> *trans,
>  void iwl_trans_pcie_free(struct iwl_trans *trans)
>  {
>   struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
> + int i;
>  
>  #ifdef CPTCFG_IWLWIFI_PLATFORM_DATA
>   /* Make sure the device is on before calling pci functions again.
> @@ -1631,6 +1632,15 @@ void iwl_trans_pcie_free(struct iwl_trans *trans)
>  
>   iwl_pcie_free_fw_monitor(trans);
>  
> + for_each_possible_cpu(i) {
> + struct iwl_tso_hdr_page *p =
> + per_cpu_ptr(trans_pcie->tso_hdr_page, i);
> +
> + if (p->page)
> + __free_pages(p->page, 0);
> + }
> +
> + free_percpu(trans_pcie->tso_hdr_page);
>   iwl_trans_free(trans);
>  }
>  
> @@ -2822,7 +2832,7 @@ struct iwl_trans *iwl_trans_pcie_alloc(struct pci_dev 
> *pdev,

Re: [PATCH] nfc: free skb buffer using skb_free_datagram

2015-10-19 Thread Eric Dumazet

On Mon, 2015-10-19 at 15:59 +, Insu Yun wrote:
> Freeing sk_buff genereated by skb_recv_datagram is always by
> skb_free_datagram, not kfree_skb.
> 
> Signed-off-by: Insu Yun 
> ---
>  net/nfc/llcp_sock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/nfc/llcp_sock.c b/net/nfc/llcp_sock.c
> index b7de0da..15e681f 100644
> --- a/net/nfc/llcp_sock.c
> +++ b/net/nfc/llcp_sock.c
> @@ -870,7 +870,7 @@ static int llcp_sock_recvmsg(struct socket *sock, struct 
> msghdr *msg,
>   }
>   }
>  
> - kfree_skb(skb);
> + skb_free_datagram(sk, skb);
>   }
>  
>   /* XXX Queue backlogged skbs */

Nope, you are adding a bug here.

Current code is fine.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 3/3] iwlwifi: mvm: transfer the truesize to the last TSO segment

2015-08-20 Thread Eric Dumazet

On Thu, 2015-08-20 at 06:21 +, Grumbach, Emmanuel wrote:
 
 On 08/19/2015 11:39 PM, Eric Dumazet wrote:
  On Wed, 2015-08-19 at 19:17 +, Grumbach, Emmanuel wrote:
  
  Hm.. how would net/core/tso.c avoid this?
  
  Because a driver using these helpers keep around the original LSO packet
  and frees it normally at TX completion time.
  
 
 Which is why I can't really use it. The complexity is that I have to
 (ieee802.11 specification) split an LSO is several 802.11 packets. The
 maximal 802.11 packet I can send under ideal condition is 11K long or
 so. So I *must* generate several 802.11 frames from one single LSO
 packet. OTOH, I can have more than MSS bytes in a 802.11 A-MSDU.
 

Who said you had to free original packet ? Just keep it around.

TCP will work better ( check skb_still_in_host_queue() helper if you
want to know why)

 Maybe what would help would be to be able to dynamically change the
 maximal size of an LSO packet. That would allow the wifi driver to
 ensure that the LSO can fit in a single 802.11 packet. Note that since
 the maximal length of the A-MSDU can vary based on link conditions
 (since there is only one CRC for the whole A-MSDU, you don't want long
 A-MSDUs in bad link conditions) the driver would need to be able to tell
 the TCP stack to modify the length of an LSO packet.
 To me, this sounds to be ... an overkill?

It is already doable. Check dev-gso_max_size ( and
netif_set_gso_max_size())

Make sure you do not reinvent the wheel ;)


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v3 1/3] iwlwifi: mvm: add real TSO implementation

2015-08-20 Thread Eric Dumazet

On Thu, 2015-08-20 at 11:15 +0300, Emmanuel Grumbach wrote:
 The segmentation is done completely in software. The
 driver creates several MPDUs out of a single large send.
 Each MPDU is a newly allocated SKB.
 A page is allocated to create the headers that need to be
 duplicated (SNAP / IP / TCP). The WiFi header is in the
 header of the newly created SKBs.
 
 type=feature
 
 Change-Id: I238ffa79cacc5bbdacdfbf3e9673c8d4f02b462a
 Signed-off-by: Emmanuel Grumbach emmanuel.grumb...@intel.com
 ---
  drivers/net/wireless/iwlwifi/mvm/tx.c | 513 
 +++---
  1 file changed, 481 insertions(+), 32 deletions(-)


Ouch again.

This will be a NACK. Sorry.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 3/3] iwlwifi: mvm: transfer the truesize to the last TSO segment

2015-08-20 Thread Eric Dumazet

On Thu, 2015-08-20 at 19:34 +, Grumbach, Emmanuel wrote:

 
 Err... no :( It won't work for me because the MSS impacts the number of
 segments which in turns impact the number of time the headers have to be
 copied which impacts... the A-MSDU maximal size which must be bigger
 than gso_max_size. So basically, a connection parameter (MSS) impacts a
 device parameter (gso_max_size). Now, of course, I could give up on
 small MSS connections and skb_gso_segment() skbs whose MSS is smaller
 than the default. This means that I give up on LSO in certain cases...
 

We also have dev-gso_max_segs for this kind of problems.

Really, you should take closer look.

And _if_ some LSO packets must be segmented using skb_gso_segment()
in some rare cases, who cares ?

 I will get back to the drawing board and check what I can do to use /
 enhance the core infra... But it will be hard to lose functionality /
 efficiency just to use the core infra...

Please take your time. Pushing hard some local hacks is not sustainable.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 1/3] iwlwifi: mvm: add real TSO implementation

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 15:59 +0300, Emmanuel Grumbach wrote:
 The segmentation is done completely in software. The
 driver creates several MPDUs out of a single large send.
 Each MPDU is a newly allocated SKB.
 A page is allocated to create the headers that need to be
 duplicated (SNAP / IP / TCP). The WiFi header is in the
 header of the newly created SKBs.
 
 type=feature
 
 Change-Id: I238ffa79cacc5bbdacdfbf3e9673c8d4f02b462a
 Signed-off-by: Emmanuel Grumbach emmanuel.grumb...@intel.com
 ---
  drivers/net/wireless/iwlwifi/mvm/tx.c | 513 
 +++---
  1 file changed, 481 insertions(+), 32 deletions(-)

Ouch dynamic allocations while doing xmit are certainly not needed.
Your driver should pre-allocated space for headers.

Drivers willing to implement tso have to use net/core/tso.c provided
helpers.

$ git grep -n tso_build_hdr
drivers/net/ethernet/cavium/thunder/nicvf_queues.c:1030:
tso_build_hdr(skb, hdr, tso, data_left, total_len == 0);
drivers/net/ethernet/freescale/fec_main.c:729:  tso_build_hdr(skb, hdr, 
tso, data_left, total_len == 0);
drivers/net/ethernet/marvell/mv643xx_eth.c:842: tso_build_hdr(skb, hdr, 
tso, data_left, total_len == 0);
drivers/net/ethernet/marvell/mvneta.c:1650: tso_build_hdr(skb, hdr, 
tso, data_left, total_len == 0);
include/net/tso.h:15:void tso_build_hdr(struct sk_buff *skb, char *hdr, struct 
tso_t *tso,
net/core/tso.c:14:void tso_build_hdr(struct sk_buff *skb, char *hdr, struct 
tso_t *tso,
net/core/tso.c:37:EXPORT_SYMBOL(tso_build_hdr);


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 0/3] add TSO / A-MSDU TX for iwlwifi

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 15:59 +0300, Emmanuel Grumbach wrote:

 We could have enabled A-MSDU based on xmit-more, but the
 rationale of using LSO is that when using pfifo-fast,
 the Qdisc gets one packet and dequeues is straight away
 which limits the possibility to get a lot of packets at
 once. (Am I right here?).

No, you are not ;)

Key point for xmit_more is BQL being implemented in your driver.

Relevant code is in try_bulk_dequeue_skb()


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 1/3] iwlwifi: mvm: add real TSO implementation

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 07:17 -0700, Eric Dumazet wrote:
 On Wed, 2015-08-19 at 15:59 +0300, Emmanuel Grumbach wrote:
  The segmentation is done completely in software. The
  driver creates several MPDUs out of a single large send.
  Each MPDU is a newly allocated SKB.
  A page is allocated to create the headers that need to be
  duplicated (SNAP / IP / TCP). The WiFi header is in the
  header of the newly created SKBs.
  
  type=feature
  
  Change-Id: I238ffa79cacc5bbdacdfbf3e9673c8d4f02b462a
  Signed-off-by: Emmanuel Grumbach emmanuel.grumb...@intel.com
  ---
   drivers/net/wireless/iwlwifi/mvm/tx.c | 513 
  +++---
   1 file changed, 481 insertions(+), 32 deletions(-)
 
 Ouch dynamic allocations while doing xmit are certainly not needed.
 Your driver should pre-allocated space for headers.
 
 Drivers willing to implement tso have to use net/core/tso.c provided
 helpers.
 
 $ git grep -n tso_build_hdr
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c:1030:
 tso_build_hdr(skb, hdr, tso, data_left, total_len == 0);
 drivers/net/ethernet/freescale/fec_main.c:729:  tso_build_hdr(skb, 
 hdr, tso, data_left, total_len == 0);
 drivers/net/ethernet/marvell/mv643xx_eth.c:842: tso_build_hdr(skb, 
 hdr, tso, data_left, total_len == 0);
 drivers/net/ethernet/marvell/mvneta.c:1650: tso_build_hdr(skb, 
 hdr, tso, data_left, total_len == 0);
 include/net/tso.h:15:void tso_build_hdr(struct sk_buff *skb, char *hdr, 
 struct tso_t *tso,
 net/core/tso.c:14:void tso_build_hdr(struct sk_buff *skb, char *hdr, struct 
 tso_t *tso,
 net/core/tso.c:37:EXPORT_SYMBOL(tso_build_hdr);
 

Look at commit 2adb719d74f6e174071e5c913290b9bbd8c2c0e8 for a typical
use of these helpers.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 0/3] add TSO / A-MSDU TX for iwlwifi

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 17:00 +, Grumbach, Emmanuel wrote:
 
 On 08/19/2015 07:08 PM, Eric Dumazet wrote:
  On Wed, 2015-08-19 at 15:07 +, Grumbach, Emmanuel wrote:
  
  I'll look at it.
  I was almost starting to implement that but then I thought with another
  (good?) reason to use LSO. LSO gives me the guarantee that the packet is
  directed to one peer, which might not be the case with xmit_more since
  we have one Qdisc for several clients in case we are in AP mode.
  Building an A-MSDU for several clients is not possible, at least not for
  several client in the L2 (different MAC addresses).
  LSO avoids this problem completely.
  
  Then, simply calling skb_gso_segment() from the driver might be enough,
  and less work for you.
  
  This would even support TSO on IPv6
  
 
 Well... I did take care of IPv6.

net/core/tso.c does not yet handle IPv6


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 0/3] add TSO / A-MSDU TX for iwlwifi

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 17:56 +, Grumbach, Emmanuel wrote:
 

 So I feel that making net/core/tso.c more complicated just because of
 our craziness seems an overkill to me.
 I'll try a bit harder to see how I can use net/core/tso.c, but I have to
 say I am pessimistic.

net/core/tso.c is WIP, feel free to expand it to make it more generic
and meet your needs.

The point is : we want a core infrastructure, not something that each
individual driver implements in ~500 lines of code :(


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC v2 3/3] iwlwifi: mvm: transfer the truesize to the last TSO segment

2015-08-19 Thread Eric Dumazet

On Wed, 2015-08-19 at 19:17 +, Grumbach, Emmanuel wrote:

 Hm.. how would net/core/tso.c avoid this?

Because a driver using these helpers keep around the original LSO packet
and frees it normally at TX completion time.

 I can't see anything related to truesize there.
 Note that this work since it is guaranteed that we release the skbs in
 order.
 
  
  (BTW TCP packets do not have sock_wfree as destructor but tcp_wfree(),
  yet we want backpressure mostly for TCP stack (TCP Small Queues))
  
  
 
 I am not sure I follow here.
 You want me to test:
 if (skb_gso-destructor == tcp_wfree) ?


Yes.

Look for example at tcp_gso_segment() (called from skb_gso_segment())

copy_destructor = gso_skb-destructor == tcp_wfree;
...
/* Following permits TCP Small Queues to work well with GSO :
 * The callback to TCP stack will be called at the time last frag
 * is freed at TX completion, and not right now when gso_skb
 * is freed by GSO engine
 */
if (copy_destructor) {
swap(gso_skb-sk, skb-sk);
swap(gso_skb-destructor, skb-destructor);
sum_truesize += skb-truesize;
atomic_add(sum_truesize - gso_skb-truesize,
   skb-sk-sk_wmem_alloc);
}


 
 I checked that code using iperf and saw that I don't get into this if,
 but I (probably wrongly) assumed that other applications would set a
 flag on the socket (forgive my ignorance) that would make this if be taken.

If you do not see skb-destructor == tcp_wfree, then something is
definitely wrong on your setup.



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 00/14] shrink skb cb to 44 bytes

2015-03-02 Thread Eric Dumazet

On Mon, 2015-03-02 at 21:42 +0100, Florian Westphal wrote:

 Thats right.  Do you think its worth to already move cb[] near the end
 of skb and alter build_skb to not clear it anymore?
 
 Which of the ideas, in your opinion, is worth pursuing first (if any)?

moving cb[] near the end will void my patches to use one cache line per
skb in TCP receive queue ( or write queue)

971f10eca186ca tcp: better TCP_SKB_CB layout to reduce cache line misses


I have worked a bit (3 months ago) about doing the skb-cb[] selective
clearing, but a lot of alloc_skb() users _assume_ it is already
cleared. 

That seemed a lot of work to me, because of the many alloc_skb()
variants we have. But definitely worth trying to complete.



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 00/14] shrink skb cb to 44 bytes

2015-03-02 Thread Eric Dumazet

On Mon, 2015-03-02 at 17:17 -0500, David Miller wrote:
 From: Eric Dumazet eric.duma...@gmail.com
 Date: Mon, 02 Mar 2015 11:49:23 -0800

  Size of skb-cb[] is not the major factor. Trying to gain 4 or 8 bytes
  is not going to improve performance a lot.

  The real problem is that we clear it in skb_alloc()/build_skb(), instead
  of each layer doing so at demand, and only the part that matters for
  this layer.

  Basically, skb-cb[] could be 80 or 160 bytes instead of 48, and we
  should not care, as long as no layer does a stupid/lazy 

  memset(skb-cb, 0, sizeof(skb-cb))

  Presumably skb_clone() could be extended to receive the length of
  skb-cb[] that current layer cares about.

 Regardless, I think Florian's work has value.

Of course. I hope my answer was not implying the contrary !

48 - 44 cb change, with the 8 bytes alignment and various __packed
tricks that might confuse compilers on some arches, will be hard to
quantify in term of performances on all arches.

About the GRO layout change, reason why 'struct sk_buff *last;' is at
the end of struct napi_gro_cb is that this field is not used in fast
path.

Note : We could try to use one bit in skb to advertise zero shinfo(skb).

Many skbs have a zeroed shinfo() (but shinfo-dataref == 1) , and
dereferencing skb_shinfo adds a cache line miss. 

- We could avoid memset(shinfo, 0, offsetof(struct skb_shared_info,
dataref))  atomic_set(shinfo-dataref, 1); 

 in alloc_skb() and friends completely.

Unfortunately this kind of change would be quite invasive...

--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ath9k_htc: avoid memcpy when downloading firmware

2015-03-02 Thread Eric Dumazet

On Tue, 2015-03-03 at 12:24 +0800, Fred Chou wrote:
 From: Fred Chou fred.chou...@gmail.com
 
 The temporary buffer to hold firmware data is not really needed,
 and memcpy can be avoided by using data pointer instead.
 
 Signed-off-by: Fred Chou fred.chou...@gmail.com
 ---
  drivers/net/wireless/ath/ath9k/hif_usb.c | 12 ++--
  1 file changed, 2 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/net/wireless/ath/ath9k/hif_usb.c 
 b/drivers/net/wireless/ath/ath9k/hif_usb.c
 index 10c02f5..0bc35a8 100644
 --- a/drivers/net/wireless/ath/ath9k/hif_usb.c
 +++ b/drivers/net/wireless/ath/ath9k/hif_usb.c
 @@ -986,30 +986,22 @@ static int ath9k_hif_usb_download_fw(struct 
 hif_device_usb *hif_dev)
   const void *data = hif_dev-fw_data;

Here data can be vmalloc() backed.

   size_t len = hif_dev-fw_size;
   u32 addr = AR9271_FIRMWARE;
 - u8 *buf = kzalloc(4096, GFP_KERNEL);

Here buf is kmalloc() backed.

   u32 firm_offset;
  
 - if (!buf)
 - return -ENOMEM;
 -
   while (len) {
   transfer = min_t(size_t, len, 4096);
 - memcpy(buf, data, transfer);
  
   err = usb_control_msg(hif_dev-udev,
 usb_sndctrlpipe(hif_dev-udev, 0),
 FIRMWARE_DOWNLOAD, 0x40 | USB_DIR_OUT,
 -   addr  8, 0, buf, transfer, HZ);


Are you sure usb_control_msg() accepts vmalloc()ed buffers ?

My guess is the answer is no.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 00/14] shrink skb cb to 44 bytes

2015-03-02 Thread Eric Dumazet

On Mon, 2015-03-02 at 18:40 +0100, Florian Westphal wrote:
 Following patches shrink all in-tree users of skb-cb[] so that kernel
 still builds with skb-cb[] set to 44 bytes.
 
 This would create a 4-byte hole, it would be easy to reorder skb
 members so this hole is filled.
 
 pahole dump for vmlinux with allyesconfig (+this patch series):
 
 http://www.strlen.de/fw/pahole.txt.gz
 
 remarks/known issues:
 - adds __packet attribute to a few structures.
   Its needed to not have padding at end of the structure, else
   we get build assertion errors even if all members fit into cb[].
 - checkpatch isn't happy yet.
 - dccp changes are untested (its on my todo list)
 - rxrpc change is untested (on todo list).
 - wireless changes are untested, I don't own any of the affected hw.
 
 The idea is to figure out what needs to be done to make build_skb() and
 friends only touch the first 3 cachelines of the skb on all architectures.
 
 We'd need to reduce skbuff by 40 bytes to achieve this for allyesconfig.
 
 Other sk_buff reduction ideas being worked:
 
 - move truesize to shinfo (which has 4 byte hole)
 - turn -data into offset on 64bit platforms (intrusive,  1000 files
   affected)
 - move pointers that are ususally not needed (nf_bridge, secpath) to the end,
   with flag field that tells us when pointer is valid (so we don't have
   to memset() them unconditionally at allocation time).
 - seems we could already to this for the inner header fields since the're only
   valid once -encapsulation / inner_protocol_type is set.
 
 Comments and more ideas welcome.

Size of skb-cb[] is not the major factor. Trying to gain 4 or 8 bytes
is not going to improve performance a lot.

The real problem is that we clear it in skb_alloc()/build_skb(), instead
of each layer doing so at demand, and only the part that matters for
this layer.

Basically, skb-cb[] could be 80 or 160 bytes instead of 48, and we
should not care, as long as no layer does a stupid/lazy 

memset(skb-cb, 0, sizeof(skb-cb))

Presumably skb_clone() could be extended to receive the length of
skb-cb[] that current layer cares about.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-11 Thread Eric Dumazet

On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

 If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
 w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long default Qdisc is pfifo_fast.

I believe I already mentioned skb-truesize tricks for drivers willing
to adjust the TSQ given their constraints.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet

On Tue, 2015-02-10 at 15:19 +0100, Johannes Berg wrote:
 On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
 
  +   if (msdu-sk) {
  +   ewma_add(ar-tx_delay_us,
  +ktime_to_ns(ktime_sub(ktime_get(), skb_cb-stamp)) 
  /
  +NSEC_PER_USEC);
  +
  +   ACCESS_ONCE(msdu-sk-sk_tx_completion_delay_cushion) =
  +   (ewma_read(ar-tx_delay_us) *
  +msdu-sk-sk_pacing_rate)  20;
  +   }
 
 To some extent, every wifi driver is going to have this problem. Perhaps
 we should do this in mac80211?

I'll provide the TCP patch.

sk-sk_tx_completion_delay_cushion is probably a wrong name, as the
units here are in bytes, since it is really number of bytes in the
network driver that accommodate for tx completions delays. 

tx_completion_delay * pacing_rate

sk_tx_completion_cushion maybe.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet

On Tue, 2015-02-10 at 04:54 -0800, Eric Dumazet wrote:

 Hi Michal
 
 This is almost it ;)
 
 As I said you must do this using u64 arithmetics, we still support 32bit
 kernels.
 
 Also,  20 instead of / 100 introduces a 5% error, I would use a
 plain divide, as the compiler will use a reciprocal divide (ie : a
 multiply)
 
 We use  10 instead of /1000 because a 2.4 % error is probably okay.
 
 ewma_add(ar-tx_delay_us,
  ktime_to_ns(ktime_sub(ktime_get(),
 skb_cb-stamp)) /
   NSEC_PER_USEC);

btw I suspect this wont compile on 32 bit kernel

You need to use do_div() as well :

u64 val = ktime_to_ns(ktime_sub(ktime_get(),
skb_cb-stamp));

do_div(val, NSEC_PER_USEC);

ewma_add(ar-tx_delay_us, val);


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rtlwifi: ratelimit skb allocation failure message

2015-02-10 Thread Eric Dumazet

On Tue, 2015-02-10 at 08:54 +, Colin King wrote:
 From: Colin Ian King colin.k...@canonical.com
 
 when running low on memory I noticed rtlwifi was producing a large
 quantity of repeated skb allocation failures messages.  This should
 be ratelimited to reduce the noise.
 
 Signed-off-by: Colin Ian King colin.k...@canonical.com
 ---
  drivers/net/wireless/rtlwifi/pci.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/net/wireless/rtlwifi/pci.c 
 b/drivers/net/wireless/rtlwifi/pci.c
 index c70efb9..ca0fd50 100644
 --- a/drivers/net/wireless/rtlwifi/pci.c
 +++ b/drivers/net/wireless/rtlwifi/pci.c
 @@ -817,7 +817,7 @@ static void _rtl_pci_rx_interrupt(struct ieee80211_hw *hw)
   /* get a new skb - if fail, old one will be reused */
   new_skb = dev_alloc_skb(rtlpci-rxbuffersize);
   if (unlikely(!new_skb)) {
 - pr_err(Allocation of new skb failed in %s\n,
 + pr_err_ratelimited(Allocation of new skb failed in 
 %s\n,
  __func__);

Or even better, remove the message.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

 +   if (msdu-sk) {
 +   ewma_add(ar-tx_delay_us,
 +ktime_to_ns(ktime_sub(ktime_get(), skb_cb-stamp)) /
 +NSEC_PER_USEC);
 +
 +   ACCESS_ONCE(msdu-sk-sk_tx_completion_delay_cushion) =
 +   (ewma_read(ar-tx_delay_us) *
 +msdu-sk-sk_pacing_rate)  20;
 +   }
 +

Hi Michal

This is almost it ;)

As I said you must do this using u64 arithmetics, we still support 32bit
kernels.

Also,  20 instead of / 100 introduces a 5% error, I would use a
plain divide, as the compiler will use a reciprocal divide (ie : a
multiply)

We use  10 instead of /1000 because a 2.4 % error is probably okay.

ewma_add(ar-tx_delay_us,
 ktime_to_ns(ktime_sub(ktime_get(),
skb_cb-stamp)) /
NSEC_PER_USEC);
u64 val = (u64)ewma_read(ar-tx_delay_us) *
   msdu-sk-sk_pacing_rate;

do_div(val, USEC_PER_SEC);

ACCESS_ONCE(msdu-sk-sk_tx_completion_delay_cushion) =
(u32)val;
 
(WRITE_ONCE() would be better for new kernels, but ACCESS_ONCE() is ok
since we probably want to backport to stable kernels)


Thanks


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-09 Thread Eric Dumazet

On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:

 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 65caf8b..5e249bf 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
 unsigned int mss_now, int nonagle,
 max_segs = tcp_tso_autosize(sk, mss_now);
 while ((skb = tcp_send_head(sk))) {
 unsigned int limit;
 +   unsigned int amount;
 
 tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
 BUG_ON(!tso_segs);
 @@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
 unsigned int mss_now, int nonagle,
  * of queued bytes to ensure line rate.
  * One example is wifi aggregation (802.11 AMPDU)
  */
 -   limit = max(2 * skb-truesize, sk-sk_pacing_rate  10);
 +   amount = sk-sk_tx_completion_delay_us *
 +(sk-sk_pacing_rate  10);
 +   limit = max(2 * skb-truesize, amount  10);
 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 if (atomic_read(sk-sk_wmem_alloc)  limit) {

This is not what I suggested.

If you test this on any other network device, you'll have
sk-sk_tx_completion_delay_us == 0

amount = 0 * (sk-sk_pacing_rate  10); -- 0
limit = max(2 * skb-truesize, amount  10); -- 2 * skb-truesize

So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
per skb)

Then if you store only the last tx completion, you have the possibility
of having a last packet of a train (say a retransmit) to make it very
low.

Ideally the formula would be in TCP something very fast to compute :

amount = (sk-sk_pacing_rate  10) + sk-tx_completion_delay_cushion;
limit = max(2 * skb-truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

So a 'problematic' driver would have to do the math (64 bit maths) like
this :


sk-tx_completion_delay_cushion = ewma_tx_delay * sk-sk_pacing_rate;





--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet

On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


 wifi could eventually do that, providing in skb-tx_completion_delay_us
 the time spent in wifi driver.
 
 This way, we would have no penalty for network devices doing normal skb
 orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet

On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

 Hmm.. I confirm it works. However the value at which I get full rate
 on a single flow is more than 2048K. Also using non-default
 wmem_default seems to introduce packet loss as per iperf reports at
 the receiver. I suppose this is kind of expected but on the other hand
 wmem_default=262992 and 5 flows of UDP max the device out with 0
 packet loss.

If you increase ability to flood on one flow, then you need to make sure
receiver has big rcvbuf as well.

echo 200 /proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet

On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

 The above brings back previous behaviour, i.e. I can get 600mbps TCP
 on 5 flows again. Single flow is still (as it was before TSO
 autosizing) limited to roughly ~280mbps.
 
 I never really bothered before to understand why I need to push a few
 flows through ath10k to max it out, i.e. if I run a single UDP flow I
 get ~300mbps while with, e.g. 5 I get 670mbps easily.
 

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921450   10.00  697705  0 809.27
212992   10.00  673412781.09

# echo 80 /proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

801450   10.00 7329221  08501.84
212992   10.00 7284051   8449.44


 I guess it was the tx completion latency all along.
 
 I just put an extra debug to ath10k to see the latency between
 submission and completion. Here's a log
 (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
 trying to push 1gbps but managing only 300mbps.
 
 I've made sure to not hold any locks nor introduce internal to ath10k
 delays. Frames get completed between 2-4ms in avarage during load.


tcp_wfree() could maintain in tp-tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ  1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb-truesize, sk-sk_pacing_rate  9);

to

   amount = (2 + tp-tx_completion_delay_ms) * sk-sk_pacing_rate 

   limit = max(2 * skb-truesize, amount / 1000);

I'll cook a patch.

Thanks.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet

On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

 tcp_wfree() could maintain in tp-tx_completion_delay_ms an EWMA
 of TX completion delay. But this would require yet another expensive
 call to ktime_get() if HZ  1000.
 
 Then tcp_write_xmit() could use it to adjust :
 
limit = max(2 * skb-truesize, sk-sk_pacing_rate  9);
 
 to
 
amount = (2 + tp-tx_completion_delay_ms) * sk-sk_pacing_rate 
 
limit = max(2 * skb-truesize, amount / 1000);
 
 I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb-tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
 On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:

  Most conservative patch would be :
 
  diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c 
  b/drivers/net/wireless/ath/ath10k/htt_rx.c
  index 
  9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4
   100644
  --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
  +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
  @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, 
  struct sk_buff *skb)
  break;
  }
  case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
  +   skb_orphan(skb);
  spin_lock_bh(htt-tx_lock);
  __skb_queue_tail(htt-tx_compl_q, skb);
  spin_unlock_bh(htt-tx_lock);
 
 I suppose you want to call skb_orphan() on actual data packets, right?
 This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

 The intention is to control the queues to the following :
 
 1 ms of buffering, but limited to a configurable value.
 
 On a 40Gbps flow, 1ms represents 5 MB, which is insane.
 
 We do not want to queue 5 MB of traffic, this would destroy latencies
 for all concurrent flows. (Or would require having fq_codel or fq as
 packet schedulers, instead of default pfifo_fast)
 
 This is why having 1.5 ms delay between the transmit and TX completion
 is a problem in your case.

Note that TCP stack could detect when this happens, *if* ACK where
delivered before the TX completions, or when TX completion happens,
we could detect that the clone of the freed packet was freed.

In my test, when I did ethtool -C eth0 tx-usecs 1024 tx-frames 64, and
disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
blocks on tcp_limit_output_bytes.

Then we receive 2 stretch ACKS after ~50 usec.

TCP stack tries to push again some packets but blocks on
tcp_limit_output_bytes again.

1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
push following ~60 packets.


TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
using a per flow dynamic value, but I would rather not add a kludge in
TCP stack only to deal with a possible bug in ath10k driver.

niu has a similar issue and simply had to call skb_orphan() :

drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 05:19 -0800, Eric Dumazet wrote:

 
 TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
 using a per flow dynamic value, but I would rather not add a kludge in
 TCP stack only to deal with a possible bug in ath10k driver.
 
 niu has a similar issue and simply had to call skb_orphan() :
 
 drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);

In your case that might be the place :

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c 
b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 4bc51d8a14a3..cbda7a87d5a1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -468,6 +468,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff 
*msdu)
msdu_id = res;
htt-pending_tx[msdu_id] = msdu;
spin_unlock_bh(htt-tx_lock);
+   skb_orphan(msdu);
 
prefetch_len = min(htt-prefetch_len, msdu-len);
prefetch_len = roundup(prefetch_len, 4);



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 09:38 +0100, Michal Kazior wrote:
 On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:
  I do not see how a TSO patch could hurt a flow not using TSO/GSO.
 
  This makes no sense.
 
 Hmm..
 
 @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
 unsigned int mss_now, int nonagle,
  * of queued bytes to ensure line rate.
  * One example is wifi aggregation (802.11 AMPDU)
  */
 -   limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
 - sk-sk_pacing_rate  10);
 +   limit = max(2 * skb-truesize, sk-sk_pacing_rate  10);
 +   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 if (atomic_read(sk-sk_wmem_alloc)  limit) {
 set_bit(TSQ_THROTTLED, tp-tsq_flags);
 
 Doesn't this effectively invert how tcp_limit_output_bytes is used?
 This would explain why raising the limit wasn't changing anything
 anymore when you asked me do so. Only decreasing it yielded any
 change.
 
 I've added a printk to show up the new and old values. Excerpt from logs:
 
 [  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)
 
 (2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = 
 limit)
 
 Reverting this patch hunk alone fixes my TCP problem. Not that I'm
 saying the old logic was correct (it seems it wasn't, a limit should
 be applied as min(value, max_value), right?).
 
 Anyway the change doesn't seem to be TSO-only oriented so it would
 explain the makes no sense.


The intention is to control the queues to the following :

1 ms of buffering, but limited to a configurable value.

On a 40Gbps flow, 1ms represents 5 MB, which is insane.

We do not want to queue 5 MB of traffic, this would destroy latencies
for all concurrent flows. (Or would require having fq_codel or fq as
packet schedulers, instead of default pfifo_fast)

This is why having 1.5 ms delay between the transmit and TX completion
is a problem in your case.

 



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

 I do get your point. But 1.5ms is really tough on Wi-Fi.
 
 Just look at this:
 
 ; ping 192.168.1.2 -c 3
 PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

Thats a different point.

I dont care about rtt but TX completions. (usually much much lower than
rtt)

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire and I get TX completion IRQ, free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

 Ok. I tried calling skb_orphan() right after I submit each Tx frame
 (similar to niu which does this in start_xmit):
 
 --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
 +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
 @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
 sk_buff *msdu)
 if (res)
 goto err_unmap_msdu;
 
 +   skb_orphan(msdu);
 +
 return 0;
 
  err_unmap_msdu:
 
 
 Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
 single flow (even better then before). Wow.
 
 Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically remove years of work trying to fix bufferbloat.

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.




--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet

On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

 Not at all. This basically removes backpressure.
 
 A single UDP socket can now blast packets regardless of SO_SNDBUF
 limits.
 
 This basically remove years of work trying to fix bufferbloat.
 
 I still do not understand why increasing tcp_limit_output_bytes is not
 working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from GSO assumption. Maybe Herbert was right,
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb-truesize is roughly 2
ms worth of traffic.

Because you do not use GSO, and tx completions are slow, we need this :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
break;
 
/* TCP Small Queues :
-* Control number of packets in qdisc/devices to two packets / 
or ~1 ms.
+* Control number of packets in qdisc/devices to two packets /
+* or ~2 ms (sk-sk_pacing_rate  9) in case GSO is off.
 * This allows for :
 *  - better RTT estimation and ACK scheduling
 *  - faster recovery
@@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
 * of queued bytes to ensure line rate.
 * One example is wifi aggregation (802.11 AMPDU)
 */
-   limit = max(2 * skb-truesize, sk-sk_pacing_rate  10);
+   limit = max(2 * skb-truesize, sk-sk_pacing_rate  9);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
if (atomic_read(sk-sk_wmem_alloc)  limit) {


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet

I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
probability to hit this condition in tcp_wfree() :

/* If this softirq is serviced by ksoftirqd, we are likely under stress.
 * Wait until our queues (qdisc + devices) are drained.
 * This gives :
 * - less callbacks to tcp_write_xmit(), reducing stress (batches)
 * - chance for incoming ACK (processed by another cpu maybe)
 *   to migrate this flow (skb-ooo_okay will be eventually set)
 */
if (wmem = SKB_TRUESIZE(1)  this_cpu_ksoftirqd() == current)
goto out;

Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.

I have hard time to understand why tx completed packets go through
ath10k_htc_rx_completion_handler().. anyway...

Most conservative patch would be :

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c 
b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 
9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4
 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct 
sk_buff *skb)
break;
}
case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+   skb_orphan(skb);
spin_lock_bh(htt-tx_lock);
__skb_queue_tail(htt-tx_compl_q, skb);
spin_unlock_bh(htt-tx_lock);


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet

On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
 On 4 February 2015 at 12:57, Eric Dumazet eric.duma...@gmail.com wrote:

 
  To disable gso you would have to use :
 
  ethtool -K wlan1 gso off
 
 Oh, thanks! This works. However I can't turn it on:
 
 ; ethtool -K wlan1 gso on
 Could not change any device features
 
 ..so I guess it makes no sense to re-run tests because:
 
 ; ethtool -k wlan1 | grep generic
 tx-checksum-ip-generic: on [fixed]
 generic-segmentation-offload: off [requested on]
 generic-receive-offload: on
 
 And this seems to never change.

GSO requires SG (Scatter Gather)

Are you sure this hardware has no SG support ?


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet

OK guys

Using a mlx4 testbed I can reproduce the problem by pushing coalescing
settings and disabling SG (thus disabling GSO)

ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
tx-scatter-gather: off
generic-segmentation-offload: off [requested on]

ethtool -C eth0 tx-usecs 1024 tx-frames 64

Meaning that NIC waits one ms before sending the TX IRQ,
and can accumulate 64 frames before forcing the interrupt.

We probably have a bug in cwnd expansion logic :

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 
cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
Recv   SendSend  Utilization   Service Demand
Socket Socket  Message  Elapsed  Send Recv SendRecv
Size   SizeSize Time Throughput  localremote   local   remote
bytes  bytes   bytessecs.10^6bits/s  % S  % S  us/KB   us/KB

 87380  16384  1638410.00   530.39   0.40 0.32 2.965   2.398  


- final cwnd=59 which is not enough to avoid the 1ms delay between each
burst. 

So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
before sending the following burst.

I am CCing Neal, he probably can help to root cause the problem.

Thanks


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-03 Thread Eric Dumazet

On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

 Are packets TX completed after a timer or something ?
 
 Some very heavy stuff might run from tasklet (or other softirq triggered) 
 event.
 

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
(ath10k: batch htt tx/rx completions)
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the 
_bh() prefixes are not really needed.

It seems lot of batching happens in wifi drivers, not necessarily at the
right places.



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet

On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

 It seems to break ACK clocking badly (linux stack has a somewhat buggy
 tcp_tso_should_defer(), which relies on ACK being received smoothly, as
 no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@ static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
return true;
 
 send_now:
-   tp-tso_deferred = 0;
return false;
 }
 
@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
break;
 
+   tp-tso_deferred = 0;
 repair:
/* Advance the send_head.  This one is sent out.
 * This call will increment packets_out.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet

On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

 It is a big throughput win to have fewer TCP ack packets on
 wireless since it is a half-duplex environment.  Is there anything
 we could improve so that we can have fewer acks and still get
 good tcp stack behaviour?

First apply TCP stretch ack fixes to the sender. There is no way to get
good performance if the sender does not handle stretch ack.

d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
c22bdca94782 tcp: fix stretch ACK bugs in Reno
814d488c6126 tcp: fix the timid additive increase on stretch ACKs
e73ebb0881ea tcp: stretch ACK fixes prep

Then, make sure you do not throttle ACK too long, especially if you hope
to get Gbit line rate on a 4 ms RTT flow.

GRO does not mean : send one ACK every ms, or after 3ms delay...

It is literally :
  aggregate X packets at receive, and send the ACK asap.

If the receiver expects to have 64 ACK packets in the TX ring buffer to
actually send them (wifi aggregation), then you certainly do not want to
compress ACK too much.



--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet

On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:

 While testing I've had my internal GRO patch for ath10k and no stretch
 ack patches.

Thanks for the data, I took a look at it.

I am afraid this GRO patch might be the problem.

It seems to break ACK clocking badly (linux stack has a somewhat buggy
tcp_tso_should_defer(), which relies on ACK being received smoothly, as
no timer is setup to split the TSO packet.)

I am seeing huge delays on ACK packets and bursts like that :

05:01:53.413038 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 76745, 
win 4435, options [nop,nop,TS val 4294758508 ecr 4294757300], length 0
05:01:53.413407 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 79641, 
win 4435, options [nop,nop,TS val 4294758508 ecr 4294757301], length 0
05:01:53.413969 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 92673, 
win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.413990 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 97017, 
win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.414011 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 110049, 
win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
...
05:01:53.422663 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 189689, 
win 4435, options [nop,nop,TS val 4294758519 ecr 4294757310], length 0
05:01:53.424354 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 198377, 
win 4435, options [nop,nop,TS val 4294758520 ecr 4294757311], length 0
05:01:53.424400 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 202721, 
win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
05:01:53.424409 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 205617, 
win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
...
05:01:53.450248 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 419921, 
win 4435, options [nop,nop,TS val 4294758547 ecr 4294757337], length 0
05:01:53.450266 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 427161, 
win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0
05:01:53.450289 IP 192.168.1.2.5001  192.168.1.3.49669: Flags [.], ack 431505, 
win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0

Could you make again your experiments using upstream kernel (David
Miller net tree) ?

You also could post the GRO patch so that we can comment on it.

Thanks


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-30 Thread Eric Dumazet

On Fri, 2015-01-30 at 14:47 +0100, Arend van Spriel wrote:

 Indeed and that is what we would like to address in our wireless 
 drivers. I will setup some experiments using the fraction sizing and 
 post my findings. Again sorry if I offended you.

You did not, but I had no feedback about my suggestions.

Michal sent it now.

Thanks


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-30 Thread Eric Dumazet

On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

 I've briefly tried playing with this knob to no avail unfortunately. I
 tried 256K, 1M - it didn't improve TCP performance. When I tried to
 make it smaller (e.g. 16K) the traffic dropped even more so it does
 have an effect. It seems there's some other limiting factor in this
 case.

Interesting.

Could you take some tcpdump/pcap with various tcp_limit_output_bytes
values ?

echo 131072 /proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 2 -w 128k.pcap

echo 262144 /proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 2 -w 256k.pcap

...

Thanks !


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-29 Thread Eric Dumazet

On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
 Hi,
 
 I'm not subscribed to netdev list and I can't find the message-id so I
 can't reply directly to the original thread `BW regression after tcp:
 refine TSO autosizing`.
 
 I've noticed a big TCP performance drop with ath10k
 (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
 get 250mbps in my testbed.
 
 After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
 `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
 to non-TSO packets` (for conflict free reverts) fixes the problem.
 
 My testing setup is as follows:
 
  a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
  b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
 3.19-rc5, w/ reverts
  c) (b) w/o reverts
 
 Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
 2x2 has 866mbps modulation rate. In practice this should deliver
 ~700mbps of real UDP traffic.
 
 Here are some numbers:
 
 UDP: (b) - (a): 672mbps
 UDP: (a) - (b): 687mbps
 TCP: (b) - (a): 526mbps
 TCP: (a) - (b): 500mbps
 
 UDP: (c) - (a): 669mbps*
 UDP: (a) - (c): 689mbps*
 TCP: (c) - (a): 240mbps**
 TCP: (a) - (c): 490mbps*
 
 * no changes/within error margin
 ** the performance drop
 
 I'm using iperf:
   UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
   TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
 
 Result values were obtained at the receiver side.
 
 Iperf reports a few frames lost and out-of-order at each UDP test
 start (during first second) but later has no packet loss and no
 out-of-order. This shouldn't have any effect on a TCP session, right?
 
 The device delivers batched up tx/rx completions (no way to change
 that). I suppose this could be an issue for timing sensitive
 algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
 aggregation windows so there's an inherent extra (and non-uniform)
 latency when compared to, e.g. ethernet devices.
 
 The driver doesn't have GRO. I have an internal patch which implements
 it. It improves overall TCP traffic (more stable, up to 600mbps TCP
 which is ~100mbps more than without GRO) but the TCP: (c) - (a)
 performance drop remains unaffected regardless.
 
 I've tried applying stretch ACK patchset (v2) on both machines and
 re-run the above tests. I got no measurable difference in performance.
 
 I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
 (c). It didn't seem to be affected by the TSO patch at all (it runs at
 ~360mbps of TCP regardless of the TSO patch).
 
 Any hints/ideas?
 

Hi Michal

This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.

vi +630 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
gets losses notifications. With SNDBUF autotuning, this can
result in a large amount of packets queued in qdisc/device
on the local machine, hurting latency of other flows, for
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
Default: 131072

This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :

ethtool -C eth0 tx-frames 4 rx-frames 4

With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)

If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :

1) tweak tcp_limit_output_bytes, but its not practical from a driver.

2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb-truesize at ndo_start_xmit() time as in :

if ((skb-destructor == sock_wfree ||
 skb-restuctor == tcp_wfree) 
skb-sk) {
u32 fraction = skb-truesize / 2;

skb-truesize -= fraction;
atomic_sub(fraction, skb-sk-sk_wmem_alloc);
}

Thanks.


--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

64 matches

Mail list logo