Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-24 Thread Johannes Berg
On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:

  Good point. I was actually thinking about it. I can try cooking a
  patch unless you want to do it yourself :-)
 
 I've taken a look into this. The most obvious place to add the
 timestamp for each packet would be ieee80211_tx_info (i.e. the
 skb->cb[48]). The problem is it's very tight there. Even squeezing 2
 bytes (allowing up to 64ms of tx completion delay which I'm worried
 won't be enough) will be troublesome. Some drivers already use every
 last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
 40 bytes of driver_data).

Couldn't we just repurpose the existing skb->tstamp field for this, as
long as the skb is fully contained within the wireless layer?

Actually, it looks like we can't, since I guess timestamping options can
be turned on on any socket.

 I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
 of changes required to implement the tx completion delay accounting?

I have no doubt that would be rejected :)

johannes

--
To unsubscribe from this list: send the line unsubscribe linux-wireless in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-24 Thread Johannes Berg
On Tue, 2015-02-24 at 11:30 +0100, Johannes Berg wrote:
 On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
  On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
  
Good point. I was actually thinking about it. I can try cooking a
patch unless you want to do it yourself :-)
   
   I've taken a look into this. The most obvious place to add the
   timestamp for each packet would be ieee80211_tx_info (i.e. the
   skb->cb[48]). The problem is it's very tight there. Even squeezing 2
   bytes (allowing up to 64ms of tx completion delay which I'm worried
   won't be enough) will be troublesome. Some drivers already use every
   last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
   40 bytes of driver_data).
  
  Couldn't we just repurpose the existing skb->tstamp field for this, as
  long as the skb is fully contained within the wireless layer?
  
  Actually, it looks like we can't, since I guess timestamping options can
  be turned on on any socket.
 
 Actually, that creates a clone or a new skb? Hmm.

Ah and then it puts it on the error queue right away, so I think we can
reuse it.

johannes



Wifi outside the faraday cage (was: Throughput regression with `tcp: refine TSO autosizing`)

2015-02-12 Thread Dave Taht
On Fri, Feb 6, 2015 at 1:57 AM, Michal Kazior michal.kaz...@tieto.com wrote:
 On 5 February 2015 at 20:50, Dave Taht dave.t...@gmail.com wrote:
 [...]
 And I really, really, really wish, that just once during this thread,
 someone had bothered to try running a test
 at a real world MCS rate - say MCS1, or MCS4, and measured the latency
 under load of that...

 Time between frame submission to firmware and tx-completion on one of
 my ath10k machines:

THANK YOU for running these tests!


 Legacy 54mbps: ~18ms
 Legacy 6mbps: ~37ms

legacy rates are what many people actually achieve, given the
limited market penetration ac has on clients and APs.

 11n MCS 3 (nss=0): ~13ms
 11n MCS 8 (nss=1): ~6-8ms
 11ac NSS=1 MCS=2: ~4-6ms
 11ac NSS=2 MCS=0: ~5-8ms

 Keep in mind this is a clean room environment so retransmissions are
 kept at minimum. Obviously with a noisy environment you'll get retries
 at different rates and higher latency.

It is difficult to reconcile the results you get in the clean room
with the results I get from measurements in the real world. I encourage
you to go test your code in coffee shops, in offices with wifi, and in
hotels and apartment buildings in preference to testing in the lab.

I typically measure induced delays in the 3 to 6 second range in your
typical conference scenario, which I measure at every conference I go
to. The latest talk, including data on that, is Friday morning,
starting at 2:15 or so, at nznog:

http://new.livestream.com/i-filmservices/NZNOG2015/videos/75358960

1) In the real world, I rarely see the maximum *rates*.

I am personally quite fond of designing stuff with gears out of the
middle of the Boston Gear Catalog. [1]. In looking over my largely
outdoor wifi network, I see a cluster of values around MCS11,
followed by MCS4, 3, 7 and 0, and *nothing* with MCS15. David Lang is
planning on doing some measurements at the SCALE conference next week,
and I expect heaps of data from that, but I strongly suspect that the
vast majority of connections in every circumstance except the
test-bench are not even coming close to the maximum MCS rate in the
standard(s).

I would have thought that the lessons of the chromecast, where *every*
attempt at reviewing it in an office environment failed, might have
supplied industry clue that even 20Mbit to a given station is
impossible in many cases due to airtime contention.

Aggregates should be sized to have a maximum of 2 full ones stacked up
at the rate being achieved for the destination, and the rest
backlogged in the qdisc layer, if possible. 37ms backed up in the
firmware is a lot, considering that the test above had no airtime
contention in it, and no multicast.
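
The sizing rule above (at most two full aggregates queued below the qdisc, at the rate actually being achieved) can be turned into a byte budget. This is only a sketch: the helper names are hypothetical and not part of mac80211 or any driver, the aggregate airtime is a caller-supplied assumption, and the PHY rate is used as if it were the achieved rate.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a mac80211 API: bytes that fit into
 * `airtime_us` microseconds at `rate_kbps` kilobits per second.
 * kbit/s * us yields millibits, so divide by 8000 to get bytes.
 */
static uint64_t airtime_to_bytes(uint32_t rate_kbps, uint32_t airtime_us)
{
        return ((uint64_t)rate_kbps * airtime_us) / 8000;
}

/* Budget for "two full aggregates" queued in the driver/firmware. */
static uint64_t firmware_queue_budget(uint32_t rate_kbps, uint32_t aggr_us)
{
        return 2 * airtime_to_bytes(rate_kbps, aggr_us);
}
```

With these assumptions, airtime_to_bytes(6000, 4000) is 3000 bytes (4 ms at legacy 6 Mbit/s), and firmware_queue_budget(54000, 4000) is 54000 bytes.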

Drivers need to be aware that every TXOP is precious. I could see
having a watchdog timer set on getting one packet into a wifi driver
to wait a few hundred usec longer to fire off the write to the
hardware in order to maximize aggregation by accumulating more packets
to aggregate.

I have hopes for xmit_more also being useful, but I am really not sure
how well that works on single cores, interactions with napi, and with
other wifi aggregates. It looks like adding xmit_more to the ag71xx
driver will be easy...

2) In the real world I see media acquisition times *far* greater than 1ms.

Please feel free to test your drivers in coffee shops, in the office,
at hotels, in apartments...

And retries... let's not talk about retries...


3) Longer AMPDUs lead to more tail loss and retries

I have a paper around here somewhere that shows AMPDU loss and retries
go up disproportionately as the length of transmission approaches 4ms.
I hate drawing a conclusion from a paper I can't find, but my overall
take on it is that as media acquisition time and retransmits go up,
reducing AMPDU size from the maximum down to about 1ms at the current
rate would lead to more fair, responsive, and fast-feeling wifi for
everyone, improve ack clocking, flow mixing for web traffic, etc, etc.

4) There is some fairly decent academic work on other aspects of
excessive buffering at lower rates

http://hph16.uwaterloo.ca/~bshihada/publications/buffer-AMPDU.pdf

(there are problems with this paper, but at least it tests n)

and see google scholar for bufferbloat related papers in 2014 and
later on wifi and LTE.

5) As for rate control, Minstrel was designed in an era when there
wasn't one AP for every 4 people in the USA. Other people's rate
controllers are even dumber, and minstrel-ht itself needs a hard look
at n speeds, much less ac speeds.

6) Everything I say above applies to both stations and APs.

APs have FAR worse problems, where per-tid (station) queuing is really
needed in order to effectively aggregate when two or more stations are
in use. Statistically, with two or more stations using traffic,
aggregation possibilities will go down rapidly on a FIFO, (and go down
even faster with FQ in place without per sta queuing!),  and with the
usual fixed buffersize underneath that, without per-tid 

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-12 Thread Dave Taht
On Wed, Feb 11, 2015 at 11:48 PM, Michal Kazior michal.kaz...@tieto.com wrote:
 On 11 February 2015 at 09:57, Michal Kazior michal.kaz...@tieto.com wrote:
 On 10 February 2015 at 15:19, Johannes Berg johan...@sipsolutions.net 
 wrote:
 On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

 +       if (msdu->sk) {
 +               ewma_add(&ar->tx_delay_us,
 +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
 +                        NSEC_PER_USEC);
 +
 +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
 +                       (ewma_read(&ar->tx_delay_us) *
 +                        msdu->sk->sk_pacing_rate) >> 20;
 +       }

 To some extent, every wifi driver is going to have this problem. Perhaps
 we should do this in mac80211?

 Good point. I was actually thinking about it. I can try cooking a
 patch unless you want to do it yourself :-)

 I've taken a look into this. The most obvious place to add the
 timestamp for each packet would be ieee80211_tx_info (i.e. the
 skb->cb[48]). The problem is it's very tight there. Even squeezing 2
 bytes (allowing up to 64ms of tx completion delay which I'm worried

I will argue strongly in favor of never allowing more than 4ms packets
to accumulate in the firmware.

 won't be enough) will be troublesome. Some drivers already use every
 last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
 40 bytes of driver_data).

 I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
 of changes required to implement the tx completion delay accounting?


 Michał



-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-11 Thread Michal Kazior
On 11 February 2015 at 09:57, Michal Kazior michal.kaz...@tieto.com wrote:
 On 10 February 2015 at 15:19, Johannes Berg johan...@sipsolutions.net wrote:
 On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

 +       if (msdu->sk) {
 +               ewma_add(&ar->tx_delay_us,
 +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
 +                        NSEC_PER_USEC);
 +
 +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
 +                       (ewma_read(&ar->tx_delay_us) *
 +                        msdu->sk->sk_pacing_rate) >> 20;
 +       }

 To some extent, every wifi driver is going to have this problem. Perhaps
 we should do this in mac80211?

 Good point. I was actually thinking about it. I can try cooking a
 patch unless you want to do it yourself :-)

I've taken a look into this. The most obvious place to add the
timestamp for each packet would be ieee80211_tx_info (i.e. the
skb->cb[48]). The problem is it's very tight there. Even squeezing 2
bytes (allowing up to 64ms of tx completion delay which I'm worried
won't be enough) will be troublesome. Some drivers already use every
last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
40 bytes of driver_data).
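
To illustrate where a limit of roughly 64ms comes from: a 2-byte field holding microseconds wraps every 65536 us. The sketch below assumes microsecond units and uses hypothetical helper names; nothing here is an existing mac80211 field or function.

```c
#include <stdint.h>

/*
 * Hypothetical 2-byte timestamp: the low 16 bits of a microsecond clock.
 * Names are illustrative, not taken from mac80211.
 */
static uint16_t stamp16(uint64_t now_us)
{
        return (uint16_t)now_us;
}

/*
 * Elapsed microseconds since `stamp`. The truncating subtraction copes
 * with one wrap of the counter, so the result is only correct while the
 * real delay stays below 65536 us (about 65.5 ms). Longer delays alias
 * to small values.
 */
static uint16_t delay16_us(uint64_t now_us, uint16_t stamp)
{
        return (uint16_t)(stamp16(now_us) - stamp);
}
```

A 3 ms delay that straddles a wrap is still measured as 3000 us, but a real delay of 70 ms would be reported as 4464 us, which is the failure mode the 64ms concern is about.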

I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
of changes required to implement the tx completion delay accounting?


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-11 Thread Michal Kazior
On 11 February 2015 at 14:17, Eric Dumazet eric.duma...@gmail.com wrote:
 On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

 If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
 w/ aggregation to reach 600mbps on a single flow.

 You know, there is a reason this sysctl exists in the first place ;)

 The first suggestion I made to you was to raise it.

 The default setting must stay as is as long default Qdisc is pfifo_fast.

 I believe I already mentioned skb->truesize tricks for drivers willing
 to adjust the TSQ given their constraints.

Right. truesize didn't help in my early tests and once the cushion
thing came about I had assumed that it's not relevant anymore.

I just checked:

@@ -2620,6 +2621,12 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
                           "IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       if (skb->sk) {
+               u32 trim = skb->truesize - (skb->truesize / 8);
+               skb->truesize -= trim;
+               atomic_sub(trim, &skb->sk->sk_wmem_alloc);
+       }

With this I get 600mbps on a single flow. The /2 wasn't enough (it
barely made a difference, 250-300mbps). The question is how do I know
how much trimming is too much? Could the tx completion delay be
used to compute the trim factor, hmm..

Maybe this should be done in mac80211 as well?
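
As an illustration of what the trim does to the socket accounting, here is a userspace model. It is a sketch only: plain integers stand in for skb->truesize and sk->sk_wmem_alloc, and the function name and the `divisor` parameter are hypothetical (the patch hard-codes 8).

```c
#include <stdint.h>

/*
 * Userspace model of the trim in the patch above. After the call the
 * skb is charged truesize / divisor instead of its full truesize, so
 * the TSQ check sees fewer queued bytes per packet.
 * Returns the amount removed from the socket's accounting.
 */
static uint32_t trim_truesize(uint32_t *truesize, uint32_t *wmem_alloc,
                              uint32_t divisor)
{
        uint32_t trim = *truesize - (*truesize / divisor);

        *truesize -= trim;      /* skb now accounts for truesize / divisor */
        *wmem_alloc -= trim;    /* mirrors atomic_sub() on sk_wmem_alloc */
        return trim;
}
```

For a truesize of 2304 and a divisor of 8, 2016 bytes are removed and the skb is charged 288 bytes, so roughly eight times as many packets fit under the same limit.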


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-11 Thread Eric Dumazet
On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

 If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
 w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long default Qdisc is pfifo_fast.

I believe I already mentioned skb->truesize tricks for drivers willing
to adjust the TSQ given their constraints.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet
On Tue, 2015-02-10 at 15:19 +0100, Johannes Berg wrote:
 On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
 
  +       if (msdu->sk) {
  +               ewma_add(&ar->tx_delay_us,
  +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
  +                        NSEC_PER_USEC);
  +
  +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
  +                       (ewma_read(&ar->tx_delay_us) *
  +                        msdu->sk->sk_pacing_rate) >> 20;
  +       }
 
 To some extent, every wifi driver is going to have this problem. Perhaps
 we should do this in mac80211?

I'll provide the TCP patch.

sk->sk_tx_completion_delay_cushion is probably the wrong name, as the
units here are bytes: it is really the number of bytes in the
network driver needed to accommodate tx completion delays.

tx_completion_delay * pacing_rate

sk_tx_completion_cushion maybe.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Michal Kazior
On 9 February 2015 at 16:11, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:
[...]
 This is not what I suggested.

 If you test this on any other network device, you'll have
 sk->sk_tx_completion_delay_us == 0

 amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
 limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

You're right. Sorry for mixing up.


 So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
 per skb)

 Then if you store only the last tx completion, you have the possibility
 of having a last packet of a train (say a retransmit) to make it very
 low.

 Ideally the formula would be in TCP something very fast to compute :

 amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
 limit = max(2 * skb->truesize, amount);
 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

 So a 'problematic' driver would have to do the math (64 bit maths) like
 this :


 sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

Hmm. So I've done like you suggested (hopefully I didn't mix anything
up this time around).

I now get pre-regression performance, ~250mbps on 1 flow, 600mbps on 5
flows (vs 250mbps whatever number of flows).


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 367e896..a29111c 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/firmware.h>
 #include <linux/of.h>
+#include <linux/average.h>

 #include "core.h"
 #include "mac.h"
@@ -1423,6 +1424,7 @@ struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
        init_dummy_netdev(&ar->napi_dev);
        ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
                           ath10k_core_napi_dummy_poll, 64);
+       ewma_init(&ar->tx_delay_us, 16384, 8);

        ret = ath10k_debug_create(ar);
        if (ret)
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..34f6d78 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -24,6 +24,7 @@
 #include <linux/pci.h>
 #include <linux/uuid.h>
 #include <linux/time.h>
+#include <linux/average.h>

 #include "htt.h"
 #include "htc.h"
@@ -82,6 +83,7 @@ struct ath10k_skb_cb {
dma_addr_t paddr;
u8 eid;
u8 vdev_id;
+   ktime_t stamp;

struct {
u8 tid;
@@ -625,6 +627,7 @@ struct ath10k {

struct net_device napi_dev;
struct napi_struct napi;
+   struct ewma tx_delay_us;

 #ifdef CONFIG_ATH10K_DEBUGFS
struct ath10k_debug debug;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
                           "IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0f5f0f2 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,8 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
+#include <linux/average.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +84,16 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ewma_add(&ar->tx_delay_us,
+                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
+                        NSEC_PER_USEC);
+
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
+                       (ewma_read(&ar->tx_delay_us) *
+                        msdu->sk->sk_pacing_rate) >> 20;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6772543 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -391,6 +391,7 @@ struct sock {
gfp_t   sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
u32 sk_max_pacing_rate;
+   u32 sk_tx_completion_delay_cushion;
netdev_features_t   sk_route_caps;
netdev_features_t   sk_route_nocaps;
int sk_gso_type;
diff --git a/net/ipv4/tcp_output.c 

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet
On Tue, 2015-02-10 at 04:54 -0800, Eric Dumazet wrote:

 Hi Michal
 
 This is almost it ;)
 
 As I said you must do this using u64 arithmetics, we still support 32bit
 kernels.
 
 Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
 plain divide, as the compiler will use a reciprocal divide (ie : a
 multiply)
 
 We use >> 10 instead of /1000 because a 2.4 % error is probably okay.
 
 ewma_add(&ar->tx_delay_us,
          ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
          NSEC_PER_USEC);

btw I suspect this won't compile on a 32-bit kernel

You need to use do_div() as well :

u64 val = ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp));

do_div(val, NSEC_PER_USEC);

ewma_add(&ar->tx_delay_us, val);




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-10 Thread Eric Dumazet
On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

 +       if (msdu->sk) {
 +               ewma_add(&ar->tx_delay_us,
 +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
 +                        NSEC_PER_USEC);
 +
 +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
 +                       (ewma_read(&ar->tx_delay_us) *
 +                        msdu->sk->sk_pacing_rate) >> 20;
 +       }
 +

Hi Michal

This is almost it ;)

As I said you must do this using u64 arithmetics, we still support 32bit
kernels.

Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
plain divide, as the compiler will use a reciprocal divide (ie : a
multiply)

We use >> 10 instead of /1000 because a 2.4 % error is probably okay.

ewma_add(&ar->tx_delay_us,
         ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
         NSEC_PER_USEC);
u64 val = (u64)ewma_read(&ar->tx_delay_us) *
          msdu->sk->sk_pacing_rate;

do_div(val, USEC_PER_SEC);

ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) = (u32)val;
 
(WRITE_ONCE() would be better for new kernels, but ACCESS_ONCE() is ok
since we probably want to backport to stable kernels)
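
The size of the rounding error can be checked with a small userspace model. This is a sketch with hypothetical function names; it uses a plain 64-bit divide where kernel code on 32-bit would need do_div(), and the sample delay and pacing rate used below are illustrative values, not measurements from this thread.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Exact cushion in bytes: delay (us) * rate (bytes/s) / 1e6, in 64 bits. */
static uint32_t cushion_exact(uint32_t delay_us, uint32_t pacing_rate)
{
        uint64_t val = (uint64_t)delay_us * pacing_rate;

        return (uint32_t)(val / USEC_PER_SEC);
}

/*
 * Shift approximation: >> 20 divides by 1048576 instead of 1000000.
 * The multiply is still done in 64 bits here so that only the rounding
 * error is visible, not a 32-bit overflow.
 */
static uint32_t cushion_shift(uint32_t delay_us, uint32_t pacing_rate)
{
        uint64_t val = (uint64_t)delay_us * pacing_rate;

        return (uint32_t)(val >> 20);
}
```

For a 4000 us delay at 75000000 bytes/s (600 Mbit/s) the exact cushion is 300000 bytes and the shifted one is 286102 bytes, about 4.6% lower. The product, 3 * 10^11, does not fit in 32 bits, which is why the 64-bit multiply matters.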


Thanks




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-09 Thread Eric Dumazet
On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:

 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 65caf8b..5e249bf 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
 unsigned int mss_now, int nonagle,
 max_segs = tcp_tso_autosize(sk, mss_now);
 while ((skb = tcp_send_head(sk))) {
 unsigned int limit;
 +   unsigned int amount;
 
 tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
 BUG_ON(!tso_segs);
 @@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
 unsigned int mss_now, int nonagle,
  * of queued bytes to ensure line rate.
  * One example is wifi aggregation (802.11 AMPDU)
  */
 -   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 +   amount = sk->sk_tx_completion_delay_us *
 +(sk->sk_pacing_rate >> 10);
 +   limit = max(2 * skb->truesize, amount >> 10);
 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 if (atomic_read(&sk->sk_wmem_alloc) > limit) {

This is not what I suggested.

If you test this on any other network device, you'll have
sk->sk_tx_completion_delay_us == 0

amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
per skb)

Then if you store only the last tx completion, you have the possibility
of having a last packet of a train (say a retransmit) to make it very
low.

Ideally the formula would be in TCP something very fast to compute :

amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
limit = max(2 * skb->truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

So a 'problematic' driver would have to do the math (64 bit maths) like
this :


sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;
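
The limit formula above can be modelled in userspace to see how the cushion and the sysctl interact. This is a sketch, not the TCP patch: the function and helper names are hypothetical, `limit_output_bytes` stands in for sysctl_tcp_limit_output_bytes, and the numbers used afterwards are example inputs.

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Model of the proposed TSQ limit: about 1 ms of data at the pacing
 * rate plus a driver-provided cushion in bytes, never below two skbs
 * and never above the sysctl.
 */
static uint32_t tsq_limit(uint32_t truesize, uint32_t pacing_rate,
                          uint32_t cushion, uint32_t limit_output_bytes)
{
        uint32_t amount = (pacing_rate >> 10) + cushion;
        uint32_t limit = max_u32(2 * truesize, amount);

        return min_u32(limit, limit_output_bytes);
}
```

With a zero cushion (a device that reports no tx completion delay) the result for a 75000000 bytes/s flow is 73242 bytes, the same as before the change. With a 300000 byte cushion the value is clamped to whatever the sysctl allows, for example 131072, so the cap still has to be raised before a large cushion takes full effect.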







Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-09 Thread Michal Kazior
On 6 February 2015 at 15:09, Michal Kazior michal.kaz...@tieto.com wrote:
 On 6 February 2015 at 14:53, Eric Dumazet eric.duma...@gmail.com wrote:
 On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

 tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
 of TX completion delay. But this would require yet another expensive
 call to ktime_get() if HZ < 1000.

 Then tcp_write_xmit() could use it to adjust :

    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

 to

    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate

    limit = max(2 * skb->truesize, amount / 1000);

 I'll cook a patch.

 Hmm... doing this in all protocols would be too expensive,
 and we do not want to include time spent in qdiscs.

 wifi could eventually do that, providing in skb->tx_completion_delay_us
 the time spent in wifi driver.

 This way, we would have no penalty for network devices doing normal skb
 orphaning (loopback interface, ethernet, ...)

 I'll play around with this idea and report back later.

I'm able to get 600mbps with 5 flows and 250mbps with 1 flow, i.e.
same as before the regression. I'm attaching the patch at the end of
my mail - is this approach viable?

I wonder if there's anything that can be done to allow 600mbps (line
rate) on 1 flow with ath10k without tweaking tcp_limit_output_bytes
(you can't expect end-users to tweak this).

Perhaps tcp_limit_output_bytes should also consider tx_completion_delay, e.g.:

  amount = sk->sk_tx_completion_delay_us;
  amount *= sk->sk_pacing_rate >> 10;
  limit = max(2 * skb->truesize, amount >> 10);
  max_limit = sysctl_tcp_limit_output_bytes;
  max_limit *= 1 + (sk->sk_tx_completion_delay_us / USEC_PER_MSEC);
  limit = min_t(u32, limit, max_limit);

With this I get ~400mbps on 1 flow. If I add the original 1ms extra
delay from your formula to the tx_completion_delay that I fill in ath10k,
I get nearly line rate in 1 flow (almost 600mbps; it hops between 570-620).
Decreasing tcp_limit_output_bytes decreases throughput (e.g. 64K gives
300mbps, 32K gives 180mbps, 16K gives 110mbps). Multiple flows in
iperf seem unbalanced with the 128K limit, but look okay with 32K.
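
For reference, the scaled-limit variant above can be modelled in userspace as follows. This is a sketch under assumptions: the function name is hypothetical, 64-bit intermediates are used to sidestep overflow (the kernel snippet does not show its types), and `limit_output_bytes` stands in for sysctl_tcp_limit_output_bytes.

```c
#include <stdint.h>

#define USEC_PER_MSEC 1000U

/*
 * Model of the variant above: the byte limit grows with the measured
 * tx completion delay, and the sysctl cap is scaled by the delay in
 * whole milliseconds plus one.
 */
static uint32_t tsq_limit_scaled(uint32_t truesize, uint32_t pacing_rate,
                                 uint32_t delay_us,
                                 uint32_t limit_output_bytes)
{
        uint64_t amount = (uint64_t)delay_us * (pacing_rate >> 10);
        uint64_t limit = amount >> 10;
        uint64_t max_limit;

        if (limit < 2ULL * truesize)
                limit = 2ULL * truesize;

        max_limit = (uint64_t)limit_output_bytes *
                    (1 + delay_us / USEC_PER_MSEC);

        return (uint32_t)(limit < max_limit ? limit : max_limit);
}
```

With a 4000 us delay, 75000000 bytes/s and a 131072 byte sysctl this gives 286101 bytes. With a zero delay it collapses to 2 * truesize (4608 bytes for a 2304 byte skb), since the pacing-rate term is multiplied by the delay.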


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..4ff0ae8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -82,6 +82,7 @@ struct ath10k_skb_cb {
dma_addr_t paddr;
u8 eid;
u8 vdev_id;
+   ktime_t stamp;

struct {
u8 tid;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
                           "IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0d5539b 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,7 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +83,13 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_us) =
+                       ktime_to_ns(ktime_sub(ktime_get(),
+                                             skb_cb->stamp)) /
+                       NSEC_PER_USEC;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6b15d71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,6 +390,7 @@ struct sock {
int sk_wmem_queued;
gfp_t   sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
+   u32 sk_tx_completion_delay_us;
u32 sk_max_pacing_rate;
netdev_features_t   sk_route_caps;
netdev_features_t   sk_route_nocaps;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..5e249bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {

Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Michal Kazior
On 5 February 2015 at 20:50, Dave Taht dave.t...@gmail.com wrote:
[...]
 And I really, really, really wish, that just once during this thread,
 someone had bothered to try running a test
 at a real world MCS rate - say MCS1, or MCS4, and measured the latency
 under load of that...

Time between frame submission to firmware and tx-completion on one of
my ath10k machines:

Legacy 54mbps: ~18ms
Legacy 6mbps: ~37ms
11n MCS 3 (nss=0): ~13ms
11n MCS 8 (nss=1): ~6-8ms
11ac NSS=1 MCS=2: ~4-6ms
11ac NSS=2 MCS=0: ~5-8ms

Keep in mind this is a clean room environment so retransmissions are
kept at minimum. Obviously with a noisy environment you'll get retries
at different rates and higher latency.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Michal Kazior
On 5 February 2015 at 18:10, Eric Dumazet eric.duma...@gmail.com wrote:
 On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

 Not at all. This basically removes backpressure.

 A single UDP socket can now blast packets regardless of SO_SNDBUF
 limits.

 This basically remove years of work trying to fix bufferbloat.

 I still do not understand why increasing tcp_limit_output_bytes is not
 working for you.

 Oh well, tcp_limit_output_bytes might be ok.

 In fact, the problem comes from GSO assumption. Maybe Herbert was right,
 when he suggested TCP would be simpler if we enforced GSO...

 When GSO is used, the thing works because 2*skb->truesize is roughly 2
 ms worth of traffic.

 Because you do not use GSO, and tx completions are slow, we need this :

 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 65caf8b95e17..ac01b4cd0035 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
 int mss_now, int nonagle,
 break;

 /* TCP Small Queues :
 -* Control number of packets in qdisc/devices to two packets / or ~1 ms.
 +* Control number of packets in qdisc/devices to two packets /
 +* or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
  * This allows for :
  *  - better RTT estimation and ACK scheduling
  *  - faster recovery
 @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
 int mss_now, int nonagle,
  * of queued bytes to ensure line rate.
  * One example is wifi aggregation (802.11 AMPDU)
  */
 -   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 +   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

 if (atomic_read(&sk->sk_wmem_alloc) > limit) {


The above brings back previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. Single flow is still (as it was before TSO
autosizing) limited to roughly ~280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. if I run a single UDP flow I
get ~300mbps while with, e.g. 5 I get 670mbps easily.

I guess it was the tx completion latency all along.

I just added extra debug output to ath10k to measure the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of a 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure not to hold any locks or introduce delays internal to
ath10k. Frames complete in 2-4ms on average under load.

When I tried a different ath10k hardware and firmware combination I got
1-2ms of tx completion latency, yielding ~430mbps, while the max should
be around 670mbps.
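
One way to see why completion latency caps a single flow: if the stack keeps only a fixed number of bytes outstanding until the driver reports completion, throughput cannot exceed that amount divided by the completion latency. A rough sketch of that arithmetic follows. The helper name is hypothetical and the inputs are illustrative; truesize overhead means the real payload ceiling is lower than what this returns.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Back-of-envelope ceiling on single-flow throughput when at most
 * `inflight_bytes` may stay queued until TX completion is reported.
 * Bits per microsecond is numerically equal to Mbit/s.
 * Hypothetical helper, not kernel or ath10k code.
 */
static uint64_t throughput_ceiling_mbps(uint64_t inflight_bytes,
                                        uint64_t completion_latency_us)
{
    assert(completion_latency_us > 0);
    return (inflight_bytes * 8) / completion_latency_us;
}
```

For example, `throughput_ceiling_mbps(212992, 3000)` gives the ceiling for roughly 208 KB outstanding with a 3 ms completion delay; halving the latency doubles the ceiling.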


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Michal Kazior
On 6 February 2015 at 14:40, Eric Dumazet eric.duma...@gmail.com wrote:
 On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

 The above brings back previous behaviour, i.e. I can get 600mbps TCP
 on 5 flows again. Single flow is still (as it was before TSO
 autosizing) limited to roughly ~280mbps.

 I never really bothered before to understand why I need to push a few
 flows through ath10k to max it out, i.e. if I run a single UDP flow I
 get ~300mbps while with, e.g. 5 I get 670mbps easily.


 For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
 enough : UDP has no callback from TX completion to feed following frames
 (No write queue like TCP)

 # cat /proc/sys/net/core/wmem_default
 212992
 # ethtool -C eth1 tx-usecs 1024 tx-frames 120
 # ./netperf -H remote -t UDP_STREAM -- -m 1450
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

 212992    1450   10.00      697705      0     809.27
 212992           10.00      673412            781.09

 # echo 80 > /proc/sys/net/core/wmem_default
 # ./netperf -H remote -t UDP_STREAM -- -m 1450
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

 80        1450   10.00     7329221      0    8501.84
 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also using non-default
wmem_default seems to introduce packet loss as per iperf reports at
the receiver. I suppose this is kind of expected but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Michal Kazior
On 6 February 2015 at 14:53, Eric Dumazet eric.duma...@gmail.com wrote:
 On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

 tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
 of TX completion delay. But this would require yet another expensive
 call to ktime_get() if HZ < 1000.

 Then tcp_write_xmit() could use it to adjust :

    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

 to

    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate

    limit = max(2 * skb->truesize, amount / 1000);

 I'll cook a patch.

 Hmm... doing this in all protocols would be too expensive,
 and we do not want to include time spent in qdiscs.

 wifi could eventually do that, providing in skb->tx_completion_delay_us
 the time spent in wifi driver.

 This way, we would have no penalty for network devices doing normal skb
 orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet
On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


 wifi could eventually do that, providing in skb->tx_completion_delay_us
 the time spent in wifi driver.
 
 This way, we would have no penalty for network devices doing normal skb
 orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet
On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

 Hmm.. I confirm it works. However the value at which I get full rate
 on a single flow is more than 2048K. Also using non-default
 wmem_default seems to introduce packet loss as per iperf reports at
 the receiver. I suppose this is kind of expected but on the other hand
 wmem_default=262992 and 5 flows of UDP max the device out with 0
 packet loss.

If you increase the ability to flood on one flow, then you need to make
sure the receiver has a big rcvbuf as well.

echo 200 > /proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of thing that TCP does automatically, but UDP does not.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Nicolas Cavallari
On 05/02/2015 15:48, Eric Dumazet wrote:
 On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:
 
 I do get your point. But 1.5ms is really tough on Wi-Fi.

 Just look at this:

 ; ping 192.168.1.2 -c 3
 PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
 
 That's a different point.
 
 I don't care about rtt but about TX completions (usually much, much lower
 than rtt).

On wired network perhaps, but definitely not on Wi-Fi.

With aggregation, you may send up to 4ms of data before the receiver
can acknowledge anything. But you have to gain access to the channel
first, so you may wait while others finish off their 4ms
transmissions. And this does not account for retransmissions.

And aggregation is not the only problem as far as bufferbloat is
concerned. I don't even want to think about powersave.


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet
On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

 The above brings back previous behaviour, i.e. I can get 600mbps TCP
 on 5 flows again. Single flow is still (as it was before TSO
 autosizing) limited to roughly ~280mbps.
 
 I never really bothered before to understand why I need to push a few
 flows through ath10k to max it out, i.e. if I run a single UDP flow I
 get ~300mbps while with, e.g. 5 I get 670mbps easily.
 

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 80 > /proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

80        1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44


 I guess it was the tx completion latency all along.
 
 I just added extra debug output to ath10k to measure the latency between
 submission and completion. Here's a log
 (http://www.filedropper.com/complete-log) of a 2s run of UDP iperf
 trying to push 1gbps but managing only 300mbps.
 
 I've made sure not to hold any locks or introduce delays internal to
 ath10k. Frames complete in 2-4ms on average under load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate

   limit = max(2 * skb->truesize, amount / 1000);

I'll cook a patch.
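
A minimal user-space sketch of the idea above, to make the arithmetic concrete. It is not the patch itself: `struct flow_state`, both function names and the 1/8 EWMA weight are assumptions made for illustration, and `tx_completion_delay_ms` only mirrors the field name floated here, not an existing kernel member.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-flow state; not an existing kernel structure. */
struct flow_state {
    uint32_t tx_completion_delay_ms; /* EWMA of TX completion delay */
};

/* EWMA giving the new sample a weight of 1/8 (the weight is an assumption). */
static void update_completion_delay(struct flow_state *fs, uint32_t sample_ms)
{
    assert(fs);
    fs->tx_completion_delay_ms =
        (7 * fs->tx_completion_delay_ms + sample_ms) / 8;
}

/* limit = max(2 * truesize, (2 + delay_ms) * pacing_rate / 1000),
 * with pacing_rate in bytes per second. */
static uint64_t tsq_limit_bytes(const struct flow_state *fs,
                                uint32_t truesize, uint64_t pacing_rate)
{
    uint64_t amount = (2 + (uint64_t)fs->tx_completion_delay_ms) * pacing_rate;
    uint64_t by_time = amount / 1000;
    uint64_t by_pkts = 2ULL * truesize;

    assert(fs);
    return by_time > by_pkts ? by_time : by_pkts;
}
```

With a measured delay of 1 ms and a pacing rate of 40 MB/s, the time-based term allows 3 ms of traffic (120000 bytes) instead of the fixed ~2 ms from the shift by 9.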

Thanks.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Eric Dumazet
On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

 tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
 of TX completion delay. But this would require yet another expensive
 call to ktime_get() if HZ < 1000.
 
 Then tcp_write_xmit() could use it to adjust :
 
    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 
 to
 
    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
 
    limit = max(2 * skb->truesize, amount / 1000);
 
 I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)




RE: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread David Laight
From: Eric Dumazet
 On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
 
 
  wifi could eventually do that, providing in skb->tx_completion_delay_us
  the time spent in wifi driver.
 
  This way, we would have no penalty for network devices doing normal skb
  orphaning (loopback interface, ethernet, ...)
 
 Another way would be that wifi does an automatic orphaning after 1 or
 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.
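
The byte-counting idea could be sketched roughly as below. This is only an illustration under stated assumptions: every name here (`struct orphan_budget`, `should_orphan_on_submit`, `orphaned_tx_complete`) is hypothetical and none of it exists in ath10k or mac80211.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver-side accounting for orphaned bytes. */
struct orphan_budget {
    uint64_t tx_rate_bytes_per_sec; /* current estimated TX rate */
    uint32_t window_ms;             /* ms of traffic allowed to be orphaned */
    uint64_t orphaned_bytes;        /* orphaned bytes still held by hardware */
};

static uint64_t budget_bytes(const struct orphan_budget *b)
{
    return b->tx_rate_bytes_per_sec * b->window_ms / 1000;
}

/* On submit: true means the caller should orphan this packet now;
 * false means it stays charged to its socket. */
static bool should_orphan_on_submit(struct orphan_budget *b, uint32_t len)
{
    if (b->orphaned_bytes + len > budget_bytes(b))
        return false;
    b->orphaned_bytes += len;
    return true;
}

/* On TX completion of an orphaned packet: release its share of the budget. */
static void orphaned_tx_complete(struct orphan_budget *b, uint32_t len)
{
    assert(b->orphaned_bytes >= len);
    b->orphaned_bytes -= len;
}
```

After `orphaned_tx_complete()` frees budget, a real driver would presumably walk its pending queue and orphan some of the still-parented packets, which this sketch leaves out.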

David



Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-06 Thread Rick Jones

If you increase the ability to flood on one flow, then you need to make
sure the receiver has a big rcvbuf as well.

echo 200 > /proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of thing that TCP does automatically, but UDP does not.


An alternative, if the application involved can make explicit 
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak 
rmem_max and wmem_max and then let the application make the setsockopt() 
calls.


Which path one would take would depend on circumstances I suspect.
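
For the explicit setsockopt() route, a minimal sketch might look like this. The helper name `set_socket_buffers` is made up for illustration; the socket calls themselves are the standard ones. On Linux the kernel caps each request at wmem_max / rmem_max, so those still have to be raised by the administrator.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Request send and receive buffer sizes on an existing socket.
 * Returns 0 on success, -1 if either setsockopt() call fails.
 */
static int set_socket_buffers(int fd, int sndbuf_bytes, int rcvbuf_bytes)
{
    assert(sndbuf_bytes > 0 && rcvbuf_bytes > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &sndbuf_bytes, sizeof(sndbuf_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        return -1;
    return 0;
}
```

An application would call this right after socket() and can read the effective value back with getsockopt() to see whether the system-wide maximum clipped the request.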

rick jones


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
 On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:

  Most conservative patch would be :
 
  diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
  index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
  --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
  +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
  @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
  break;
  }
  case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
  +   skb_orphan(skb);
  spin_lock_bh(&htt->tx_lock);
  __skb_queue_tail(&htt->tx_compl_q, skb);
  spin_unlock_bh(&htt->tx_lock);
 
 I suppose you want to call skb_orphan() on actual data packets, right?
 This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

 The intention is to control the queues to the following :
 
 1 ms of buffering, but limited to a configurable value.
 
 On a 40Gbps flow, 1ms represents 5 MB, which is insane.
 
 We do not want to queue 5 MB of traffic, this would destroy latencies
 for all concurrent flows. (Or would require having fq_codel or fq as
 packet schedulers, instead of default pfifo_fast)
 
 This is why having 1.5 ms delay between the transmit and TX completion
 is a problem in your case.

Note that the TCP stack could detect when this happens, *if* ACKs were
delivered before the TX completions, or when TX completion happens,
we could detect that the clone of the freed packet was freed.

In my test, when I did ethtool -C eth0 tx-usecs 1024 tx-frames 64 and
disabled GSO, the TCP stack sent a bunch of packets (a bit less than 64)
and blocked on tcp_limit_output_bytes.

Then we receive 2 stretch ACKs after ~50 usec.

The TCP stack tries to push some more packets but blocks on
tcp_limit_output_bytes again.

1ms later, TX completion happens, tcp_wfree() is called, and the TCP
stack pushes the following ~60 packets.


TCP could eventually adjust tcp_limit_output_bytes dynamically,
using a per-flow value, but I would rather not add a kludge in
the TCP stack only to deal with a possible bug in the ath10k driver.

niu has a similar issue and simply had to call skb_orphan() :

drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Michal Kazior
On 5 February 2015 at 14:19, Eric Dumazet eric.duma...@gmail.com wrote:
 On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

 The intention is to control the queues to the following :

 1 ms of buffering, but limited to a configurable value.

 On a 40Gbps flow, 1ms represents 5 MB, which is insane.

 We do not want to queue 5 MB of traffic, this would destroy latencies
 for all concurrent flows. (Or would require having fq_codel or fq as
 packet schedulers, instead of default pfifo_fast)

 This is why having 1.5 ms delay between the transmit and TX completion
 is a problem in your case.

I do get your point. But 1.5ms is really tough on Wi-Fi.

Just look at this:

; ping 192.168.1.2 -c 3
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

; ping 192.168.1.2 -c 3 -Q 224
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms

This was run with no load so batching code in the driver itself should
have no measurable effect. The channel was near-ideal: low noise
floor, cabled rf, no other traffic.

The lower latency ping is when 802.11 QoS Voice Access Category is
used. I also get 400mbps instead of 250mbps in this case with 5 flows
(net/master).

Dealing with black box firmware blobs is a pain.


 Note that the TCP stack could detect when this happens, *if* ACKs were
 delivered before the TX completions, or when TX completion happens,
 we could detect that the clone of the freed packet was freed.

 In my test, when I did ethtool -C eth0 tx-usecs 1024 tx-frames 64 and
 disabled GSO, the TCP stack sent a bunch of packets (a bit less than 64)
 and blocked on tcp_limit_output_bytes.

 Then we receive 2 stretch ACKs after ~50 usec.

 The TCP stack tries to push some more packets but blocks on
 tcp_limit_output_bytes again.

 1ms later, TX completion happens, tcp_wfree() is called, and the TCP
 stack pushes the following ~60 packets.


 TCP could eventually adjust tcp_limit_output_bytes dynamically,
 using a per-flow value, but I would rather not add a kludge in
 the TCP stack only to deal with a possible bug in the ath10k driver.

 niu has a similar issue and simply had to call skb_orphan() :

 drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);

Ok. I tried calling skb_orphan() right after I submit each Tx frame
(similar to niu which does this in start_xmit):

--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	if (res)
 		goto err_unmap_msdu;
 
+	skb_orphan(msdu);
+
 	return 0;
 
 err_unmap_msdu:


Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
single flow (even better than before). Wow.

Does this look ok/safe as a solution to you?


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 05:19 -0800, Eric Dumazet wrote:

 
 TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
 using a per flow dynamic value, but I would rather not add a kludge in
 TCP stack only to deal with a possible bug in ath10k driver.
 
 niu has a similar issue and simply had to call skb_orphan() :
 
 drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);

In your case that might be the place :

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 4bc51d8a14a3..cbda7a87d5a1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -468,6 +468,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	msdu_id = res;
 	htt->pending_tx[msdu_id] = msdu;
 	spin_unlock_bh(&htt->tx_lock);
+	skb_orphan(msdu);
 
 	prefetch_len = min(htt->prefetch_len, msdu->len);
 	prefetch_len = roundup(prefetch_len, 4);





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 09:38 +0100, Michal Kazior wrote:
 On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:
  I do not see how a TSO patch could hurt a flow not using TSO/GSO.
 
  This makes no sense.
 
 Hmm..
 
 @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
  * of queued bytes to ensure line rate.
  * One example is wifi aggregation (802.11 AMPDU)
  */
 -   limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
 - sk->sk_pacing_rate >> 10);
 +   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 +   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 
 Doesn't this effectively invert how tcp_limit_output_bytes is used?
 This would explain why raising the limit wasn't changing anything
 anymore when you asked me do so. Only decreasing it yielded any
 change.
 
 I've added a printk to show up the new and old values. Excerpt from logs:
 
 [  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)
 
 (2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = 
 limit)
 
 Reverting this patch hunk alone fixes my TCP problem. Not that I'm
 saying the old logic was correct (it seems it wasn't, a limit should
 be applied as min(value, max_value), right?).
 
 Anyway, the change doesn't seem to be TSO-only oriented, so that would
 explain the "makes no sense".


The intention is to control the queues to the following :

1 ms of buffering, but limited to a configurable value.

On a 40Gbps flow, 1ms represents 5 MB, which is insane.

We do not want to queue 5 MB of traffic, this would destroy latencies
for all concurrent flows. (Or would require having fq_codel or fq as
packet schedulers, instead of default pfifo_fast)

This is why having 1.5 ms delay between the transmit and TX completion
is a problem in your case.
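
The 5 MB figure follows directly from the rate: 40 Gbit/s is 5 GB/s, so one millisecond of it is 5 MB. A small sketch of that arithmetic, alongside the shift by 10 that the TCP code uses as a cheap stand-in for "about 1 ms" (both helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes a flow at `rate_bits_per_sec` sends in `ms` milliseconds. */
static uint64_t bytes_per_interval(uint64_t rate_bits_per_sec, uint32_t ms)
{
    assert(ms <= 1000); /* keeps the sketch far away from overflow */
    return rate_bits_per_sec / 8 * ms / 1000;
}

/* Dividing a bytes-per-second pacing rate by 1024 approximates 1 ms. */
static uint64_t approx_1ms_bytes(uint64_t pacing_rate_bytes_per_sec)
{
    return pacing_rate_bytes_per_sec >> 10;
}
```

For a 40 Gbit/s flow the exact value is 5000000 bytes and the shifted approximation is 4882812 bytes, which is why the result is additionally capped by tcp_limit_output_bytes.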

 





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

 I do get your point. But 1.5ms is really tough on Wi-Fi.
 
 Just look at this:
 
 ; ping 192.168.1.2 -c 3
 PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

That's a different point.

I don't care about rtt but about TX completions (usually much, much lower
than rtt).

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire until I get the TX completion IRQ and free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Michal Kazior
On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:
 I do not see how a TSO patch could hurt a flow not using TSO/GSO.

 This makes no sense.

Hmm..

@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);

Doesn't this effectively invert how tcp_limit_output_bytes is used?
This would explain why raising the limit wasn't changing anything
anymore when you asked me to do so. Only decreasing it yielded any
change.

I've added a printk to show the new and old values. Excerpt from logs:

[  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)

(2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)

Reverting this patch hunk alone fixes my TCP problem. Not that I'm
saying the old logic was correct (it seems it wasn't, a limit should
be applied as min(value, max_value), right?).

Anyway, the change doesn't seem to be TSO-only oriented, so that would
explain the "makes no sense".
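
To make the inversion concrete, here is a small stand-alone sketch of the two computations fed with the numbers from the log line. It assumes the second printk value (39126) is already the shifted term, sk_pacing_rate >> 10; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Before the change: the sysctl acted as a floor. */
static uint32_t old_limit(uint32_t sysctl_limit, uint32_t rate_term)
{
    return max_u32(sysctl_limit, rate_term);
}

/* After the change: the sysctl acts as a ceiling. */
static uint32_t new_limit(uint32_t truesize, uint32_t rate_term,
                          uint32_t sysctl_limit)
{
    uint32_t limit;

    assert(truesize <= UINT32_MAX / 2); /* 2 * truesize must not wrap */
    limit = max_u32(2 * truesize, rate_term);
    return min_u32(limit, sysctl_limit);
}
```

With truesize 2304 and a sysctl of 131072 the new code yields 39126 and the old code 131072. Doubling the sysctl leaves the new result at 39126, while lowering it below 39126 takes effect immediately, which matches what was observed.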


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

 Ok. I tried calling skb_orphan() right after I submit each Tx frame
 (similar to niu which does this in start_xmit):
 
 --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
 +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
 @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 if (res)
 goto err_unmap_msdu;
 
 +   skb_orphan(msdu);
 +
 return 0;
 
  err_unmap_msdu:
 
 
 Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
 single flow (even better than before). Wow.
 
 Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically removes years of work trying to fix bufferbloat.
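
A toy model of why orphaning at submit time removes backpressure. It is a deliberate simplification, not the kernel's accounting: the structures and function names are hypothetical, and no TX completions are simulated, so it only shows how many packets a sender can queue in one burst.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of socket send-buffer accounting. */
struct toy_sock {
    uint32_t wmem_alloc; /* bytes currently charged to the socket */
    uint32_t sndbuf;     /* SO_SNDBUF-style limit */
};

struct toy_dev {
    uint32_t queued_bytes; /* bytes sitting in the driver/hardware queue */
};

/* Try to queue one packet; returns false when the socket must wait. */
static bool toy_xmit(struct toy_sock *sk, struct toy_dev *dev,
                     uint32_t truesize, bool orphan_on_submit)
{
    if (sk->wmem_alloc + truesize > sk->sndbuf)
        return false; /* backpressure: sender blocks or gets an error */

    sk->wmem_alloc += truesize;
    dev->queued_bytes += truesize;

    if (orphan_on_submit)
        sk->wmem_alloc -= truesize; /* charge dropped before completion */
    return true;
}

/* Count how many of `attempts` packets get queued with no completions. */
static uint32_t toy_burst(uint32_t sndbuf, uint32_t truesize,
                          uint32_t attempts, bool orphan_on_submit)
{
    struct toy_sock sk = { 0, sndbuf };
    struct toy_dev dev = { 0 };
    uint32_t sent = 0;

    assert(truesize > 0);
    for (uint32_t i = 0; i < attempts; i++)
        if (toy_xmit(&sk, &dev, truesize, orphan_on_submit))
            sent++;
    return sent;
}
```

With a 212992-byte send buffer and 2304-byte packets, the non-orphaning case stops after 92 packets, while the orphaning case accepts every attempt and the device queue grows without bound.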

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.






Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Eric Dumazet
On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

 Not at all. This basically removes backpressure.
 
 A single UDP socket can now blast packets regardless of SO_SNDBUF
 limits.
 
 This basically removes years of work trying to fix bufferbloat.
 
 I still do not understand why increasing tcp_limit_output_bytes is not
 working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from GSO assumption. Maybe Herbert was right,
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb->truesize is roughly 2
ms worth of traffic.

Because you do not use GSO, and tx completions are slow, we need this :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 
 		/* TCP Small Queues :
-		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * Control number of packets in qdisc/devices to two packets /
+		 * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
 		 * This allows for :
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
@@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-05 Thread Dave Taht
On Fri, Feb 6, 2015 at 2:44 AM, Michal Kazior michal.kaz...@tieto.com wrote:

 On 5 February 2015 at 14:19, Eric Dumazet eric.duma...@gmail.com wrote:
  On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
 
  The intention is to control the queues to the following :
 
  1 ms of buffering, but limited to a configurable value.
 
  On a 40Gbps flow, 1ms represents 5 MB, which is insane.
 
  We do not want to queue 5 MB of traffic, this would destroy latencies
  for all concurrent flows. (Or would require having fq_codel or fq as
  packet schedulers, instead of default pfifo_fast)
 
  This is why having 1.5 ms delay between the transmit and TX completion
  is a problem in your case.

 I do get your point. But 1.5ms is really tough on Wi-Fi.

 Just look at this:

 ; ping 192.168.1.2 -c 3
 PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

 ; ping 192.168.1.2 -c 3 -Q 224
 PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms

 This was run with no load so batching code in the driver itself should
 have no measurable effect. The channel was near-ideal: low noise
 floor, cabled rf, no other traffic.

 The lower latency ping is when 802.11 QoS Voice Access Category is
 used. I also get 400mbps instead of 250mbps in this case with 5 flows
 (net/master).


The VO queue is now nearly useless in a real-world environment. While
it does grab the medium slightly faster in some cases, on a good day with
no other competing APs, it cannot aggregate packets and wastes TXOPs.
It is far saner to aim for better aggregation (use the VI queue if you
must try to get better media acquisition).

It is disabled in multiple products I know of.

And I really, really, really wish that just once during this thread
someone had bothered to run a test at a real-world MCS rate, say MCS1
or MCS4, and measured the latency under load at that rate...

or tried talking to two or more stations at the same time.

Instead of trying for 1.5Gbits in a Faraday cage.



 Dealing with black box firmware blobs is a pain.


+10



  Note that the TCP stack could detect when this happens, *if* ACKs were
  delivered before the TX completions, or when TX completion happens,
  we could detect that the clone of the freed packet was freed.
 
  In my test, when I did ethtool -C eth0 tx-usecs 1024 tx-frames 64 and
  disabled GSO, the TCP stack sent a bunch of packets (a bit less than 64)
  and blocked on tcp_limit_output_bytes.
 
  Then we receive 2 stretch ACKs after ~50 usec.
 
  The TCP stack tries to push some more packets but blocks on
  tcp_limit_output_bytes again.
 
  1ms later, TX completion happens, tcp_wfree() is called, and the TCP
  stack pushes the following ~60 packets.
 
 
  TCP could eventually adjust tcp_limit_output_bytes dynamically,
  using a per-flow value, but I would rather not add a kludge in
  the TCP stack only to deal with a possible bug in the ath10k driver.
 
  niu has a similar issue and simply had to call skb_orphan() :
 
  drivers/net/ethernet/sun/niu.c:6669:skb_orphan(skb);

 Ok. I tried calling skb_orphan() right after I submit each Tx frame
 (similar to niu which does this in start_xmit):

 --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
 +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
 @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 if (res)
 goto err_unmap_msdu;

 +   skb_orphan(msdu);
 +
 return 0;

  err_unmap_msdu:


 Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
 single flow (even better than before). Wow.

 Does this look ok/safe as a solution to you?


 Michał




-- 
Dave Täht

thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet
I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
probability to hit this condition in tcp_wfree() :

/* If this softirq is serviced by ksoftirqd, we are likely under stress.
 * Wait until our queues (qdisc + devices) are drained.
 * This gives :
 * - less callbacks to tcp_write_xmit(), reducing stress (batches)
 * - chance for incoming ACK (processed by another cpu maybe)
 *   to migrate this flow (skb->ooo_okay will be eventually set)
 */
if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
	goto out;

Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.

I have hard time to understand why tx completed packets go through
ath10k_htc_rx_completion_handler().. anyway...

Most conservative patch would be :

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c 
b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 
9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4
 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct 
sk_buff *skb)
break;
}
case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+   skb_orphan(skb);
spin_lock_bh(&htt->tx_lock);
__skb_queue_tail(&htt->tx_compl_q, skb);
spin_unlock_bh(&htt->tx_lock);




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Michal Kazior
On 4 February 2015 at 22:11, Eric Dumazet eric.duma...@gmail.com wrote:
 I do not see how a TSO patch could hurt a flow not using TSO/GSO.

 This makes no sense.

 ath10k tx completions being batched/deferred to a tasklet might increase
 probability to hit this condition in tcp_wfree() :

 /* If this softirq is serviced by ksoftirqd, we are likely under 
 stress.
  * Wait until our queues (qdisc + devices) are drained.
  * This gives :
  * - less callbacks to tcp_write_xmit(), reducing stress (batches)
  * - chance for incoming ACK (processed by another cpu maybe)
  *   to migrate this flow (skb->ooo_okay will be eventually set)
  */
 if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
     goto out;

 Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
 additional packets.

 I would try to call skb_orphan() in ath10k if you really want to keep
 these batches.

 I have hard time to understand why tx completed packets go through
 ath10k_htc_rx_completion_handler().. anyway...

There's a couple of layers for host-firmware communication. The
transport layer (e.g. PCI) delivers HTC packets. These contain WMI
(configuration stuff) or HTT (traffic stuff). HTT can contain
different events (tx complete, rx complete, etc). HTT Tx completion
contains a list of ids which refer to frames that have been completed
(either sent or dropped).

I've tried reverting tx/rx tasklet batching. No change in throughput.
I can get tcpdump if you're interested.


 Most conservative patch would be :

 diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c 
 b/drivers/net/wireless/ath/ath10k/htt_rx.c
 index 
 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4
  100644
 --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
 +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
 @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, 
 struct sk_buff *skb)
 break;
 }
 case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
 +   skb_orphan(skb);
 spin_lock_bh(&htt->tx_lock);
 __skb_queue_tail(&htt->tx_compl_q, skb);
 spin_unlock_bh(&htt->tx_lock);

I suppose you want to call skb_orphan() on actual data packets, right?
This skb is just a host-firmware communication buffer.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet
On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
 On 4 February 2015 at 12:57, Eric Dumazet eric.duma...@gmail.com wrote:

 
  To disable gso you would have to use :
 
  ethtool -K wlan1 gso off
 
 Oh, thanks! This works. However I can't turn it on:
 
 ; ethtool -K wlan1 gso on
 Could not change any device features
 
 ..so I guess it makes no sense to re-run tests because:
 
 ; ethtool -k wlan1 | grep generic
 tx-checksum-ip-generic: on [fixed]
 generic-segmentation-offload: off [requested on]
 generic-receive-offload: on
 
 And this seems to never change.

GSO requires SG (Scatter Gather)

Are you sure this hardware has no SG support ?




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Michal Kazior
On 4 February 2015 at 13:38, Eric Dumazet eric.duma...@gmail.com wrote:
 On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
 On 4 February 2015 at 12:57, Eric Dumazet eric.duma...@gmail.com wrote:


  To disable gso you would have to use :
 
  ethtool -K wlan1 gso off

 Oh, thanks! This works. However I can't turn it on:

 ; ethtool -K wlan1 gso on
 Could not change any device features

 ..so I guess it makes no sense to re-run tests because:

 ; ethtool -k wlan1 | grep generic
 tx-checksum-ip-generic: on [fixed]
 generic-segmentation-offload: off [requested on]
 generic-receive-offload: on

 And this seems to never change.

 GSO requires SG (Scatter Gather)

 Are you sure this hardware has no SG support ?

The hardware itself seems to be capable. The firmware is a problem
though. I'm also not sure if mac80211 can handle this as is. No 802.11
driver seems to support SG except wil6210 which uses cfg80211 and
netdevs directly.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Michal Kazior
On 3 February 2015 at 15:27, Eric Dumazet eric.duma...@gmail.com wrote:
 On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
[...]
 IOW:
  - stretch acks / TSO defer don't seem to help much (when compared to
 throughput results from yesterday)
  - GRO helps
  - disabling A-MSDU on sender helps
  - net/master+GRO still doesn't reach the performance from before the
 regression (~600mbps w/ GRO)

 You can grab logs and dumps here: http://www.filedropper.com/test2tar


 Thanks for these traces.

 There is absolutely a problem at the sender, as we can see a big 2ms
 delay between reception of ACK and send of following packets.
 TCP stack should generate them immediately.
 Are you using some kind of netem qdisc ?

Both systems have identical setup:

; tc qdisc
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev wlan1 root
qdisc pfifo_fast 0: dev wlan1 parent :1 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :2 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :3 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :4 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1


 These 2ms delays, in a flow with a 5ms RTT are terrible.

 06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack
 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
 06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack
 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
 06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack
 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
 06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack
 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
 06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0,
 win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
 this 2ms delay is not generated by TCP stack.
 06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq
 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303],
 length 1448
[...]

 Are packets TX completed after a timer or something ?

As far as ath10k is concerned - no timers here. Not sure about
firmware itself though.
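
The gap under discussion can be read straight off the quoted tcpdump timestamps. A minimal sketch, assuming the two stamps quoted above (the last ACK received and the next data segment sent); the helper name gap_ms is made up for illustration:

```python
from datetime import datetime


def gap_ms(earlier: str, later: str) -> float:
    """Return the time between two tcpdump HH:MM:SS.ffffff stamps, in ms."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() * 1000.0


# Timestamps copied from the quoted trace.
last_ack = "06:54:57.408474"   # final ACK of the burst arrives
next_data = "06:54:57.410243"  # sender emits the next data segment

print(f"{gap_ms(last_ack, next_data):.3f} ms")
```

That is a bit under 2 ms of sender idle time on a flow whose RTT is around 5 ms, which is why the delay matters so much here.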


 Some very heavy stuff might run from tasklet (or other softirq triggered) 
 event.

 BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
 it's not clear if they are delayed at one side or the other.

 GRO should delay only GRO candidates. ACK packets are not GRO candidates.

 Have you tried to disable GSO on sender ?

I assume I do that via ethtool? This is my current setup on both systems:

; ethtool -k wlan1
Features for wlan1:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on [fixed]
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]

; ethtool -K wlan1 generic-segmentation-offload off
ethtool: bad command line argument(s)
For more information run ethtool -h


 (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end
 aggregation)

This could work if your firmware/device supports this kind of thing.
To my understanding ath10k firmware doesn't.


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Johannes Berg
On Wed, 2015-02-04 at 13:53 +0100, Michal Kazior wrote:

 The hardware itself seems to be capable. The firmware is a problem
 though. I'm also not sure if mac80211 can handle this as is. No 802.11
 driver seems to support SG except wil6210 which uses cfg80211 and
 netdevs directly.

mac80211 cannot deal with this right now. This would make a good topic
for the workshop since there's interest elsewhere in this as well. It's
probably not terribly hard to do as far as mac80211 is concerned.

How much offload do you really have though? Sometimes people just want
to build A-MSDUs.

johannes



Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-04 Thread Eric Dumazet
OK guys

Using a mlx4 testbed I can reproduce the problem by pushing coalescing
settings and disabling SG (thus disabling GSO)

ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
tx-scatter-gather: off
generic-segmentation-offload: off [requested on]

ethtool -C eth0 tx-usecs 1024 tx-frames 64

Meaning that NIC waits one ms before sending the TX IRQ,
and can accumulate 64 frames before forcing the interrupt.

We probably have a bug in cwnd expansion logic :

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 
cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       530.39   0.40     0.32     2.965   2.398


- final cwnd=59 which is not enough to avoid the 1ms delay between each
burst. 

So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
before sending the following burst.
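
As a rough sanity check of that explanation, here is a back-of-the-envelope sketch. It assumes 1448-byte TCP payloads (the segment length seen in the tcpdump earlier in the thread, not something measured on the mlx4 testbed) and ignores the time needed to put each burst on the wire, so it is only an upper bound and not a model of the stack. The function name is hypothetical.

```python
def burst_limited_mbps(packets_per_burst: int, payload_bytes: int,
                       wait_s: float) -> float:
    """Ceiling on goodput when one burst is sent per TX-completion wait."""
    bits_per_burst = packets_per_burst * payload_bytes * 8
    return bits_per_burst / wait_s / 1e6


# cwnd=59 from the netperf TCP_INFO dump, 1 ms TX IRQ coalescing delay.
ceiling = burst_limited_mbps(packets_per_burst=59, payload_bytes=1448,
                             wait_s=1e-3)
print(f"{ceiling:.0f} Mbit/s")
```

With cwnd=59 the ceiling comes out a little under 700 Mbit/s, the same ballpark as the 530 Mbit/s that netperf reported once transmit time and ACK latency are added on top of the 1 ms wait.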

I am CCing Neal, he probably can help to root cause the problem.

Thanks




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-03 Thread Michal Kazior
On 3 February 2015 at 02:18, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

 It seems to break ACK clocking badly (linux stack has a somewhat buggy
 tcp_tso_should_defer(), which relies on ACK being received smoothly, as
 no timer is setup to split the TSO packet.)

 Following patch might help the TSO split defer logic.

 It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
 Small Queue logic prevented the xmit at the expiration of first 'timer'.

 This patch clears the tso_deferred variable only if we could really
 send something.

 Please try it, thanks !
[..patch..]

I've done a second round of tests. I've added the A-MSDU count
parameter I've mentioned in my other email into the mix.

 net - net/master (includes stretch ack patches)
 net-tso - net/master + your TSO defer patch
 net-gro - net/master + my ath10k GRO patch
 net-gro-tso - net/master + duh

Here's the best of amsdu count 1 and 3:

 ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
 net-gro-tso/output.txt
 A-MSDU limit=1, 436 Mbits/sec
 A-MSDU limit=3, 284 Mbits/sec
 net-gro/output.txt
 A-MSDU limit=1, 444 Mbits/sec
 A-MSDU limit=3, 283 Mbits/sec
 net-tso/output.txt
 A-MSDU limit=1, 376 Mbits/sec
 A-MSDU limit=3, 251 Mbits/sec
 net/output.txt
 A-MSDU limit=1, 387 Mbits/sec
 A-MSDU limit=3, 260 Mbits/sec

IOW:
 - stretch acks / TSO defer don't seem to help much (when compared to
throughput results from yesterday)
 - GRO helps
 - disabling A-MSDU on sender helps
 - net/master+GRO still doesn't reach the performance from before the
regression (~600mbps w/ GRO)
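
To put numbers on the list above, the following sketch computes relative differences from the best-of figures in the table. The values are copied from the output above; the dictionary layout and the pct_change helper are just illustrative.

```python
# Best-of throughput in Mbits/sec, keyed by kernel variant and A-MSDU limit.
results = {
    "net":         {1: 387, 3: 260},
    "net-tso":     {1: 376, 3: 251},
    "net-gro":     {1: 444, 3: 283},
    "net-gro-tso": {1: 436, 3: 284},
}


def pct_change(new: float, base: float) -> float:
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


# Effect of lowering the A-MSDU limit from 3 to 1 within each variant.
for name, by_limit in results.items():
    gain = pct_change(by_limit[1], by_limit[3])
    print(f"{name}: A-MSDU 1 vs 3: {gain:+.0f}%")

# Effect of each patch against plain net/master at A-MSDU limit 1.
base = results["net"][1]
print(f"GRO vs net:       {pct_change(results['net-gro'][1], base):+.1f}%")
print(f"TSO defer vs net: {pct_change(results['net-tso'][1], base):+.1f}%")
```

On these figures GRO is worth roughly 15% at A-MSDU limit 1, while the TSO defer patch moves throughput by only a few percent, which is likely within run-to-run noise.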

You can grab logs and dumps here: http://www.filedropper.com/test2tar


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-03 Thread Eric Dumazet
On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

 Are packets TX completed after a timer or something ?
 
 Some very heavy stuff might run from tasklet (or other softirq triggered) 
 event.
 

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
(ath10k: batch htt tx/rx completions)
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the
_bh() suffixes are not really needed.

It seems lot of batching happens in wifi drivers, not necessarily at the
right places.





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-03 Thread Michal Kazior
On 2 February 2015 at 19:52, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:

 While testing I've had my internal GRO patch for ath10k and no stretch
 ack patches.

 Thanks for the data, I took a look at it.

 I am afraid this GRO patch might be the problem.

The entire performance drop happens without the GRO patch as well. I
tested with it included because I intended to upstream it later. I'll
run without it in future tests.


[...]
 Could you make again your experiments using upstream kernel (David
 Miller net tree) ?

Sure.


 You also could post the GRO patch so that we can comment on it.

(You probably want to see mac80211 patch as well:
06d181a8fd58031db9c114d920b40d8820380a6e mac80211: add NAPI support
back)

diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 36a8fcf..367e896 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -1147,6 +1147,12 @@ err:
 }
 EXPORT_SYMBOL(ath10k_core_start);

+static int ath10k_core_napi_dummy_poll(struct napi_struct *napi, int budget)
+{
+   WARN_ON(1);
+   return 0;
+}
+
 int ath10k_wait_for_suspend(struct ath10k *ar, u32 suspend_opt)
 {
int ret;
@@ -1414,6 +1420,10 @@ struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
INIT_WORK(&ar->register_work, ath10k_core_register_work);
INIT_WORK(&ar->restart_work, ath10k_core_restart);

+   init_dummy_netdev(&ar->napi_dev);
+   ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
+  ath10k_core_napi_dummy_poll, 64);
+
ret = ath10k_debug_create(ar);
if (ret)
goto err_free_wq;
@@ -1434,6 +1444,7 @@ void ath10k_core_destroy(struct ath10k *ar)
 {
flush_workqueue(ar->workqueue);
destroy_workqueue(ar->workqueue);
+   netif_napi_del(&ar->napi);

ath10k_debug_destroy(ar);
ath10k_mac_destroy(ar);
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 2d9f871..b5a8847 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -623,6 +623,9 @@ struct ath10k {

struct dfs_pattern_detector *dfs_detector;

+   struct net_device napi_dev;
+   struct napi_struct napi;
+
 #ifdef CONFIG_ATH10K_DEBUGFS
struct ath10k_debug debug;
 #endif
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c
b/drivers/net/wireless/ath/ath10k/htt_rx.c
index c1da44f..7e58b38 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -2061,5 +2061,7 @@ static void ath10k_htt_txrx_compl_task(unsigned long ptr)
ath10k_htt_rx_in_ord_ind(ar, skb);
dev_kfree_skb_any(skb);
}
+
+   napi_gro_flush(&htt->ar->napi, false);
spin_unlock_bh(&htt->rx_ring.lock);
 }

So that you can quickly get an understanding how ath10k Rx works:
first tasklet (not visible in the patch) picks up smallish event
buffers from firmware and puts them into ath10k queue for later
processing by another tasklet (the last hunk). Each such event buffer
is just some metainfo but can carry tens of frames (both Rx and Tx
completions). The count is arbitrary and depends on fw/hw combo and
air conditions. The GRO flush is called after all queued small event
buffers are processed (frames delivered up to mac80211 which can in
turn perform aggregation reordering in case some frames were
re-transmitted in the meantime before handing them to net subsystem).


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-03 Thread Michal Kazior
On 3 February 2015 at 00:06, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

 It is a big throughput win to have fewer TCP ack packets on
 wireless since it is a half-duplex environment.  Is there anything
 we could improve so that we can have fewer acks and still get
 good tcp stack behaviour?

 First apply TCP stretch ack fixes to the sender. There is no way to get
 good performance if the sender does not handle stretch ack.

 d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
 9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
 c22bdca94782 tcp: fix stretch ACK bugs in Reno
 814d488c6126 tcp: fix the timid additive increase on stretch ACKs
 e73ebb0881ea tcp: stretch ACK fixes prep

 Then, make sure you do not throttle ACK too long, especially if you hope
 to get Gbit line rate on a 4 ms RTT flow.

 GRO does not mean : send one ACK every ms, or after 3ms delay...

I think it's worth pointing out that if you assume 3-frame A-MSDU and
64-frame A-MPDU you get 192 frames (as far as TCP/IP is concerned) per
aggregation window. Assuming effective 600mbps throughput:

 python> 1.0/((((600/8)*1024*1024)/1500)/(3*64))
 0.003663003663003663

This is probably worst case, but still probably worth to keep in mind.
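
The same estimate can be written out step by step. This is a sketch under the assumptions stated in the paragraph above (1500-byte frames, a 3-frame A-MSDU, a 64-frame A-MPDU, and "mega" taken as 1024*1024 as in the one-liner); the function name is made up. It uses floating-point division throughout, so the last digits differ slightly from the integer-division result quoted above.

```python
def aggregation_window_s(throughput_mbps: float, frame_bytes: int = 1500,
                         amsdu: int = 3, ampdu: int = 64) -> float:
    """Time needed to fill one full aggregation window at a given rate."""
    bytes_per_s = throughput_mbps / 8 * 1024 * 1024
    frames_per_s = bytes_per_s / frame_bytes
    windows_per_s = frames_per_s / (amsdu * ampdu)
    return 1.0 / windows_per_s


print(f"A-MSDU 3: {aggregation_window_s(600) * 1000:.2f} ms")
print(f"A-MSDU 1: {aggregation_window_s(600, amsdu=1) * 1000:.2f} ms")
```

At 600mbps this gives a window of a few milliseconds for the default A-MSDU count of 3, and about a third of that with the count set to 1, which is comparable to the RTTs discussed elsewhere in the thread.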

ath10k has a knob to tune A-MSDU aggregation count. The default is 3
and it's what I've been testing so far.

When I change it to 1 on sender I get 250->400mbps boost in TCP -P5
but see no difference with -P1 (number of flows). Changing it to 1
on receiver yields no difference. I can try adding this configuration
permutation to my future tests if you're interested.

So that you have an idea - using 1 on sender degrades UDP throughput
(even 690->500mbps in some cases).


Michał


Re: [Cerowrt-devel] Open Source RRM Hand-Over Optimization (WAS: Throughput regression with `tcp: refine TSO autosizing`)

2015-02-03 Thread Björn Smedman
On Tue, Feb 3, 2015 at 12:27 AM, David Lang da...@lang.hm wrote:
 On Mon, 2 Feb 2015, Avery Pennarun wrote:
 On Mon, Feb 2, 2015 at 11:44 AM, Björn Smedman b...@anyfi.net wrote:
 We've got an SDN-inspired architecture with 802.11 frame tunneling (a
 la CAPWAP), airtime fairness, infrastructure initiated hand-over,
 Opportunistic Key Caching (OKC), IEEE 802.11r Fast BSS Transition and
 a few more goodies. It's currently free as in beer
 (http://anyfi.net/software,
 https://github.com/carrierwrt/carrierwrt/pull/7 and
 http://www.anyfinetworks.com/download) up to 100 APs, but we're
 definitely going to open source in one form or another.

 Please keep in touch, when it is released open source I'd be very interested
 in trying it for SCaLE. I'll probably exceed your 100 radio free limit this
 year, and it's hard to justify using non-free code at a linux conference
 (not impossible, but not something I'm going to try to do 3 weeks before the
 show :-)

Will do. :)

 I'm doing social engineering to push people to the 5GHz network (SSID for 5G
 is scale, for 2.4 is scale-slow), it would be great to be able to do this
 directly. And better handoffs as people move around would be good.

 It would also be good if something like this could help identify gaps in
 coverage. If it can identify cases where users go from having coverage to
 poor connectivity to having coverage, we can manually investigate to see
 where in the building that is and see what we can do to fix it.

Both of those should be well within scope. :)


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Ben Greear
On 02/02/2015 10:52 AM, Eric Dumazet wrote:
 On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:
 
 While testing I've had my internal GRO patch for ath10k and no stretch
 ack patches.
 
 Thanks for the data, I took a look at it.
 
 I am afraid this GRO patch might be the problem.
 
 It seems to break ACK clocking badly (linux stack has a somewhat buggy
 tcp_tso_should_defer(), which relies on ACK being received smoothly, as
 no timer is setup to split the TSO packet.)

It is a big throughput win to have fewer TCP ack packets on
wireless since it is a half-duplex environment.  Is there anything
we could improve so that we can have fewer acks and still get
good tcp stack behaviour?

Thanks,
Ben

-- 
Ben Greear gree...@candelatech.com
Candela Technologies Inc  http://www.candelatech.com



Re: Open Source RRM Hand-Over Optimization (WAS: Throughput regression with `tcp: refine TSO autosizing`)

2015-02-02 Thread Avery Pennarun
On Mon, Feb 2, 2015 at 11:44 AM, Björn Smedman b...@anyfi.net wrote:
 On Mon, Feb 2, 2015 at 5:21 AM, Avery Pennarun apenw...@google.com wrote:
 While there is definitely some work to be done in handoff, it seems
 like there are some fine implementations of this already in existence.
 Several brands of enterprise access point setups seem to do well at
 this.  It would be nice if they interoperated, I guess.

 The fact that there's no open source version of this kind of handoff
 feature bugs me, but we are working on it here and the work is all
 planned to be open source, for example: (very early version)
 https://gfiber.googlesource.com/vendor/google/platform/+/master/waveguide/

 We've got an SDN-inspired architecture with 802.11 frame tunneling (a
 la CAPWAP), airtime fairness, infrastructure initiated hand-over,
 Opportunistic Key Caching (OKC), IEEE 802.11r Fast BSS Transition and
 a few more goodies. It's currently free as in beer
 (http://anyfi.net/software,
 https://github.com/carrierwrt/carrierwrt/pull/7 and
 http://www.anyfinetworks.com/download) up to 100 APs, but we're
 definitely going to open source in one form or another.

 We've also tried to raise some interest in fixing up CAPWAP
 (https://www.ietf.org/mail-archive/web/opsawg/current/msg03196.html),
 which is (unfortunately) the best open standard at the moment.
 Interest seems marginal though...

This sounds cool.  Is the CAPWAP/encapsulation stuff separable from
the rest?  At 802.11ac speeds, a super fast WAN link, and a low-cost
SoC, too many layers can be a killer.


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet
On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

 It seems to break ACK clocking badly (linux stack has a somewhat buggy
 tcp_tso_should_defer(), which relies on ACK being received smoothly, as
 no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@ static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
return true;
 
 send_now:
-   tp->tso_deferred = 0;
return false;
 }
 
@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
break;
 
+   tp->tso_deferred = 0;
 repair:
/* Advance the send_head.  This one is sent out.
 * This call will increment packets_out.




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet
On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

 It is a big throughput win to have fewer TCP ack packets on
 wireless since it is a half-duplex environment.  Is there anything
 we could improve so that we can have fewer acks and still get
 good tcp stack behaviour?

First apply TCP stretch ack fixes to the sender. There is no way to get
good performance if the sender does not handle stretch ack.

d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
c22bdca94782 tcp: fix stretch ACK bugs in Reno
814d488c6126 tcp: fix the timid additive increase on stretch ACKs
e73ebb0881ea tcp: stretch ACK fixes prep

Then, make sure you do not throttle ACK too long, especially if you hope
to get Gbit line rate on a 4 ms RTT flow.

GRO does not mean : send one ACK every ms, or after 3ms delay...

It is literally :
  aggregate X packets at receive, and send the ACK asap.

If the receiver expects to have 64 ACK packets in the TX ring buffer to
actually send them (wifi aggregation), then you certainly do not want to
compress ACK too much.





Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Michal Kazior
On 30 January 2015 at 15:40, Eric Dumazet eric.duma...@gmail.com wrote:
 On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

 I've briefly tried playing with this knob to no avail unfortunately. I
 tried 256K, 1M - it didn't improve TCP performance. When I tried to
 make it smaller (e.g. 16K) the traffic dropped even more so it does
 have an effect. It seems there's some other limiting factor in this
 case.

 Interesting.

 Could you take some tcpdump/pcap with various tcp_limit_output_bytes
 values ?

 echo 131072 > /proc/sys/net/ipv4/tcp_limit_output_bytes
 tcpdump -p -i wlanX -s 128 -c 2 -w 128k.pcap

 echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
 tcpdump -p -i wlanX -s 128 -c 2 -w 256k.pcap

I've run a couple of tests across different kernels. This got pretty
big so I decided to use an external file hosting:
 http://www.filedropper.com/testtar

Let me know if you can't access it (and perhaps you could suggest how
you prefer the logs to be delivered in that case).

The layout of logs is: $kernel/$limit-P$threads.pcap. I've also
included the test script and output of each test run.

While testing I've had my internal GRO patch for ath10k and no stretch
ack patches.

When I was trying to come up with a testing methodology I've noticed
something interesting:
 1. set 16k limit
 2. start iperf -P1
 3. observe 200mbps
 4. set 2048k limit (while iperf is running)
 5. observe 600mbps
 6. set 16k limit back (while iperf is running)
 7. observe 500-600mbps (i.e. no drop to 200mbps)

Due to that I've decided to re-start iperf for each limit test.

If you want me to gather some other logs/dumps/configuration
permutations let me know, please.


Michał


Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Jim Gettys
On Sun, Feb 1, 2015 at 11:04 PM, Avery Pennarun apenw...@google.com wrote:
 On Sun, Feb 1, 2015 at 6:34 PM, Andrew McGregor andrewm...@gmail.com wrote:
 I missed one item in my list of potential improvements: the most braindead
 thing 802.11 has to say about rates is that broadcast and multicast packets
 should be sent at 'the lowest basic rate in the current supported rate set',
 which is really wasteful.  There are a couple of ways of dealing with this:
 one, ignore the standard and pick the rate that is most likely to get the
 frame to as many neighbours as possible (by a scan of the Minstrel tables).
 Or two, fan it out as unicast, which might well take less airtime (due to
 aggregation) as well as being much more likely to be delivered, since you
 get ACKs and retries by doing that.

 As far as I can see, the only sensible thing to do with
 multicast/broadcast is some variation of the unicast fanout, unless
 you've got a truly huge number of nodes.  I don't know of any
 protocols (certainly not video streams) that actually work well with
 the kind of packet loss you see at medium/long range with wifi if
 retransmits aren't used.  I've heard that openwrt already has a patch
 included that does this kind of fanout at the bridge layer.

I gather some Windows drivers from some vendors do this unicast fanout
(claim made by one of their engineers in an early homenet meeting).


 I've also heard of a new reliable multicast in some newer 802.11
 variant, which essentially sends out a single multicast packet and
 expects an ACK from each intended recipient.  Other than adding
 complexity, it seems like the best of both worlds.

So long as it times out in some very small, finite time.  We don't
want a repeat of the infinite retry bugs Dave found in drivers a few
years back...

Reliable multicast is ultimately an oxymoron, particularly on a
medium with hundreds-to-one bandwidth variation.  One remote
low-bandwidth station cannot be allowed to drag the entire network to
the basement.
 - Jim


Open Source RRM Hand-Over Optimization (WAS: Throughput regression with `tcp: refine TSO autosizing`)

2015-02-02 Thread Björn Smedman
On Mon, Feb 2, 2015 at 5:21 AM, Avery Pennarun apenw...@google.com wrote:
 On Sun, Feb 1, 2015 at 9:43 AM,  dpr...@reed.com wrote:
 Just to clarify, managing queueing in a single access point WiFi network is
 only a small part of the problem of fixing the rapidly degrading performance
 of WiFi based systems.

 Can you explain what you mean by rapidly degrading?  The performance
 in odd situations is certainly not inspirational, but I haven't
 noticed it getting worse over time.

 Similarly, mesh routing is only a small part of the
 problem with the scalability of cooperative meshes based on the WiFi MAC.

 That's certainly true.  Not to say the mesh routing algorithms are
 much good either.

  Also, as we noted
 earlier, handoff from one next hop to another is a huge problem with
 performance in practical deployments (a factor of 10x at least, just in
 that).

 While there is definitely some work to be done in handoff, it seems
 like there are some fine implementations of this already in existence.
 Several brands of enterprise access point setups seem to do well at
 this.  It would be nice if they interoperated, I guess.

 The fact that there's no open source version of this kind of handoff
 feature bugs me, but we are working on it here and the work is all
 planned to be open source, for example: (very early version)
 https://gfiber.googlesource.com/vendor/google/platform/+/master/waveguide/

We've got an SDN-inspired architecture with 802.11 frame tunneling (a
la CAPWAP), airtime fairness, infrastructure initiated hand-over,
Opportunistic Key Caching (OKC), IEEE 802.11r Fast BSS Transition and
a few more goodies. It's currently free as in beer
(http://anyfi.net/software,
https://github.com/carrierwrt/carrierwrt/pull/7 and
http://www.anyfinetworks.com/download) up to 100 APs, but we're
definitely going to open source in one form or another.

We've also tried to raise some interest in fixing up CAPWAP
(https://www.ietf.org/mail-archive/web/opsawg/current/msg03196.html),
which is (unfortunately) the best open standard at the moment.
Interest seems marginal though...

If anybody's interested in joining forces on either front we'd be
happy to talk.

Cheers,

Björn


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-02-02 Thread Eric Dumazet
On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:

 While testing I've had my internal GRO patch for ath10k and no stretch
 ack patches.

Thanks for the data, I took a look at it.

I am afraid this GRO patch might be the problem.

It seems to break ACK clocking badly (the Linux stack has a somewhat buggy
tcp_tso_should_defer(), which relies on ACKs being received smoothly, as
no timer is set up to split the TSO packet).

I am seeing huge delays on ACK packets and bursts like that :

05:01:53.413038 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 76745, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757300], length 0
05:01:53.413407 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 79641, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757301], length 0
05:01:53.413969 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 92673, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.413990 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 97017, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.414011 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 110049, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
...
05:01:53.422663 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 189689, win 4435, options [nop,nop,TS val 4294758519 ecr 4294757310], length 0
05:01:53.424354 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 198377, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757311], length 0
05:01:53.424400 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 202721, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
05:01:53.424409 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 205617, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
...
05:01:53.450248 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 419921, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757337], length 0
05:01:53.450266 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 427161, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0
05:01:53.450289 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 431505, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0
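
One way to quantify the burstiness in a trace like this is to compute, for
each ACK, the time since the previous ACK and how many new bytes it
acknowledges. The following is a rough sketch, not the tool used in this
thread: the regular expression and function name are assumptions, and it
relies only on the leading timestamp and the "ack N," field of tcpdump's
text output, so lines without both are skipped.

```python
import re
from datetime import datetime, timedelta

# Leading HH:MM:SS.uuuuuu timestamp, then the cumulative ACK number.
ACK_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) .*\back (\d+),")


def ack_steps(lines):
    """Yield (gap_in_microseconds, newly_acked_bytes) between consecutive ACKs."""
    prev = None
    for line in lines:
        m = ACK_LINE.match(line)
        if not m:
            continue  # continuation lines or unrelated output
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        ack = int(m.group(2))
        if prev is not None:
            gap_us = (ts - prev[0]) // timedelta(microseconds=1)
            yield gap_us, ack - prev[1]
        prev = (ts, ack)


# Three consecutive lines taken from the trace above.
sample = [
    "05:01:53.413407 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 79641, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757301], length 0",
    "05:01:53.413969 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 92673, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0",
    "05:01:53.413990 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 97017, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0",
]
steps = list(ack_steps(sample))
```

For these three lines the steps are 562 us for 13032 bytes and then 21 us
for 4344 bytes, i.e. ACKs arrive in clumps and each one covers several
segments. The sketch ignores timestamp wrap at midnight.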

Could you rerun your experiments using an upstream kernel (David
Miller's net tree)?

You also could post the GRO patch so that we can comment on it.

Thanks




Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`

2015-02-01 Thread Avery Pennarun
On Sun, Feb 1, 2015 at 6:34 PM, Andrew McGregor andrewm...@gmail.com wrote:
 I missed one item in my list of potential improvements: the most braindead
 thing 802.11 has to say about rates is that broadcast and multicast packets
 should be sent at 'the lowest basic rate in the current supported rate set',
 which is really wasteful.  There are a couple of ways of dealing with this:
 one, ignore the standard and pick the rate that is most likely to get the
 frame to as many neighbours as possible (by a scan of the Minstrel tables).
 Or two, fan it out as unicast, which might well take less airtime (due to
 aggregation) as well as being much more likely to be delivered, since you
 get ACKs and retries by doing that.

As far as I can see, the only sensible thing to do with
multicast/broadcast is some variation of the unicast fanout, unless
you've got a truly huge number of nodes.  I don't know of any
protocols (certainly not video streams) that actually work well with
the kind of packet loss you see at medium/long range with wifi if
retransmits aren't used.  I've heard that openwrt already has a patch
included that does this kind of fanout at the bridge layer.
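
To make the airtime argument concrete, here is a back-of-the-envelope
sketch comparing one multicast frame at the lowest basic rate with a
unicast copy per station. It counts payload bits only; preambles, ACKs,
contention, retries and aggregation are all ignored, and the function
names and any rates passed in are illustrative assumptions rather than
measurements.

```python
def airtime_us(payload_bytes, rate_mbps):
    """Payload-only airtime in microseconds (bits divided by Mbit/s)."""
    return payload_bytes * 8 / rate_mbps


def fanout_is_cheaper(payload_bytes, basic_rate_mbps, unicast_rates_mbps):
    """True if per-station unicast copies use less airtime than one multicast."""
    multicast = airtime_us(payload_bytes, basic_rate_mbps)
    fanout = sum(airtime_us(payload_bytes, r) for r in unicast_rates_mbps)
    return fanout < multicast
```

For example, with a hypothetical 6 Mbps basic rate, a 1500-byte frame costs
2000 us as multicast; five stations at 65 Mbps cost less than that in
total, while two stations at 6.5 Mbps already cost more. This is the sense
in which fanout stops paying off once there are many (or slow) receivers.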

I've also heard of a new reliable multicast in some newer 802.11
variant, which essentially sends out a single multicast packet and
expects an ACK from each intended recipient.  Other than adding
complexity, it seems like the best of both worlds.


Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`

2015-02-01 Thread Avery Pennarun
On Sun, Feb 1, 2015 at 9:43 AM,  dpr...@reed.com wrote:
 Just to clarify, managing queueing in a single access point WiFi network is
 only a small part of the problem of fixing the rapidly degrading performance
 of WiFi based systems.

Can you explain what you mean by rapidly degrading?  The performance
in odd situations is certainly not inspirational, but I haven't
noticed it getting worse over time.

 Similarly, mesh routing is only a small part of the
 problem with the scalability of cooperative meshes based on the WiFi MAC.

That's certainly true.  Not to say the mesh routing algorithms are
much good either.

  Also, as we noted
 earlier, handoff from one next hop to another is a huge problem with
 performance in practical deployments (a factor of 10x at least, just in
 that).

While there is definitely some work to be done in handoff, it seems
like there are some fine implementations of this already in existence.
Several brands of enterprise access point setups seem to do well at
this.  It would be nice if they interoperated, I guess.

The fact that there's no open source version of this kind of handoff
feature bugs me, but we are working on it here and the work is all
planned to be open source, for example: (very early version)
https://gfiber.googlesource.com/vendor/google/platform/+/master/waveguide/

 Propagation information is not used at all when 802.11 systems share a
 channel, even in single AP deployments, yet all stations can measure
 propagation quite accurately in their hardware.

802.11k seems to provide for sharing this information.  But I'm not
clear what I should use it for. :)

 Finally, Listen-before-talk is highly wasteful for two reasons: 1) any
 random radio noise from other sources unnecessarily degrades communications 
 [...]
 2) the transmitter cannot tell when the intended receiver will be perfectly
 able to decode the signal without interference with the station it hears
 (this second point is actually proven in theory in a paper by Jon Peha that
 argued against trivial etiquettes as a mechanism for sharing among
 uncooperative and non-interoperable stations).

I've thought quite a bit about your point #2 above, but I don't know
which direction to pursue.  The idea is that sometimes "just shout
over the background noise" is a globally optimal solution, right?  The
question seems to be to figure out when that is true and when it
isn't.

 I agree that, to the extent that managing queues in a single box or a single
 operating system doesn't require cooperation, it's much easier to get such
 things into the market.  That's why CeroWRT has been as effective as it has
 been.  But has Microsoft done anything at all about it?   Do the better ECN
 signals that can arise from good queue management get used by the TCP
 endpoints, or for that matter UDP-based protocol endpoints?

If we don't know the answer to the questions, then that is itself the
problem.  It's a lot easier to say, hey, ChromeOS and MacOS have good
network performance but Microsoft has bad network performance, if it's
true and we have good reproducible tests to demonstrate that.

 The reason no one is making progress on any of these particular issues is
 that there is no coordination at the systems level around creating rising
 tides that lift all boats in the WiFi-ish space.  It's all about ripping the
 competition by creating stuff that can sell better than the other guys'
 stuff, and avoiding cooperation at all costs.
 [...]
 But the big wins in making WiFi better are going begging.  As WiFi becomes
 more closed, as it will as the major Internet Access Providers and Gadget
 builders (Google, Apple) start excluding innovators in wireless from the
 market by closed, proprietary solutions, the problem WILL get worse.  You
 won't be able to fix those problems at all.  If you have a solution you will
 have to convince the oligopoly to even bother trying it.

As someone who works at Google Fiber (which is both a gadget maker and
an ISP) and who pushes all day long for our wifi stuff to be open
source, I'm slightly offended to be lumped in with other vendors in
your story :)  I think the ChromeOS team (which insists on only open
source wifi drivers in all chromebooks) would feel similarly.  We are
lucky to have defined our competitive advantage as something other
than short-lived slight improvements in wifi that will soon be
wastefully duplicated by everyone else.

That said, I see what you mean about the general state of the
industry.  The way to fix it is the way Linux always fixes it: make
the open source version so much better that building a proprietary
one, just to gather a small incremental advantage, is a huge waste of
time and effort.  Work on minstrel and fq_codel goes really far here.

 I personally think that things like promoting semi-closed, essentially
 proprietary ESSID-based bridged distribution systems as good ideas are
 counterproductive to this goal.  But that's perhaps too radical for this
 crowd.

Not 

Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`

2015-02-01 Thread David Lang

On Sun, 1 Feb 2015, Avery Pennarun wrote:


On Sun, Feb 1, 2015 at 9:43 AM,  dpr...@reed.com wrote:

I personally think that things like promoting semi-closed, essentially
proprietary ESSID-based bridged distribution systems as good ideas are
counterproductive to this goal.  But that's perhaps too radical for this
crowd.


Not sure what you mean here.  ESSID-based distribution systems seem
pretty well defined to me.  The only proprietary part is the
decision-making process for assisted roaming (ie. the inter-AP
protocol) which is only an optional performance optimization.  There
really should be an open source version of this, and I'm in fact
feebly attempting to build one, but I don't feel like the world is
falling apart through not having it.  You can build a bridged
multi-BSS ESSID today with plain out-of-the-box hostapd.


I will be running a fully opensource bridged ESSID system at SCaLE this month.
Last year we had ~2500 people and devices with ~50 APs deployed, and it worked
well. The only problem was that I needed to deploy a few more APs to cover some
of the hallway areas more reliably.


There are tricks that the commercial systems pull that I can't currently 
duplicate with opensource tools. But as Avery says, they are optimizations, not 
something required for successful operation. It would be nice to get the 
assisted roaming portion available. But it's not required.


David Lang


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-30 Thread Eric Dumazet
On Fri, 2015-01-30 at 14:47 +0100, Arend van Spriel wrote:

 Indeed and that is what we would like to address in our wireless 
 drivers. I will setup some experiments using the fraction sizing and 
 post my findings. Again sorry if I offended you.

You did not, but I had no feedback about my suggestions.

Michal sent it now.

Thanks




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-30 Thread Eric Dumazet
On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

 I've briefly tried playing with this knob to no avail unfortunately. I
 tried 256K, 1M - it didn't improve TCP performance. When I tried to
 make it smaller (e.g. 16K) the traffic dropped even more so it does
 have an effect. It seems there's some other limiting factor in this
 case.

Interesting.

Could you take some tcpdump/pcap with various tcp_limit_output_bytes
values ?

echo 131072 > /proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 2 -w 128k.pcap

echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 2 -w 256k.pcap

...

Thanks !




Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-30 Thread Arend van Spriel

On 01/29/15 14:14, Eric Dumazet wrote:

On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:

Hi,

I'm not subscribed to netdev list and I can't find the message-id so I
can't reply directly to the original thread `BW regression after tcp:
refine TSO autosizing`.

I've noticed a big TCP performance drop with ath10k
(drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
get 250mbps in my testbed.

After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
`tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
to non-TSO packets` (for conflict free reverts) fixes the problem.

My testing setup is as follows:

  a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
  b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
3.19-rc5, w/ reverts
  c) (b) w/o reverts

Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
2x2 has 866mbps modulation rate. In practice this should deliver
~700mbps of real UDP traffic.

Here are some numbers:

UDP: (b) -> (a): 672mbps
UDP: (a) -> (b): 687mbps
TCP: (b) -> (a): 526mbps
TCP: (a) -> (b): 500mbps

UDP: (c) -> (a): 669mbps*
UDP: (a) -> (c): 689mbps*
TCP: (c) -> (a): 240mbps**
TCP: (a) -> (c): 490mbps*

* no changes/within error margin
** the performance drop

I'm using iperf:
   UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
   TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20

Result values were obtained at the receiver side.

Iperf reports a few frames lost and out-of-order at each UDP test
start (during first second) but later has no packet loss and no
out-of-order. This shouldn't have any effect on a TCP session, right?

The device delivers batched up tx/rx completions (no way to change
that). I suppose this could be an issue for timing sensitive
algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
aggregation windows so there's an inherent extra (and non-uniform)
latency when compared to, e.g. ethernet devices.

The driver doesn't have GRO. I have an internal patch which implements
it. It improves overall TCP traffic (more stable, up to 600mbps TCP
which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
performance drop remains unaffected regardless.

I've tried applying stretch ACK patchset (v2) on both machines and
re-run the above tests. I got no measurable difference in performance.

I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
(c). It didn't seem to be affected by the TSO patch at all (it runs at
~360mbps of TCP regardless of the TSO patch).

Any hints/ideas?



Hi Michal

This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.

vi +630 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
 Controls TCP Small Queue limit per tcp socket.
 TCP bulk sender tends to increase packets in flight until it
 gets losses notifications. With SNDBUF autotuning, this can
 result in a large amount of packets queued in qdisc/device
 on the local machine, hurting latency of other flows, for
 typical pfifo_fast qdiscs.
 tcp_limit_output_bytes limits the number of bytes on qdisc
 or device to reduce artificial RTT/cwnd and reduce bufferbloat.
 Default: 131072

This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :

ethtool -C eth0 tx-frames 4 rx-frames 4

With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)

If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :

1) tweak tcp_limit_output_bytes, but it's not practical from a driver.

2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb->truesize at ndo_start_xmit() time as in :

if ((skb->destructor == sock_wfree ||
     skb->destructor == tcp_wfree) &&
    skb->sk) {
        u32 fraction = skb->truesize / 2;

        skb->truesize -= fraction;
        atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
}


Hi Eric,

Your suggestions are still based on the fact that you consider wireless
networking to be similar to ethernet, but as Michal indicated there are
some fundamental differences, starting with CSMA/CD versus CSMA/CA. Also,
the medium conditions are far from comparable. There is no shielding, so
it needs to deal with interference and dynamically drops the link rate,
so transmission of packets can take several milliseconds. Then with 11n
they came up with aggregation, which sends up to 64 packets in a single
transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The
parameter value for tcp_limit_output_bytes of 131072 means that it
allows queuing for about 1ms on a 1Gbps link, but I hope you can see
this is not realistic for dealing with all variances of the wireless
medium/standard. I suggested this as topic for the wireless workshop in
Ottawa [1], but
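
The "about 1ms on a 1Gbps link" figure can be checked with a line of
arithmetic. The sketch below converts a byte limit into the time needed to
drain it at a given link rate; the function name is an assumption, and the
rates are the ones mentioned in this message, treated as raw link rates
with no protocol overhead.

```python
TSQ_DEFAULT_BYTES = 131072  # default tcp_limit_output_bytes quoted above


def drain_time_ms(queued_bytes, link_mbps):
    """Milliseconds needed to transmit queued_bytes at link_mbps."""
    # bits / (Mbit/s * 1000) gives milliseconds.
    return queued_bytes * 8 / (link_mbps * 1000)


at_1gbps = drain_time_ms(TSQ_DEFAULT_BYTES, 1000)  # roughly 1.05 ms
at_6_5mbps = drain_time_ms(TSQ_DEFAULT_BYTES, 6.5)  # roughly 161 ms
```

The same byte limit therefore spans more than two orders of magnitude in
queueing time across the rates a wireless link can drop to, which is why a
single fixed value is hard to map onto the medium.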

Throughput regression with `tcp: refine TSO autosizing`

2015-01-29 Thread Michal Kazior
Hi,

I'm not subscribed to netdev list and I can't find the message-id so I
can't reply directly to the original thread `BW regression after tcp:
refine TSO autosizing`.

I've noticed a big TCP performance drop with ath10k
(drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
get 250mbps in my testbed.

After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
`tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
to non-TSO packets` (for conflict free reverts) fixes the problem.

My testing setup is as follows:

 a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
 b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
3.19-rc5, w/ reverts
 c) (b) w/o reverts

Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
2x2 has 866mbps modulation rate. In practice this should deliver
~700mbps of real UDP traffic.

Here are some numbers:

UDP: (b) -> (a): 672mbps
UDP: (a) -> (b): 687mbps
TCP: (b) -> (a): 526mbps
TCP: (a) -> (b): 500mbps

UDP: (c) -> (a): 669mbps*
UDP: (a) -> (c): 689mbps*
TCP: (c) -> (a): 240mbps**
TCP: (a) -> (c): 490mbps*

* no changes/within error margin
** the performance drop
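
For reference, the size of the change in each direction can be computed
directly from the numbers above. This is only a small sketch; the
dictionary keys ("sta" for the client (b)/(c), "ap" for (a)) and the
function names are labels chosen here, not part of the test setup.

```python
# Mbps from the table above: (with reverts, without reverts).
RESULTS = {
    "UDP sta->ap": (672, 669),
    "UDP ap->sta": (687, 689),
    "TCP sta->ap": (526, 240),
    "TCP ap->sta": (500, 490),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def summarize(results):
    """Map each test to its percent change, rounded to one decimal."""
    return {name: round(percent_change(b, a), 1) for name, (b, a) in results.items()}


summary = summarize(RESULTS)
```

Only the TCP direction in which the unreverted station transmits moves
noticeably (about -54%); the other three stay within a couple of percent.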

I'm using iperf:
  UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
  TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20

Result values were obtained at the receiver side.

Iperf reports a few frames lost and out-of-order at each UDP test
start (during first second) but later has no packet loss and no
out-of-order. This shouldn't have any effect on a TCP session, right?

The device delivers batched up tx/rx completions (no way to change
that). I suppose this could be an issue for timing sensitive
algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
aggregation windows so there's an inherent extra (and non-uniform)
latency when compared to, e.g. ethernet devices.

The driver doesn't have GRO. I have an internal patch which implements
it. It improves overall TCP traffic (more stable, up to 600mbps TCP
which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
performance drop remains unaffected regardless.

I've tried applying stretch ACK patchset (v2) on both machines and
re-run the above tests. I got no measurable difference in performance.

I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
(c). It didn't seem to be affected by the TSO patch at all (it runs at
~360mbps of TCP regardless of the TSO patch).

Any hints/ideas?


Michał


Re: Throughput regression with `tcp: refine TSO autosizing`

2015-01-29 Thread Eric Dumazet
On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
 Hi,
 
 I'm not subscribed to netdev list and I can't find the message-id so I
 can't reply directly to the original thread `BW regression after tcp:
 refine TSO autosizing`.
 
 I've noticed a big TCP performance drop with ath10k
 (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
 get 250mbps in my testbed.
 
 After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
 `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
 to non-TSO packets` (for conflict free reverts) fixes the problem.
 
 My testing setup is as follows:
 
  a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
  b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
 3.19-rc5, w/ reverts
  c) (b) w/o reverts
 
 Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
 2x2 has 866mbps modulation rate. In practice this should deliver
 ~700mbps of real UDP traffic.
 
 Here are some numbers:
 
 UDP: (b) -> (a): 672mbps
 UDP: (a) -> (b): 687mbps
 TCP: (b) -> (a): 526mbps
 TCP: (a) -> (b): 500mbps
 
 UDP: (c) -> (a): 669mbps*
 UDP: (a) -> (c): 689mbps*
 TCP: (c) -> (a): 240mbps**
 TCP: (a) -> (c): 490mbps*
 
 * no changes/within error margin
 ** the performance drop
 
 I'm using iperf:
   UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
   TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
 
 Result values were obtained at the receiver side.
 
 Iperf reports a few frames lost and out-of-order at each UDP test
 start (during first second) but later has no packet loss and no
 out-of-order. This shouldn't have any effect on a TCP session, right?
 
 The device delivers batched up tx/rx completions (no way to change
 that). I suppose this could be an issue for timing sensitive
 algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
 aggregation windows so there's an inherent extra (and non-uniform)
 latency when compared to, e.g. ethernet devices.
 
 The driver doesn't have GRO. I have an internal patch which implements
 it. It improves overall TCP traffic (more stable, up to 600mbps TCP
 which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
 performance drop remains unaffected regardless.
 
 I've tried applying stretch ACK patchset (v2) on both machines and
 re-run the above tests. I got no measurable difference in performance.
 
 I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
 (c). It didn't seem to be affected by the TSO patch at all (it runs at
 ~360mbps of TCP regardless of the TSO patch).
 
 Any hints/ideas?
 

Hi Michal

This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.

vi +630 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
gets losses notifications. With SNDBUF autotuning, this can
result in a large amount of packets queued in qdisc/device
on the local machine, hurting latency of other flows, for
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
Default: 131072

This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :

ethtool -C eth0 tx-frames 4 rx-frames 4

With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)

If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :

1) tweak tcp_limit_output_bytes, but it's not practical from a driver.

2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb->truesize at ndo_start_xmit() time as in :

if ((skb->destructor == sock_wfree ||
     skb->destructor == tcp_wfree) &&
    skb->sk) {
        u32 fraction = skb->truesize / 2;

        skb->truesize -= fraction;
        atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
}
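
The effect of that driver-side adjustment can be illustrated with a toy
model, written in Python rather than kernel C. It assumes the sender keeps
queueing packets until the bytes charged to the socket exceed the limit,
and that no TX completions arrive in the meantime; the real TSQ check is
more involved. The function name and the 4096-byte truesize are made up
for illustration.

```python
def skbs_in_flight(limit_bytes, truesize, halve_in_driver):
    """Count packets queued before the charged bytes exceed limit_bytes."""
    # With the adjustment, the driver hands back truesize // 2 at xmit time,
    # so only the remainder stays charged against the socket.
    charge = truesize - truesize // 2 if halve_in_driver else truesize
    charged = 0
    admitted = 0
    while charged <= limit_bytes:
        charged += charge
        admitted += 1
    return admitted


without_fix = skbs_in_flight(131072, 4096, halve_in_driver=False)
with_fix = skbs_in_flight(131072, 4096, halve_in_driver=True)
```

In this model, halving the charge roughly doubles how much a single flow
can have queued below the stack (33 versus 65 packets here) without
touching the system-wide sysctl.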

Thanks.

