Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, Sep 15, 2016 at 3:11 PM, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use it per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo
> Cc: pravin shelar

Looks good.

Acked-by: Pravin B Shelar
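For readers following the thread, here is a minimal userspace sketch of the idea described in the commit message: every CPU updates only its own slot, so writers do not contend on one shared counter, and a reader sums all slots. This is not the openvswitch code. NR_SLOTS, struct flow_stats_slot, stats_update() and stats_sum() are hypothetical names, and the sketch assumes the caller passes the id of the CPU it is running on.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 24	/* hypothetical: large enough for CPU ids 0-23 */

struct flow_stats_slot {
	uint64_t packets;
	uint64_t bytes;
};

struct flow_stats {
	struct flow_stats_slot slot[NR_SLOTS];
};

/* Writer path: touch only the slot that belongs to the given CPU. */
static void stats_update(struct flow_stats *st, unsigned int cpu, uint64_t len)
{
	if (cpu >= NR_SLOTS)
		return;	/* sketch only: ignore CPU ids outside the table */
	st->slot[cpu].packets += 1;
	st->slot[cpu].bytes += len;
}

/* Reader path: walk every slot and add the counters together. */
static struct flow_stats_slot stats_sum(const struct flow_stats *st)
{
	struct flow_stats_slot total = { 0, 0 };
	size_t i;

	for (i = 0; i < NR_SLOTS; i++) {
		total.packets += st->slot[i].packets;
		total.bytes += st->slot[i].bytes;
	}
	return total;
}
```

The trade-off is that reads become more expensive (one pass over all slots) in exchange for cheaper writes on the hot path. Slots for CPU ids that are never used (8-15 in the tested 0-7,16-23 layout) simply stay zero.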
Re: [PATCH v2 1/2] openvswitch: fix flow stats accounting when node 0 is not possible
On Thu, Sep 15, 2016 at 3:11 PM, Thadeu Lima de Souza Cascardo wrote:
> On a system with only node 1 as possible, all statistics are going to be
> accounted on node 0 as it will have a single writer.
>
> However, when getting and clearing the statistics, node 0 is not going
> to be considered, as it's not a possible node.
>
> Tested that statistics are not zero on a system with only node 1
> possible. Also compile-tested with CONFIG_NUMA off.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo

Acked-by: Pravin B Shelar
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
On Sat, 2016-09-17 at 16:38 -0700, Florian Fainelli wrote:
> 2016-09-17 16:23 GMT-07:00 Joe Perches:
> > On Sat, 2016-09-17 at 16:17 -0700, Florian Fainelli wrote:
> > > The list does not accept public subscribers, so this is the correct
> > > entry to use.
> > Then M: is definitely _not_ the correct entry for this
> > and it should be:
> > L: bcm-kernel-feedback-l...@broadcom.com (subscribers-only)
> Olof indicated otherwise, so who is right
> here? https://www.spinics.net/lists/arm-kernel/msg512572.html
> prompting to this patch series:
> http://linux-arm-kernel.infradead.narkive.com/CRyvGOKd/patch-maintainers-change-l-to-

This hasn't been applied to -next, and I looked at the existing L: entries.

No worries here either if it's an exploder and not a mailing list.

Pity it's not called something like bcm-linux-driv...@broadcom.com if
it's really an exploder.
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
On Sat, Sep 17, 2016 at 4:17 PM, Florian Fainelli wrote:
> 2016-09-17 15:51 GMT-07:00 Joe Perches:
>> On Sat, 2016-09-17 at 15:27 -0700, Florian Fainelli wrote:
>>> Gary has not been with Broadcom for some time now, replace his address
>>> with the internal mailing-list used for other entries.
>>>
>>> Signed-off-by: Florian Fainelli
>>> ---
>>> Michael,
>>>
>>> Since this is an old driver, not sure who could step up as a maintainer
>>> for b44?
>> []
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>> []
>>> @@ -2500,8 +2500,8 @@ S: Supported
>>
>>> F:kernel/bpf/
>>>
>>> BROADCOM B44 10/100 ETHERNET DRIVER
>>> -M: Gary Zambrano
>>> L: netdev@vger.kernel.org
>>> +M: bcm-kernel-feedback-l...@broadcom.com
>>> S: Supported
>>> F: drivers/net/ethernet/broadcom/b44.*
>>
>> Without an actual maintainer, this should really be
>> orphan and not supported.
>
> I would like to hear from Michael before concluding that

I worked on this NIC more than 10 years ago. Last time I checked, I
no longer had this NIC after moving offices several times. I don't mind
being the maintainer if no one else who is more suitable and has access
to the hardware wants to do it.
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
2016-09-17 16:23 GMT-07:00 Joe Perches:
> On Sat, 2016-09-17 at 16:17 -0700, Florian Fainelli wrote:
>> 2016-09-17 15:51 GMT-07:00 Joe Perches :
> []
>> > Without an actual maintainer, this should really be
>> > orphan and not supported.
>> I would like to hear from Michael before concluding that
>
> No worries.
>
>> > And the M: bcm-kernel-feedback-list@ should be L:
>> The list does not accept public subscribers, so this is the correct
>> entry to use.
>
> Then M: is definitely _not_ the correct entry for this
> and it should be:
>
> L: bcm-kernel-feedback-l...@broadcom.com (subscribers-only)

Olof indicated otherwise, so who is right here?

https://www.spinics.net/lists/arm-kernel/msg512572.html

prompting to this patch series:

http://linux-arm-kernel.infradead.narkive.com/CRyvGOKd/patch-maintainers-change-l-to-m-for-broadcom-arm-soc-entries
--
Florian
[PATCH net-next 00/11] rxrpc: Tracepoint addition and improvement
Here is a set of patches that add some more tracepoints and improve a couple
of existing ones. New additions include:

 (1) Connection refcount tracking.

 (2) Client connection state machine tracking.

 (3) Tx and Rx packet lifecycle.

 (4) ACK reception and transmission.

 (5) recvmsg processing.

Updates include:

 (1) Print the symbolic packet name in the Rx packet tracepoint.

 (2) Additional call refcount trace events.

 (3) Improvements to sk_buff tracking with AF_RXRPC.

In addition:

 (1) Config option to inject packet loss during both transmission and
     reception.

 (2) Removal of some printks.

This series needs to be applied on top of the previously posted fixes.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160917-2

David
---
David Howells (11):
      rxrpc: Print the packet type name in the Rx packet trace
      rxrpc: Add some additional call tracing
      rxrpc: Add connection tracepoint and client conn state tracepoint
      rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer
      rxrpc: Add a tracepoint to log received ACK packets
      rxrpc: Add a tracepoint to log ACK transmission
      rxrpc: Add a tracepoint to follow packets in the Rx buffer
      rxrpc: Add a tracepoint to follow what recvmsg does
      rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var
      rxrpc: Improve skb tracing
      rxrpc: Add config to inject packet loss

 include/trace/events/rxrpc.h | 226 --
 net/rxrpc/Kconfig            |   7 +
 net/rxrpc/af_rxrpc.c         |   5 +
 net/rxrpc/ar-internal.h      | 159 +++---
 net/rxrpc/call_accept.c      |   7 +
 net/rxrpc/call_event.c       |   8 +
 net/rxrpc/call_object.c      |  31 --
 net/rxrpc/conn_client.c      |  82 +++
 net/rxrpc/conn_event.c       |  11 +-
 net/rxrpc/conn_object.c      |  72 +
 net/rxrpc/conn_service.c     |   4 +
 net/rxrpc/input.c            |  31 --
 net/rxrpc/local_event.c      |   4 -
 net/rxrpc/misc.c             |  81 +++
 net/rxrpc/output.c           |  20 +++-
 net/rxrpc/peer_event.c       |  10 +-
 net/rxrpc/recvmsg.c          |  60 ---
 net/rxrpc/sendmsg.c          |  19 ++--
 net/rxrpc/skbuff.c           |  53 +++---
 19 files changed, 740 insertions(+), 150 deletions(-)
[PATCH net-next 02/11] rxrpc: Add some additional call tracing
Add additional call tracepoint points for noting call-connected, call-released and connection-failed events. Also fix one tracepoint that was using an integer instead of the corresponding enum value as the point type. Signed-off-by: David Howells--- net/rxrpc/ar-internal.h |3 +++ net/rxrpc/call_object.c | 18 ++ 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 0f6fafa2c271..4a73c20d9436 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -539,6 +539,8 @@ enum rxrpc_call_trace { rxrpc_call_queued, rxrpc_call_queued_ref, rxrpc_call_seen, + rxrpc_call_connected, + rxrpc_call_release, rxrpc_call_got, rxrpc_call_got_userid, rxrpc_call_got_kernel, @@ -546,6 +548,7 @@ enum rxrpc_call_trace { rxrpc_call_put_userid, rxrpc_call_put_kernel, rxrpc_call_put_noqueue, + rxrpc_call_error, rxrpc_call__nr_trace }; diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c index 23f5a5f58282..0df9d1af8edb 100644 --- a/net/rxrpc/call_object.c +++ b/net/rxrpc/call_object.c @@ -53,6 +53,8 @@ const char rxrpc_call_traces[rxrpc_call__nr_trace][4] = { [rxrpc_call_new_service]= "NWs", [rxrpc_call_queued] = "QUE", [rxrpc_call_queued_ref] = "QUR", + [rxrpc_call_connected] = "CON", + [rxrpc_call_release]= "RLS", [rxrpc_call_seen] = "SEE", [rxrpc_call_got]= "GOT", [rxrpc_call_got_userid] = "Gus", @@ -61,6 +63,7 @@ const char rxrpc_call_traces[rxrpc_call__nr_trace][4] = { [rxrpc_call_put_userid] = "Pus", [rxrpc_call_put_kernel] = "Pke", [rxrpc_call_put_noqueue]= "PNQ", + [rxrpc_call_error] = "*E*", }; struct kmem_cache *rxrpc_call_jar; @@ -222,8 +225,8 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, return call; } - trace_rxrpc_call(call, 0, atomic_read(>usage), here, -(const void *)user_call_ID); + trace_rxrpc_call(call, rxrpc_call_new_client, atomic_read(>usage), +here, (const void *)user_call_ID); /* Publish the call, even though it is incompletely set up as yet */ 
write_lock(>call_lock); @@ -263,6 +266,9 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, if (ret < 0) goto error; + trace_rxrpc_call(call, rxrpc_call_connected, atomic_read(>usage), +here, ERR_PTR(ret)); + spin_lock_bh(>conn->params.peer->lock); hlist_add_head(>error_link, >conn->params.peer->error_targets); @@ -287,6 +293,8 @@ error_dup_user_ID: error: __rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR, RX_CALL_DEAD, ret); + trace_rxrpc_call(call, rxrpc_call_error, atomic_read(>usage), +here, ERR_PTR(ret)); rxrpc_release_call(rx, call); rxrpc_put_call(call, rxrpc_call_put); _leave(" = %d", ret); @@ -396,15 +404,17 @@ void rxrpc_get_call(struct rxrpc_call *call, enum rxrpc_call_trace op) */ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call) { + const void *here = __builtin_return_address(0); struct rxrpc_connection *conn = call->conn; bool put = false; int i; _enter("{%d,%d}", call->debug_id, atomic_read(>usage)); - ASSERTCMP(call->state, ==, RXRPC_CALL_COMPLETE); + trace_rxrpc_call(call, rxrpc_call_release, atomic_read(>usage), +here, (const void *)call->flags); - rxrpc_see_call(call); + ASSERTCMP(call->state, ==, RXRPC_CALL_COMPLETE); spin_lock_bh(>lock); if (test_and_set_bit(RXRPC_CALL_RELEASED, >flags))
[PATCH net-next 04/11] rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer
Add a tracepoint to follow the insertion of a packet into the transmit buffer, its transmission and its rotation out of the buffer. Signed-off-by: David Howells--- include/trace/events/rxrpc.h | 26 ++ net/rxrpc/ar-internal.h | 12 net/rxrpc/input.c|2 ++ net/rxrpc/misc.c |9 + net/rxrpc/sendmsg.c |9 - 5 files changed, 57 insertions(+), 1 deletion(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index c0c496c83f31..ffc74b3e5b76 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -208,6 +208,32 @@ TRACE_EVENT(rxrpc_abort, __entry->abort_code, __entry->error, __entry->why) ); +TRACE_EVENT(rxrpc_transmit, + TP_PROTO(struct rxrpc_call *call, enum rxrpc_transmit_trace why), + + TP_ARGS(call, why), + + TP_STRUCT__entry( + __field(struct rxrpc_call *,call) + __field(enum rxrpc_transmit_trace, why ) + __field(rxrpc_seq_t,tx_hard_ack ) + __field(rxrpc_seq_t,tx_top ) +), + + TP_fast_assign( + __entry->call = call; + __entry->why = why; + __entry->tx_hard_ack = call->tx_hard_ack; + __entry->tx_top = call->tx_top; + ), + + TP_printk("c=%p %s f=%08x n=%u", + __entry->call, + rxrpc_transmit_traces[__entry->why], + __entry->tx_hard_ack + 1, + __entry->tx_top - __entry->tx_hard_ack) + ); + #endif /* _TRACE_RXRPC_H */ /* This part must be outside protection */ diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 6ca40eea3022..afa5dcc05fe0 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -593,6 +593,18 @@ enum rxrpc_call_trace { extern const char rxrpc_call_traces[rxrpc_call__nr_trace][4]; +enum rxrpc_transmit_trace { + rxrpc_transmit_wait, + rxrpc_transmit_queue, + rxrpc_transmit_queue_reqack, + rxrpc_transmit_queue_last, + rxrpc_transmit_rotate, + rxrpc_transmit_end, + rxrpc_transmit__nr_trace +}; + +extern const char rxrpc_transmit_traces[rxrpc_transmit__nr_trace][4]; + extern const char *const rxrpc_pkts[]; extern const char *rxrpc_acks(u8 reason); diff --git a/net/rxrpc/input.c 
b/net/rxrpc/input.c index c1f83d22f9b7..c7eb5104e91a 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -59,6 +59,7 @@ static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to) spin_unlock(>lock); + trace_rxrpc_transmit(call, rxrpc_transmit_rotate); wake_up(>waitq); while (list) { @@ -107,6 +108,7 @@ static bool rxrpc_end_tx_phase(struct rxrpc_call *call, const char *abort_why) } write_unlock(>state_lock); + trace_rxrpc_transmit(call, rxrpc_transmit_end); _leave(" = ok"); return true; } diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c index 598064d3bdd2..dca89995f03e 100644 --- a/net/rxrpc/misc.c +++ b/net/rxrpc/misc.c @@ -132,3 +132,12 @@ const char rxrpc_client_traces[rxrpc_client__nr_trace][7] = { [rxrpc_client_to_waiting] = "->Wait", [rxrpc_client_uncount] = "Uncoun", }; + +const char rxrpc_transmit_traces[rxrpc_transmit__nr_trace][4] = { + [rxrpc_transmit_wait] = "WAI", + [rxrpc_transmit_queue] = "QUE", + [rxrpc_transmit_queue_reqack] = "QRA", + [rxrpc_transmit_queue_last] = "QLS", + [rxrpc_transmit_rotate] = "ROT", + [rxrpc_transmit_end]= "END", +}; diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c index 8bfddf4e338c..28d8f73cf11d 100644 --- a/net/rxrpc/sendmsg.c +++ b/net/rxrpc/sendmsg.c @@ -56,6 +56,7 @@ static int rxrpc_wait_for_tx_window(struct rxrpc_sock *rx, break; } + trace_rxrpc_transmit(call, rxrpc_transmit_wait); release_sock(>sk); *timeo = schedule_timeout(*timeo); lock_sock(>sk); @@ -104,8 +105,14 @@ static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb, smp_wmb(); call->rxtx_buffer[ix] = skb; call->tx_top = seq; - if (last) + if (last) { set_bit(RXRPC_CALL_TX_LAST, >flags); + trace_rxrpc_transmit(call, rxrpc_transmit_queue_last); + } else if (sp->hdr.flags & RXRPC_REQUEST_ACK) { + trace_rxrpc_transmit(call, rxrpc_transmit_queue_reqack); + } else { + trace_rxrpc_transmit(call, rxrpc_transmit_queue); + } if (last || call->state ==
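As a reading aid for the f= and n= fields printed by the rxrpc_transmit tracepoint above, the following sketch reproduces the two expressions from TP_printk(): the first is tx_hard_ack + 1, the second is tx_top - tx_hard_ack. It assumes the sequence type is an unsigned 32-bit integer, so both expressions stay correct across wraparound. seq_t, tx_window_first() and tx_window_count() are hypothetical names that do not appear in the patch.

```c
#include <stdint.h>

typedef uint32_t seq_t;	/* assumption: sequence numbers are unsigned 32-bit */

/* First sequence number not yet hard-ACK'd (printed as f=). */
static seq_t tx_window_first(seq_t tx_hard_ack)
{
	return tx_hard_ack + 1;
}

/* Packets between the hard-ACK point and the top of the buffer (printed as n=). */
static uint32_t tx_window_count(seq_t tx_top, seq_t tx_hard_ack)
{
	return tx_top - tx_hard_ack;
}
```

Because the subtraction is done in unsigned arithmetic, a top of 2 with a hard-ACK point of 0xfffffffe still yields a count of 4.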
[PATCH net-next 11/11] rxrpc: Add config to inject packet loss
Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.

Note that no locking is used, but it shouldn't really matter.

Signed-off-by: David Howells
---

 net/rxrpc/Kconfig  |    7 +++
 net/rxrpc/input.c  |    8 
 net/rxrpc/output.c |    9 +
 3 files changed, 24 insertions(+)

diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig
index 13396c74b5c1..86f8853a038c 100644
--- a/net/rxrpc/Kconfig
+++ b/net/rxrpc/Kconfig
@@ -26,6 +26,13 @@ config AF_RXRPC_IPV6
 	  Say Y here to allow AF_RXRPC to use IPV6 UDP as well as IPV4 UDP as
 	  its network transport.
+config AF_RXRPC_INJECT_LOSS
+	bool "Inject packet loss into RxRPC packet stream"
+	depends on AF_RXRPC
+	help
+	  Say Y here to inject packet loss by discarding some received and some
+	  transmitted packets.
+
 config AF_RXRPC_DEBUG
 	bool "RxRPC dynamic debugging"
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 84bb16d47b85..7ac1edf3aac7 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -712,6 +712,14 @@ void rxrpc_data_ready(struct sock *udp_sk)
 	skb_orphan(skb);
 	sp = rxrpc_skb(skb);
+	if (IS_ENABLED(CONFIG_AF_RXRPC_INJECT_LOSS)) {
+		static int lose;
+		if ((lose++ & 7) == 7) {
+			rxrpc_lose_skb(skb, rxrpc_skb_rx_lost);
+			return;
+		}
+	}
+
 	_net("Rx UDP packet from %08x:%04hu",
 	     ntohl(ip_hdr(skb)->saddr), ntohs(udp_hdr(skb)->source));
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index a2cad5ce7416..16e18a94ffa6 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -225,6 +225,15 @@ int rxrpc_send_data_packet(struct rxrpc_connection *conn, struct sk_buff *skb)
 	msg.msg_controllen = 0;
 	msg.msg_flags = 0;
+	if (IS_ENABLED(CONFIG_AF_RXRPC_INJECT_LOSS)) {
+		static int lose;
+		if ((lose++ & 7) == 7) {
+			rxrpc_lose_skb(skb, rxrpc_skb_tx_lost);
+			_leave(" = 0 [lose]");
+			return 0;
+		}
+	}
+
 	/* send the packet with the don't fragment bit set if we currently
 	 * think it's small enough */
 	if (skb->len - sizeof(struct rxrpc_wire_header) < conn->params.peer->maxdata) {
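The drop test in this patch, (lose++ & 7) == 7, can be tried out in isolation. The sketch below is a userspace illustration only: should_drop() and count_drops() are hypothetical helpers, and the counter is passed in explicitly and made unsigned rather than being a function-local static int as in the patch.

```c
#include <stdbool.h>

/*
 * Post-increment the counter and report a drop when the low three bits of
 * the old value are all set, i.e. on calls 8, 16, 24, ...
 */
static bool should_drop(unsigned int *counter)
{
	return ((*counter)++ & 7) == 7;
}

/* Feed n_packets through the check and count how many would be discarded. */
static unsigned int count_drops(unsigned int n_packets)
{
	unsigned int counter = 0, dropped = 0, i;

	for (i = 0; i < n_packets; i++)
		if (should_drop(&counter))
			dropped++;
	return dropped;
}
```

With a single caller the pattern is exactly one drop in eight. In the patch the counter is shared and updated without locking, which is presumably why the description says "approximately" every 8th packet.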
[PATCH net-next 05/11] rxrpc: Add a tracepoint to log received ACK packets
Add a tracepoint to log information from received ACK packets.

Signed-off-by: David Howells
---

 include/trace/events/rxrpc.h | 26 ++
 net/rxrpc/input.c            |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index ffc74b3e5b76..2b19f3fa5174 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -234,6 +234,32 @@ TRACE_EVENT(rxrpc_transmit,
 		      __entry->tx_top - __entry->tx_hard_ack)
 	    );
+TRACE_EVENT(rxrpc_rx_ack,
+	    TP_PROTO(struct rxrpc_call *call, rxrpc_seq_t first, u8 reason, u8 n_acks),
+
+	    TP_ARGS(call, first, reason, n_acks),
+
+	    TP_STRUCT__entry(
+		    __field(struct rxrpc_call *, call)
+		    __field(rxrpc_seq_t, first )
+		    __field(u8, reason )
+		    __field(u8, n_acks )
+			     ),
+
+	    TP_fast_assign(
+		    __entry->call = call;
+		    __entry->first = first;
+		    __entry->reason = reason;
+		    __entry->n_acks = n_acks;
+			   ),
+
+	    TP_printk("c=%p %s f=%08x n=%u",
+		      __entry->call,
+		      rxrpc_acks(__entry->reason),
+		      __entry->first,
+		      __entry->n_acks)
+	    );
+
 #endif /* _TRACE_RXRPC_H */
 /* This part must be outside protection */
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index c7eb5104e91a..7b18ca124978 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -440,6 +440,8 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 	hard_ack = first_soft_ack - 1;
 	nr_acks = buf.ack.nAcks;
+	trace_rxrpc_rx_ack(call, first_soft_ack, buf.ack.reason, nr_acks);
+
 	_proto("Rx ACK %%%u { m=%hu f=#%u p=#%u s=%%%u r=%s n=%u }",
 	       sp->hdr.serial,
 	       ntohs(buf.ack.maxSkew),
[PATCH net-next 08/11] rxrpc: Add a tracepoint to follow what recvmsg does
Add a tracepoint to follow what recvmsg does within AF_RXRPC. Signed-off-by: David Howells--- include/trace/events/rxrpc.h | 34 ++ net/rxrpc/ar-internal.h | 17 + net/rxrpc/misc.c | 14 ++ net/rxrpc/recvmsg.c | 34 ++ 4 files changed, 91 insertions(+), 8 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 7dd5f0188681..58732202e9f0 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -323,6 +323,40 @@ TRACE_EVENT(rxrpc_receive, __entry->top) ); +TRACE_EVENT(rxrpc_recvmsg, + TP_PROTO(struct rxrpc_call *call, enum rxrpc_recvmsg_trace why, +rxrpc_seq_t seq, unsigned int offset, unsigned int len, +int ret), + + TP_ARGS(call, why, seq, offset, len, ret), + + TP_STRUCT__entry( + __field(struct rxrpc_call *,call) + __field(enum rxrpc_recvmsg_trace, why ) + __field(rxrpc_seq_t,seq ) + __field(unsigned int, offset ) + __field(unsigned int, len ) + __field(int,ret ) +), + + TP_fast_assign( + __entry->call = call; + __entry->why = why; + __entry->seq = seq; + __entry->offset = offset; + __entry->len = len; + __entry->ret = ret; + ), + + TP_printk("c=%p %s q=%08x o=%u l=%u ret=%d", + __entry->call, + rxrpc_recvmsg_traces[__entry->why], + __entry->seq, + __entry->offset, + __entry->len, + __entry->ret) + ); + #endif /* _TRACE_RXRPC_H */ /* This part must be outside protection */ diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index e5d2f2fb8e41..a17341d2df3d 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -617,6 +617,23 @@ enum rxrpc_receive_trace { extern const char rxrpc_receive_traces[rxrpc_receive__nr_trace][4]; +enum rxrpc_recvmsg_trace { + rxrpc_recvmsg_enter, + rxrpc_recvmsg_wait, + rxrpc_recvmsg_dequeue, + rxrpc_recvmsg_hole, + rxrpc_recvmsg_next, + rxrpc_recvmsg_cont, + rxrpc_recvmsg_full, + rxrpc_recvmsg_data_return, + rxrpc_recvmsg_terminal, + rxrpc_recvmsg_to_be_accepted, + rxrpc_recvmsg_return, + rxrpc_recvmsg__nr_trace +}; + +extern const char 
rxrpc_recvmsg_traces[rxrpc_recvmsg__nr_trace][5]; + extern const char *const rxrpc_pkts[]; extern const char *rxrpc_acks(u8 reason); diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c index db5f1d54fc90..c7065d893d1e 100644 --- a/net/rxrpc/misc.c +++ b/net/rxrpc/misc.c @@ -150,3 +150,17 @@ const char rxrpc_receive_traces[rxrpc_receive__nr_trace][4] = { [rxrpc_receive_rotate] = "ROT", [rxrpc_receive_end] = "END", }; + +const char rxrpc_recvmsg_traces[rxrpc_recvmsg__nr_trace][5] = { + [rxrpc_recvmsg_enter] = "ENTR", + [rxrpc_recvmsg_wait]= "WAIT", + [rxrpc_recvmsg_dequeue] = "DEQU", + [rxrpc_recvmsg_hole]= "HOLE", + [rxrpc_recvmsg_next]= "NEXT", + [rxrpc_recvmsg_cont]= "CONT", + [rxrpc_recvmsg_full]= "FULL", + [rxrpc_recvmsg_data_return] = "DATA", + [rxrpc_recvmsg_terminal]= "TERM", + [rxrpc_recvmsg_to_be_accepted] = "TBAC", + [rxrpc_recvmsg_return] = "RETN", +}; diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c index 22d51087c580..b62a08151895 100644 --- a/net/rxrpc/recvmsg.c +++ b/net/rxrpc/recvmsg.c @@ -94,6 +94,8 @@ static int rxrpc_recvmsg_term(struct rxrpc_call *call, struct msghdr *msg) break; } + trace_rxrpc_recvmsg(call, rxrpc_recvmsg_terminal, call->rx_hard_ack, + call->rx_pkt_offset, call->rx_pkt_len, ret); return ret; } @@ -124,6 +126,7 @@ static int rxrpc_recvmsg_new_call(struct rxrpc_sock *rx, write_unlock(>call_lock); } + trace_rxrpc_recvmsg(call, rxrpc_recvmsg_to_be_accepted, 1, 0, 0, ret); return ret; } @@ -310,8 +313,11 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call, for (seq = hard_ack + 1; before_eq(seq, top); seq++) { ix = seq & RXRPC_RXTX_BUFF_MASK; skb = call->rxtx_buffer[ix]; - if (!skb) + if (!skb) { +
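The name table added here is declared with a second dimension of 5 because each label is four characters plus the terminating NUL. A cut-down sketch of the same enum-indexed table pattern follows. The enum, the three sample entries and the bounds-checked recvmsg_trace_name() helper are illustrative only and are not part of the patch, which indexes the array directly.

```c
enum recvmsg_trace {
	recvmsg_enter,
	recvmsg_wait,
	recvmsg_return,
	recvmsg__nr_trace	/* number of entries; must stay last */
};

/* Each row is 5 bytes: four visible characters plus the terminating NUL. */
static const char recvmsg_traces[recvmsg__nr_trace][5] = {
	[recvmsg_enter]  = "ENTR",
	[recvmsg_wait]   = "WAIT",
	[recvmsg_return] = "RETN",
};

/* Look up a label, falling back to a placeholder for out-of-range values. */
static const char *recvmsg_trace_name(unsigned int why)
{
	if (why >= (unsigned int)recvmsg__nr_trace)
		return "UNKN";
	return recvmsg_traces[why];
}
```

Designated initialisers keep each label next to its enum value, so reordering the enum does not silently shift the names.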
[PATCH net-next 10/11] rxrpc: Improve skb tracing
Improve sk_buff tracing within AF_RXRPC by the following means: (1) Use an enum to note the event type rather than plain integers and use an array of event names rather than a big multi ?: list. (2) Distinguish Rx from Tx packets and account them separately. This requires the call phase to be tracked so that we know what we might find in rxtx_buffer[]. (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the event type. (4) A pair of 'rotate' events are added to indicate packets that are about to be rotated out of the Rx and Tx windows. (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for packet loss injection recording. Signed-off-by: David Howells--- include/trace/events/rxrpc.h | 12 +++--- net/rxrpc/af_rxrpc.c |5 ++-- net/rxrpc/ar-internal.h | 33 ++ net/rxrpc/call_event.c |8 +++--- net/rxrpc/call_object.c | 11 ++--- net/rxrpc/conn_event.c |6 ++--- net/rxrpc/input.c| 13 ++ net/rxrpc/local_event.c |4 ++- net/rxrpc/misc.c | 18 ++ net/rxrpc/output.c |4 ++- net/rxrpc/peer_event.c | 10 net/rxrpc/recvmsg.c |7 +++--- net/rxrpc/sendmsg.c | 10 net/rxrpc/skbuff.c | 53 ++ 14 files changed, 131 insertions(+), 63 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 58732202e9f0..75a5d8bf50e1 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -107,14 +107,14 @@ TRACE_EVENT(rxrpc_call, ); TRACE_EVENT(rxrpc_skb, - TP_PROTO(struct sk_buff *skb, int op, int usage, int mod_count, -const void *where), + TP_PROTO(struct sk_buff *skb, enum rxrpc_skb_trace op, +int usage, int mod_count, const void *where), TP_ARGS(skb, op, usage, mod_count, where), TP_STRUCT__entry( __field(struct sk_buff *, skb ) - __field(int,op ) + __field(enum rxrpc_skb_trace, op ) __field(int,usage ) __field(int,mod_count ) __field(const void *, where ) @@ -130,11 +130,7 @@ TRACE_EVENT(rxrpc_skb, TP_printk("s=%p %s u=%d m=%d p=%pSR", __entry->skb, - (__entry->op == 0 ? "NEW" : - __entry->op == 1 ? 
"SEE" : - __entry->op == 2 ? "GET" : - __entry->op == 3 ? "FRE" : - "PUR"), + rxrpc_skb_traces[__entry->op], __entry->usage, __entry->mod_count, __entry->where) diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index 09f81befc705..8dbf7bed2cc4 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -45,7 +45,7 @@ u32 rxrpc_epoch; atomic_t rxrpc_debug_id; /* count of skbs currently in use */ -atomic_t rxrpc_n_skbs; +atomic_t rxrpc_n_tx_skbs, rxrpc_n_rx_skbs; struct workqueue_struct *rxrpc_workqueue; @@ -867,7 +867,8 @@ static void __exit af_rxrpc_exit(void) proto_unregister(_proto); rxrpc_destroy_all_calls(); rxrpc_destroy_all_connections(); - ASSERTCMP(atomic_read(_n_skbs), ==, 0); + ASSERTCMP(atomic_read(_n_tx_skbs), ==, 0); + ASSERTCMP(atomic_read(_n_rx_skbs), ==, 0); rxrpc_destroy_all_locals(); remove_proc_entry("rxrpc_conns", init_net.proc_net); diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index a17341d2df3d..034f525f2235 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -520,6 +520,7 @@ struct rxrpc_call { rxrpc_seq_t rx_expect_next; /* Expected next packet sequence number */ u8 rx_winsize; /* Size of Rx window */ u8 tx_winsize; /* Maximum size of Tx window */ + booltx_phase; /* T if transmission phase, F if receive phase */ u8 nr_jumbo_bad; /* Number of jumbo dups/exceeds-windows */ /* receive-phase ACK management */ @@ -534,6 +535,27 @@ struct rxrpc_call { rxrpc_serial_t acks_latest;/* serial number of latest ACK received */ }; +enum rxrpc_skb_trace { + rxrpc_skb_rx_cleaned, + rxrpc_skb_rx_freed, + rxrpc_skb_rx_got, + rxrpc_skb_rx_lost, + rxrpc_skb_rx_received, + rxrpc_skb_rx_rotated, + rxrpc_skb_rx_purged, + rxrpc_skb_rx_seen, + rxrpc_skb_tx_cleaned, +
[PATCH net-next 09/11] rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var
Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one
uses an uninitialised variable.

Signed-off-by: David Howells
---

 net/rxrpc/recvmsg.c |    8 
 1 file changed, 8 deletions(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index b62a08151895..79e65668bc58 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -296,8 +296,6 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call,
 	unsigned int rx_pkt_offset, rx_pkt_len;
 	int ix, copy, ret = -EAGAIN, ret2;
-	_enter("");
-
 	rx_pkt_offset = call->rx_pkt_offset;
 	rx_pkt_len = call->rx_pkt_len;
@@ -343,8 +341,6 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call,
 			trace_rxrpc_recvmsg(call, rxrpc_recvmsg_cont, seq,
 					    rx_pkt_offset, rx_pkt_len, 0);
 		}
-		_debug("recvmsg %x DATA #%u { %d, %d }",
-		       sp->hdr.callNumber, seq, rx_pkt_offset, rx_pkt_len);
 		/* We have to handle short, empty and used-up DATA packets. */
 		remain = len - *_offset;
@@ -360,8 +356,6 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call,
 		}
 		/* handle piecemeal consumption of data packets */
-		_debug("copied %d @%zu", copy, *_offset);
-
 		rx_pkt_offset += copy;
 		rx_pkt_len -= copy;
 		*_offset += copy;
@@ -370,7 +364,6 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call,
 		if (rx_pkt_len > 0) {
 			trace_rxrpc_recvmsg(call, rxrpc_recvmsg_full, seq,
 					    rx_pkt_offset, rx_pkt_len, 0);
-			_debug("buffer full");
 			ASSERTCMP(*_offset, ==, len);
 			ret = 0;
 			break;
@@ -398,7 +391,6 @@ out:
 done:
 	trace_rxrpc_recvmsg(call, rxrpc_recvmsg_data_return, seq,
 			    rx_pkt_offset, rx_pkt_len, ret);
-	_leave(" = %d [%u/%u]", ret, seq, top);
 	return ret;
 }
[PATCH net-next 06/11] rxrpc: Add a tracepoint to log ACK transmission
Add a tracepoint to log information about ACK transmission.

Signed-off-by: David Howells
---

 include/trace/events/rxrpc.h | 30 ++
 net/rxrpc/conn_event.c       |  3 +++
 net/rxrpc/output.c           |  7 ++-
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 2b19f3fa5174..d545d692ae22 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -260,6 +260,36 @@ TRACE_EVENT(rxrpc_rx_ack,
 		      __entry->n_acks)
 	    );
+TRACE_EVENT(rxrpc_tx_ack,
+	    TP_PROTO(struct rxrpc_call *call, rxrpc_seq_t first,
+		     rxrpc_serial_t serial, u8 reason, u8 n_acks),
+
+	    TP_ARGS(call, first, serial, reason, n_acks),
+
+	    TP_STRUCT__entry(
+		    __field(struct rxrpc_call *, call)
+		    __field(rxrpc_seq_t, first )
+		    __field(rxrpc_serial_t, serial )
+		    __field(u8, reason )
+		    __field(u8, n_acks )
+			     ),
+
+	    TP_fast_assign(
+		    __entry->call = call;
+		    __entry->first = first;
+		    __entry->serial = serial;
+		    __entry->reason = reason;
+		    __entry->n_acks = n_acks;
+			   ),
+
+	    TP_printk("c=%p %s f=%08x r=%08x n=%u",
+		      __entry->call,
+		      rxrpc_acks(__entry->reason),
+		      __entry->first,
+		      __entry->serial,
+		      __entry->n_acks)
+	    );
+
 #endif /* _TRACE_RXRPC_H */
 /* This part must be outside protection */
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index a43f4c94a88d..9b19c51831aa 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -98,6 +98,9 @@ static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
 		pkt.info.rwind = htonl(rxrpc_rx_window_size);
 		pkt.info.jumbo_max = htonl(rxrpc_rx_jumbo_max);
 		len += sizeof(pkt.ack) + sizeof(pkt.info);
+
+		trace_rxrpc_tx_ack(NULL, chan->last_seq, 0,
+				   RXRPC_ACK_DUPLICATE, 0);
 		break;
 	}
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 0b21ed859de7..2c9daeadce87 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -38,12 +38,14 @@ struct rxrpc_pkt_buffer {
 static size_t rxrpc_fill_out_ack(struct rxrpc_call *call,
 				 struct rxrpc_pkt_buffer *pkt)
 {
+	rxrpc_serial_t serial;
 	rxrpc_seq_t hard_ack, top, seq;
 	int ix;
 	u32 mtu, jmax;
 	u8 *ackp = pkt->acks;
 	/* Barrier against rxrpc_input_data(). */
+	serial = call->ackr_serial;
 	hard_ack = READ_ONCE(call->rx_hard_ack);
 	top = smp_load_acquire(&call->rx_top);
@@ -51,7 +53,7 @@ static size_t rxrpc_fill_out_ack(struct rxrpc_call *call,
 	pkt->ack.maxSkew = htons(call->ackr_skew);
 	pkt->ack.firstPacket = htonl(hard_ack + 1);
 	pkt->ack.previousPacket = htonl(call->ackr_prev_seq);
-	pkt->ack.serial = htonl(call->ackr_serial);
+	pkt->ack.serial = htonl(serial);
 	pkt->ack.reason = call->ackr_reason;
 	pkt->ack.nAcks = top - hard_ack;
@@ -75,6 +77,9 @@ static size_t rxrpc_fill_out_ack(struct rxrpc_call *call,
 	pkt->ackinfo.rwind = htonl(call->rx_winsize);
 	pkt->ackinfo.jumbo_max = htonl(jmax);
+	trace_rxrpc_tx_ack(call, hard_ack + 1, serial, call->ackr_reason,
+			   top - hard_ack);
+
 	*ackp++ = 0;
 	*ackp++ = 0;
 	*ackp++ = 0;
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
On Sat, 2016-09-17 at 16:17 -0700, Florian Fainelli wrote:
> 2016-09-17 15:51 GMT-07:00 Joe Perches:
[]
> > Without an actual maintainer, this should really be
> > orphan and not supported.
> I would like to hear from Michael before concluding that

No worries.

> > And the M: bcm-kernel-feedback-list@ should be L:
> The list does not accept public subscribers, so this is the correct
> entry to use.

Then M: is definitely _not_ the correct entry for this
and it should be:

L: bcm-kernel-feedback-l...@broadcom.com (subscribers-only)
[PATCH net-next 03/11] rxrpc: Add connection tracepoint and client conn state tracepoint
Add a pair of tracepoints, one to track rxrpc_connection struct ref counting and the other to track the client connection cache state. Signed-off-by: David Howells--- include/trace/events/rxrpc.h | 60 +++ net/rxrpc/ar-internal.h | 76 +-- net/rxrpc/call_accept.c |4 ++ net/rxrpc/call_object.c |2 - net/rxrpc/conn_client.c | 82 +- net/rxrpc/conn_event.c |2 + net/rxrpc/conn_object.c | 72 +++-- net/rxrpc/conn_service.c |4 ++ net/rxrpc/misc.c | 31 9 files changed, 274 insertions(+), 59 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 0a30c673509c..c0c496c83f31 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -16,6 +16,66 @@ #include +TRACE_EVENT(rxrpc_conn, + TP_PROTO(struct rxrpc_connection *conn, enum rxrpc_conn_trace op, +int usage, const void *where), + + TP_ARGS(conn, op, usage, where), + + TP_STRUCT__entry( + __field(struct rxrpc_connection *, conn) + __field(int,op ) + __field(int,usage ) + __field(const void *, where ) +), + + TP_fast_assign( + __entry->conn = conn; + __entry->op = op; + __entry->usage = usage; + __entry->where = where; + ), + + TP_printk("C=%p %s u=%d sp=%pSR", + __entry->conn, + rxrpc_conn_traces[__entry->op], + __entry->usage, + __entry->where) + ); + +TRACE_EVENT(rxrpc_client, + TP_PROTO(struct rxrpc_connection *conn, int channel, +enum rxrpc_client_trace op), + + TP_ARGS(conn, channel, op), + + TP_STRUCT__entry( + __field(struct rxrpc_connection *, conn) + __field(u32,cid ) + __field(int,channel ) + __field(int,usage ) + __field(enum rxrpc_client_trace,op ) + __field(enum rxrpc_conn_cache_state, cs ) +), + + TP_fast_assign( + __entry->conn = conn; + __entry->channel = channel; + __entry->usage = atomic_read(>usage); + __entry->op = op; + __entry->cid = conn->proto.cid; + __entry->cs = conn->cache_state; + ), + + TP_printk("C=%p h=%2d %s %s i=%08x u=%d", + __entry->conn, + __entry->channel, + rxrpc_client_traces[__entry->op], + rxrpc_conn_cache_states[__entry->cs], + 
__entry->cid, + __entry->usage) + ); + TRACE_EVENT(rxrpc_call, TP_PROTO(struct rxrpc_call *call, enum rxrpc_call_trace op, int usage, const void *where, const void *aux), diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 4a73c20d9436..6ca40eea3022 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -314,6 +314,7 @@ enum rxrpc_conn_cache_state { RXRPC_CONN_CLIENT_ACTIVE, /* Conn is on active list, doing calls */ RXRPC_CONN_CLIENT_CULLED, /* Conn is culled and delisted, doing calls */ RXRPC_CONN_CLIENT_IDLE, /* Conn is on idle list, doing mostly nothing */ + RXRPC_CONN__NR_CACHE_STATES }; /* @@ -533,6 +534,44 @@ struct rxrpc_call { rxrpc_serial_t acks_latest;/* serial number of latest ACK received */ }; +enum rxrpc_conn_trace { + rxrpc_conn_new_client, + rxrpc_conn_new_service, + rxrpc_conn_queued, + rxrpc_conn_seen, + rxrpc_conn_got, + rxrpc_conn_put_client, + rxrpc_conn_put_service, + rxrpc_conn__nr_trace +}; + +extern const char rxrpc_conn_traces[rxrpc_conn__nr_trace][4]; + +enum rxrpc_client_trace { + rxrpc_client_activate_chans, + rxrpc_client_alloc, + rxrpc_client_chan_activate, + rxrpc_client_chan_disconnect, + rxrpc_client_chan_pass, + rxrpc_client_chan_unstarted, + rxrpc_client_cleanup, + rxrpc_client_count, + rxrpc_client_discard, + rxrpc_client_duplicate, + rxrpc_client_exposed, + rxrpc_client_replace, + rxrpc_client_to_active, +
[PATCH net-next 10/14] rxrpc: Fix the parsing of soft-ACKs
The soft-ACK parser doesn't increment the pointer into the soft-ACK list, resulting in the first ACK/NACK value being applied to all the relevant packets in the Tx queue. This has the potential to miss retransmissions and cause excessive retransmissions. Fix this by incrementing the pointer. Signed-off-by: David Howells--- net/rxrpc/input.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c index f0d9115b9b7e..c1f83d22f9b7 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -384,7 +384,7 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks, for (; nr_acks > 0; nr_acks--, seq++) { ix = seq & RXRPC_RXTX_BUFF_MASK; - switch (*acks) { + switch (*acks++) { case RXRPC_ACK_TYPE_ACK: call->rxtx_annotations[ix] = RXRPC_TX_ANNO_ACK; break;
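The effect of the missing increment can be seen in a small standalone sketch. This is not the kernel code: apply_soft_acks(), the ACK_TYPE_* and ANNO_* values and the 8-slot buffer are hypothetical stand-ins chosen for illustration, and only the `*acks++` step mirrors the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the rxrpc constants; values are illustrative. */
enum { ACK_TYPE_NACK = 0, ACK_TYPE_ACK = 1 };
enum { ANNO_UNACK = 0, ANNO_ACK = 1, ANNO_NAK = 2 };
#define BUFF_SIZE 8u
#define BUFF_MASK (BUFF_SIZE - 1u)

/*
 * Walk nr_acks soft-ACK entries starting at sequence number seq and record
 * one annotation per packet slot.  The post-increment on acks is the fix:
 * without it, every slot would receive the value of the first entry.
 */
static void apply_soft_acks(uint8_t *annotations, const uint8_t *acks,
			    unsigned int nr_acks, uint32_t seq)
{
	assert(annotations && acks);

	for (; nr_acks > 0; nr_acks--, seq++) {
		unsigned int ix = seq & BUFF_MASK;	/* ring-buffer slot */

		switch (*acks++) {
		case ACK_TYPE_ACK:
			annotations[ix] = ANNO_ACK;
			break;
		default:
			annotations[ix] = ANNO_NAK;
			break;
		}
	}
}
```

With `switch (*acks)` instead, a list beginning with an ACK would mark every packet as acknowledged, so a lost packet would never be retransmitted; a list beginning with a NACK would mark every packet for retransmission.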
[PATCH net-next 07/11] rxrpc: Add a tracepoint to follow packets in the Rx buffer
Add a tracepoint to follow the life of packets that get added to a call's receive buffer. Signed-off-by: David Howells--- include/trace/events/rxrpc.h | 33 + net/rxrpc/ar-internal.h | 12 net/rxrpc/call_accept.c |3 +++ net/rxrpc/input.c|6 +- net/rxrpc/misc.c |9 + net/rxrpc/recvmsg.c | 11 +++ 6 files changed, 73 insertions(+), 1 deletion(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index d545d692ae22..7dd5f0188681 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -290,6 +290,39 @@ TRACE_EVENT(rxrpc_tx_ack, __entry->n_acks) ); +TRACE_EVENT(rxrpc_receive, + TP_PROTO(struct rxrpc_call *call, enum rxrpc_receive_trace why, +rxrpc_serial_t serial, rxrpc_seq_t seq), + + TP_ARGS(call, why, serial, seq), + + TP_STRUCT__entry( + __field(struct rxrpc_call *,call) + __field(enum rxrpc_receive_trace, why ) + __field(rxrpc_serial_t, serial ) + __field(rxrpc_seq_t,seq ) + __field(rxrpc_seq_t,hard_ack) + __field(rxrpc_seq_t,top ) +), + + TP_fast_assign( + __entry->call = call; + __entry->why = why; + __entry->serial = serial; + __entry->seq = seq; + __entry->hard_ack = call->rx_hard_ack; + __entry->top = call->rx_top; + ), + + TP_printk("c=%p %s r=%08x q=%08x w=%08x-%08x", + __entry->call, + rxrpc_receive_traces[__entry->why], + __entry->serial, + __entry->seq, + __entry->hard_ack, + __entry->top) + ); + #endif /* _TRACE_RXRPC_H */ /* This part must be outside protection */ diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index afa5dcc05fe0..e5d2f2fb8e41 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -605,6 +605,18 @@ enum rxrpc_transmit_trace { extern const char rxrpc_transmit_traces[rxrpc_transmit__nr_trace][4]; +enum rxrpc_receive_trace { + rxrpc_receive_incoming, + rxrpc_receive_queue, + rxrpc_receive_queue_last, + rxrpc_receive_front, + rxrpc_receive_rotate, + rxrpc_receive_end, + rxrpc_receive__nr_trace +}; + +extern const char rxrpc_receive_traces[rxrpc_receive__nr_trace][4]; + 
extern const char *const rxrpc_pkts[]; extern const char *rxrpc_acks(u8 reason); diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c index 3e474508ba75..a8d39d7cf42c 100644 --- a/net/rxrpc/call_accept.c +++ b/net/rxrpc/call_accept.c @@ -367,6 +367,9 @@ found_service: goto out; } + trace_rxrpc_receive(call, rxrpc_receive_incoming, + sp->hdr.serial, sp->hdr.seq); + /* Make the call live. */ rxrpc_incoming_call(rx, call, skb); conn = call->conn; diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c index 7b18ca124978..b690220533c6 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -284,8 +284,12 @@ next_subpacket: call->rxtx_buffer[ix] = skb; if (after(seq, call->rx_top)) smp_store_release(>rx_top, seq); - if (flags & RXRPC_LAST_PACKET) + if (flags & RXRPC_LAST_PACKET) { set_bit(RXRPC_CALL_RX_LAST, >flags); + trace_rxrpc_receive(call, rxrpc_receive_queue_last, serial, seq); + } else { + trace_rxrpc_receive(call, rxrpc_receive_queue, serial, seq); + } queued = true; if (after_eq(seq, call->rx_expect_next)) { diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c index dca89995f03e..db5f1d54fc90 100644 --- a/net/rxrpc/misc.c +++ b/net/rxrpc/misc.c @@ -141,3 +141,12 @@ const char rxrpc_transmit_traces[rxrpc_transmit__nr_trace][4] = { [rxrpc_transmit_rotate] = "ROT", [rxrpc_transmit_end]= "END", }; + +const char rxrpc_receive_traces[rxrpc_receive__nr_trace][4] = { + [rxrpc_receive_incoming]= "INC", + [rxrpc_receive_queue] = "QUE", + [rxrpc_receive_queue_last] = "QLS", + [rxrpc_receive_front] = "FRN", + [rxrpc_receive_rotate] = "ROT", + [rxrpc_receive_end] = "END", +}; diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c index 8b8d7e14f800..22d51087c580 100644 --- a/net/rxrpc/recvmsg.c +++ b/net/rxrpc/recvmsg.c @@ -134,6 +134,7 @@ static void rxrpc_end_rx_phase(struct rxrpc_call
[PATCH net-next 05/14] rxrpc: Record calls that need to be accepted
Record calls that need to be accepted using sk_acceptq_added() otherwise the backlog counter goes negative because sk_acceptq_removed() is called. This causes the preallocator to malfunction. Calls that are preaccepted by AFS within the kernel aren't affected by this. Signed-off-by: David Howells--- net/rxrpc/call_accept.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c index 26c293ef98eb..323b8da50163 100644 --- a/net/rxrpc/call_accept.c +++ b/net/rxrpc/call_accept.c @@ -369,6 +369,8 @@ found_service: if (rx->notify_new_call) rx->notify_new_call(>sk, call, call->user_call_ID); + else + sk_acceptq_added(>sk); spin_lock(>state_lock); switch (conn->state) {
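The accounting problem described above can be sketched in isolation. Everything here is a hypothetical miniature (struct mini_sock and the helper names are invented for illustration); it only shows why every decrement on accept needs a matching increment on arrival, and why calls preaccepted in the kernel must skip both.

```c
#include <assert.h>

/* Hypothetical miniature of a listening socket's accept backlog counter. */
struct mini_sock {
	int ack_backlog;	/* calls queued and awaiting acceptance */
};

static void acceptq_added(struct mini_sock *sk)
{
	sk->ack_backlog++;
}

static void acceptq_removed(struct mini_sock *sk)
{
	sk->ack_backlog--;
}

/* An incoming call arrives; only calls that userspace must accept are counted. */
static void incoming_call(struct mini_sock *sk, int preaccepted_by_kernel)
{
	if (!preaccepted_by_kernel)
		acceptq_added(sk);
}

/* Userspace accepts a call; the counter must have been raised beforehand. */
static void accept_call(struct mini_sock *sk)
{
	assert(sk->ack_backlog > 0);	/* would fire if the add had been skipped */
	acceptq_removed(sk);
}
```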
[PATCH net-next 02/14] rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller
Move the check of rx_pkt_offset from rxrpc_locate_data() to the caller, rxrpc_recvmsg_data(), so that it's more clear what's going on there. Signed-off-by: David Howells--- net/rxrpc/recvmsg.c |9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c index a284205b8ecf..0d085f5cf1bf 100644 --- a/net/rxrpc/recvmsg.c +++ b/net/rxrpc/recvmsg.c @@ -240,9 +240,6 @@ static int rxrpc_locate_data(struct rxrpc_call *call, struct sk_buff *skb, int ret; u8 annotation = *_annotation; - if (offset > 0) - return 0; - /* Locate the subpacket */ offset = sp->offset; len = skb->len - sp->offset; @@ -303,8 +300,10 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call, if (msg) sock_recv_timestamp(msg, sock->sk, skb); - ret = rxrpc_locate_data(call, skb, >rxtx_annotations[ix], - _pkt_offset, _pkt_len); + if (rx_pkt_offset == 0) + ret = rxrpc_locate_data(call, skb, + >rxtx_annotations[ix], + _pkt_offset, _pkt_len); _debug("recvmsg %x DATA #%u { %d, %d }", sp->hdr.callNumber, seq, rx_pkt_offset, rx_pkt_len);
[PATCH net-next 01/11] rxrpc: Print the packet type name in the Rx packet trace
Print a symbolic packet type name for each valid received packet in the trace output, not just a number. Signed-off-by: David Howells--- include/trace/events/rxrpc.h |5 +++-- net/rxrpc/ar-internal.h |6 +++--- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index ea3b10ed91a8..0a30c673509c 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -93,11 +93,12 @@ TRACE_EVENT(rxrpc_rx_packet, memcpy(&__entry->hdr, >hdr, sizeof(__entry->hdr)); ), - TP_printk("%08x:%08x:%08x:%04x %08x %08x %02x %02x", + TP_printk("%08x:%08x:%08x:%04x %08x %08x %02x %02x %s", __entry->hdr.epoch, __entry->hdr.cid, __entry->hdr.callNumber, __entry->hdr.serviceId, __entry->hdr.serial, __entry->hdr.seq, - __entry->hdr.type, __entry->hdr.flags) + __entry->hdr.type, __entry->hdr.flags, + __entry->hdr.type <= 15 ? rxrpc_pkts[__entry->hdr.type] : "?UNK") ); TRACE_EVENT(rxrpc_rx_done, diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index e78c40b37db5..0f6fafa2c271 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -551,6 +551,9 @@ enum rxrpc_call_trace { extern const char rxrpc_call_traces[rxrpc_call__nr_trace][4]; +extern const char *const rxrpc_pkts[]; +extern const char *rxrpc_acks(u8 reason); + #include /* @@ -851,11 +854,8 @@ extern unsigned int rxrpc_rx_mtu; extern unsigned int rxrpc_rx_jumbo_max; extern unsigned int rxrpc_resend_timeout; -extern const char *const rxrpc_pkts[]; extern const s8 rxrpc_ack_priority[]; -extern const char *rxrpc_acks(u8 reason); - /* * output.c */
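The bounded table lookup used in the TP_printk() above follows a common pattern, sketched here outside the kernel. The table contents and the names pkt_names and pkt_type_name() are assumptions made for illustration; the real rxrpc_pkts[] entries are not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative table only; not the contents of the kernel's rxrpc_pkts[]. */
static const char *const pkt_names[] = { "?00", "DATA", "ACK", "BUSY" };
#define NR_PKT_NAMES (sizeof(pkt_names) / sizeof(pkt_names[0]))

/*
 * Map a wire packet type to a printable name.  The type comes straight off
 * the network, so it is range-checked before being used as an index.
 */
static const char *pkt_type_name(uint8_t type)
{
	const char *name = type < NR_PKT_NAMES ? pkt_names[type] : "?UNK";

	assert(name != NULL);
	return name;
}
```

Deriving the bound from the table size, rather than hard-coding it as the patch does with 15, keeps the check correct if entries are added later; whether that is preferable in a tracepoint is a judgment call.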
[PATCH net-next 11/14] rxrpc: Fix retransmission algorithm
Make the retransmission algorithm use for-loops instead of do-loops and move the counter increments into the for-statement increment slots. Though the do-loops are slightly more efficient since there will be at least one pass through each loop, the counter increments are harder to get right as the continue-statements skip them. Without this, if there are any positive acks within the loop, the do-loop will cycle forever because the counter increment is never done. Signed-off-by: David Howells--- net/rxrpc/call_event.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c index 9367c3be31eb..f0cabc48a1b7 100644 --- a/net/rxrpc/call_event.c +++ b/net/rxrpc/call_event.c @@ -163,8 +163,7 @@ static void rxrpc_resend(struct rxrpc_call *call) */ now = jiffies; resend_at = now + rxrpc_resend_timeout; - seq = cursor + 1; - do { + for (seq = cursor + 1; before_eq(seq, top); seq++) { ix = seq & RXRPC_RXTX_BUFF_MASK; annotation = call->rxtx_annotations[ix]; if (annotation == RXRPC_TX_ANNO_ACK) @@ -184,8 +183,7 @@ static void rxrpc_resend(struct rxrpc_call *call) /* Okay, we need to retransmit a packet. */ call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS; - seq++; - } while (before_eq(seq, top)); + } call->resend_at = resend_at; @@ -194,8 +192,7 @@ static void rxrpc_resend(struct rxrpc_call *call) * lock is dropped, it may clear some of the retransmission markers for * packets that it soft-ACKs. */ - seq = cursor + 1; - do { + for (seq = cursor + 1; before_eq(seq, top); seq++) { ix = seq & RXRPC_RXTX_BUFF_MASK; annotation = call->rxtx_annotations[ix]; if (annotation != RXRPC_TX_ANNO_RETRANS) @@ -237,8 +234,7 @@ static void rxrpc_resend(struct rxrpc_call *call) if (after(call->tx_hard_ack, seq)) seq = call->tx_hard_ack; - seq++; - } while (before_eq(seq, top)); + } out_unlock: spin_unlock_bh(>lock);
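The difference between the two loop shapes can be reproduced in a few lines of plain C. This is a sketch under stated assumptions, not the rxrpc code: the function names are hypothetical, plain `<=` stands in for the wraparound-safe before_eq(), and max_iter exists only so that the broken form terminates when run.

```c
#include <assert.h>

enum { ANNO_UNACK, ANNO_ACK };

/* for-loop form: continue still runs the seq++ in the increment slot. */
static unsigned int count_unacked_for(const int *anno, unsigned int first,
				      unsigned int top)
{
	unsigned int seq, n = 0;

	for (seq = first; seq <= top; seq++) {
		if (anno[seq] == ANNO_ACK)
			continue;	/* skips the body, not the increment */
		n++;
	}
	return n;
}

/*
 * do-loop form with the increment at the bottom of the body: continue jumps
 * straight to the loop condition and skips seq++.  Returns the number of
 * iterations executed, capped at max_iter so the demonstration terminates.
 */
static unsigned int iterations_do(const int *anno, unsigned int first,
				  unsigned int top, unsigned int max_iter)
{
	unsigned int seq = first, iters = 0;

	assert(first <= top);
	do {
		if (++iters >= max_iter)
			break;		/* safety cap for the sketch only */
		if (anno[seq] == ANNO_ACK)
			continue;	/* seq is not advanced: same slot again */
		seq++;
	} while (seq <= top);
	return iters;
}
```

With no acknowledged slot in range, iterations_do() finishes after one pass per slot. As soon as one slot holds ANNO_ACK, it spins on that slot until the cap is reached, which is the hang the commit message describes.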
[PATCH net-next 08/14] rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()
Call rxrpc_release_call() on getting an error in rxrpc_new_client_call() rather than trying to do the cleanup ourselves. This isn't a problem, provided we set RXRPC_CALL_HAS_USERID only if we actually add the call to the calls tree as cleanup code fragments that would otherwise cause problems are conditional. Without this, we miss some of the cleanup. Signed-off-by: David Howells--- net/rxrpc/call_object.c | 36 1 file changed, 12 insertions(+), 24 deletions(-) diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c index b0ffbd9664e6..23f5a5f58282 100644 --- a/net/rxrpc/call_object.c +++ b/net/rxrpc/call_object.c @@ -226,9 +226,6 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, (const void *)user_call_ID); /* Publish the call, even though it is incompletely set up as yet */ - call->user_call_ID = user_call_ID; - __set_bit(RXRPC_CALL_HAS_USERID, >flags); - write_lock(>call_lock); pp = >calls.rb_node; @@ -242,10 +239,12 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, else if (user_call_ID > xcall->user_call_ID) pp = &(*pp)->rb_right; else - goto found_user_ID_now_present; + goto error_dup_user_ID; } rcu_assign_pointer(call->socket, rx); + call->user_call_ID = user_call_ID; + __set_bit(RXRPC_CALL_HAS_USERID, >flags); rxrpc_get_call(call, rxrpc_call_got_userid); rb_link_node(>sock_node, parent, pp); rb_insert_color(>sock_node, >calls); @@ -276,33 +275,22 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, _leave(" = %p [new]", call); return call; -error: - write_lock(>call_lock); - rb_erase(>sock_node, >calls); - write_unlock(>call_lock); - rxrpc_put_call(call, rxrpc_call_put_userid); - - write_lock(_call_lock); - list_del_init(>link); - write_unlock(_call_lock); - -error_out: - __rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR, - RX_CALL_DEAD, ret); - set_bit(RXRPC_CALL_RELEASED, >flags); - rxrpc_put_call(call, rxrpc_call_put); - _leave(" = %d", ret); - return ERR_PTR(ret); - /* We unexpectedly found 
the user ID in the list after taking * the call_lock. This shouldn't happen unless the user races * with itself and tries to add the same user ID twice at the * same time in different threads. */ -found_user_ID_now_present: +error_dup_user_ID: write_unlock(>call_lock); ret = -EEXIST; - goto error_out; + +error: + __rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR, + RX_CALL_DEAD, ret); + rxrpc_release_call(rx, call); + rxrpc_put_call(call, rxrpc_call_put); + _leave(" = %d", ret); + return ERR_PTR(ret); } /*
[PATCH net-next 06/14] rxrpc: Purge the to_be_accepted queue on socket release
Purge the queue of to_be_accepted calls on socket release. Note that purging sock_calls doesn't release the ref owned by to_be_accepted. Probably the sock_calls list is redundant given the purges of the recvmsg_q, the to_be_accepted queue and the calls tree. Signed-off-by: David Howells--- net/rxrpc/call_object.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c index 22f9b0d1a138..b0ffbd9664e6 100644 --- a/net/rxrpc/call_object.c +++ b/net/rxrpc/call_object.c @@ -476,6 +476,16 @@ void rxrpc_release_calls_on_socket(struct rxrpc_sock *rx) _enter("%p", rx); + while (!list_empty(>to_be_accepted)) { + call = list_entry(rx->to_be_accepted.next, + struct rxrpc_call, accept_link); + list_del(>accept_link); + rxrpc_abort_call("SKR", call, 0, RX_CALL_DEAD, ECONNRESET); + rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ABORT); + rxrpc_release_call(rx, call); + rxrpc_put_call(call, rxrpc_call_put); + } + while (!list_empty(>sock_calls)) { call = list_entry(rx->sock_calls.next, struct rxrpc_call, sock_link);
[PATCH net-next 04/14] rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data()
The code for determining the last packet in rxrpc_recvmsg_data() has been using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points to the last packet or not. This isn't a good idea, however, as the input code may be running simultaneously on another CPU and that sets the flag *before* updating the top pointer.

Fix this by the following means:

 (1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only. There's otherwise a synchronisation problem between detecting the flag and checking tx_top. This could probably be dealt with by appropriate application of memory barriers, but there's a simpler way.

 (2) Set RXRPC_CALL_RX_LAST after setting rx_top.

 (3) Make rxrpc_rotate_rx_window() consult the flags header field of the DATA packet it's about to discard to see if that was the last packet. Use this as the basis for ending the Rx phase. This shouldn't be a problem because the recvmsg side of things is guaranteed to see the packets in order.

 (4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if:

     (a) the packet it has just processed is marked as RXRPC_LAST_PACKET

     (b) the call's Rx phase has been ended.

Signed-off-by: David Howells--- net/rxrpc/input.c |4 +++- net/rxrpc/recvmsg.c | 49 + 2 files changed, 36 insertions(+), 17 deletions(-) diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c index 75af0bd316c7..f0d9115b9b7e 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -238,7 +238,7 @@ next_subpacket: len = RXRPC_JUMBO_DATALEN; if (flags & RXRPC_LAST_PACKET) { - if (test_and_set_bit(RXRPC_CALL_RX_LAST, >flags) && + if (test_bit(RXRPC_CALL_RX_LAST, >flags) && seq != call->rx_top) return rxrpc_proto_abort("LSN", call, seq); } else { @@ -282,6 +282,8 @@ next_subpacket: call->rxtx_buffer[ix] = skb; if (after(seq, call->rx_top)) smp_store_release(>rx_top, seq); + if (flags & RXRPC_LAST_PACKET) + set_bit(RXRPC_CALL_RX_LAST, >flags); queued = true; if (after_eq(seq, call->rx_expect_next)) { diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c index 1edf2cf62cc5..8b8d7e14f800 100644 --- a/net/rxrpc/recvmsg.c +++ b/net/rxrpc/recvmsg.c @@ -134,6 +134,8 @@ static void rxrpc_end_rx_phase(struct rxrpc_call *call) { _enter("%d,%s", call->debug_id, rxrpc_call_states[call->state]); + ASSERTCMP(call->rx_hard_ack, ==, call->rx_top); + if (call->state == RXRPC_CALL_CLIENT_RECV_REPLY) { rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, 0, true, false); rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK); @@ -163,8 +165,10 @@ static void rxrpc_end_rx_phase(struct rxrpc_call *call) */ static void rxrpc_rotate_rx_window(struct rxrpc_call *call) { + struct rxrpc_skb_priv *sp; struct sk_buff *skb; rxrpc_seq_t hard_ack, top; + u8 flags; int ix; _enter("%d", call->debug_id); @@ -177,6 +181,8 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call) ix = hard_ack & RXRPC_RXTX_BUFF_MASK; skb = call->rxtx_buffer[ix]; rxrpc_see_skb(skb); + sp = rxrpc_skb(skb); + flags = sp->hdr.flags; call->rxtx_buffer[ix] = NULL; call->rxtx_annotations[ix] = 0; /* Barrier against rxrpc_input_data(). 
*/ @@ -184,8 +190,8 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call) rxrpc_free_skb(skb); - _debug("%u,%u,%lx", hard_ack, top, call->flags); - if (hard_ack == top && test_bit(RXRPC_CALL_RX_LAST, >flags)) + _debug("%u,%u,%02x", hard_ack, top, flags); + if (flags & RXRPC_LAST_PACKET) rxrpc_end_rx_phase(call); } @@ -278,13 +284,19 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call, size_t remain; bool last; unsigned int rx_pkt_offset, rx_pkt_len; - int ix, copy, ret = 0; + int ix, copy, ret = -EAGAIN, ret2; _enter(""); rx_pkt_offset = call->rx_pkt_offset; rx_pkt_len = call->rx_pkt_len; + if (call->state >= RXRPC_CALL_SERVER_ACK_REQUEST) { + seq = call->rx_hard_ack; + ret = 1; + goto done; + } + /* Barriers against rxrpc_input_data(). */ hard_ack = call->rx_hard_ack; top = smp_load_acquire(>rx_top); @@ -301,11 +313,13 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call, sock_recv_timestamp(msg, sock->sk, skb); if (rx_pkt_offset == 0) { - ret = rxrpc_locate_data(call, skb, - >rxtx_annotations[ix], - _pkt_offset,
[PATCH net-next 12/14] rxrpc: Don't transmit an ACK if there's no reason set
Don't transmit an ACK if call->ackr_reason is unset. There's the possibility of a race between recvmsg() sending an ACK and the background processing thread trying to send the same one. Signed-off-by: David Howells--- net/rxrpc/output.c |5 + 1 file changed, 5 insertions(+) diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c index 06a9aca739d1..aa0507214b31 100644 --- a/net/rxrpc/output.c +++ b/net/rxrpc/output.c @@ -137,6 +137,11 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type) switch (type) { case RXRPC_PACKET_TYPE_ACK: spin_lock_bh(>lock); + if (!call->ackr_reason) { + spin_unlock_bh(>lock); + ret = 0; + goto out; + } n = rxrpc_fill_out_ack(call, pkt); call->ackr_reason = 0;
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
2016-09-17 15:51 GMT-07:00 Joe Perches: > On Sat, 2016-09-17 at 15:27 -0700, Florian Fainelli wrote: >> Gary has not been with Broadcom for some time now, replace his address >> with the internal mailing-list used for other entries. >> >> > Signed-off-by: Florian Fainelli >> --- >> Michael, >> >> Since this is an old driver, not sure who could step up as a maintainer >> for b44? > [] >> diff --git a/MAINTAINERS b/MAINTAINERS > [] >> @@ -2500,8 +2500,8 @@ S: Supported > >> F:kernel/bpf/ >> >> BROADCOM B44 10/100 ETHERNET DRIVER >> -M: Gary Zambrano >> L: netdev@vger.kernel.org >> +M: bcm-kernel-feedback-l...@broadcom.com >> S: Supported >> F: drivers/net/ethernet/broadcom/b44.* > > Without an actual maintainer, this should really be > orphan and not supported. I would like to hear from Michael before concluding that > > And the M: bcm-kernel-feedback-list@ should be L: The list does not accept public subscribers, so this is the correct entry to use. > > BCM4401 NICs are essentially from 2002. > > Does anyone really use these any longer with a > current distribution or kernel version? This NIC is also embedded inside BCM47xx/BCM53xx which is still getting active support from Rafal and Hauke. -- Florian
[PATCH net-next 14/14] rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes
Fix the basic transmit DATA packet content size at 1412 bytes so that they can be arbitrarily assembled into jumbo packets. In the future, I'm thinking of moving to keeping a jumbo packet header at the beginning of each packet in the Tx queue and creating the packet header on the spot when kernel_sendmsg() is invoked. That way, jumbo packets can be assembled on the spur of the moment for (re-)transmission. Signed-off-by: David Howells--- net/rxrpc/sendmsg.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c index cba236575073..8bfddf4e338c 100644 --- a/net/rxrpc/sendmsg.c +++ b/net/rxrpc/sendmsg.c @@ -214,7 +214,7 @@ static int rxrpc_send_data(struct rxrpc_sock *rx, goto maybe_error; } - max = call->conn->params.peer->maxdata; + max = RXRPC_JUMBO_DATALEN; max -= call->conn->security_size; max &= ~(call->conn->size_align - 1UL);
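The size calculation in the hunk above can be worked through in isolation. In this sketch only the 1412-byte figure comes from the patch subject; max_payload() and the example security_size and size_align values are hypothetical, and the kernel reads the real ones from the connection.

```c
#include <assert.h>

#define JUMBO_DATALEN 1412UL	/* fixed per-packet content size from the patch subject */

/*
 * Usable payload per DATA packet: start from the fixed size, subtract the
 * security overhead, then round down to a multiple of size_align.
 * size_align must be a power of two for the mask trick to work.
 */
static unsigned long max_payload(unsigned long security_size,
				 unsigned long size_align)
{
	unsigned long max = JUMBO_DATALEN;

	assert(size_align != 0 && (size_align & (size_align - 1)) == 0);
	assert(security_size <= max);

	max -= security_size;
	max &= ~(size_align - 1UL);	/* clear the low bits: round down */
	return max;
}
```

For example, with no security overhead and 8-byte alignment the result is 1408, since 1412 is not a multiple of 8; with a 12-byte overhead it is 1400, which is already aligned.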
[PATCH net-next 09/14] rxrpc: Fix unexposed client conn release
If the last call on a client connection is released after the connection has had a bunch of calls allocated but before any DATA packets are sent (so that it's not yet marked RXRPC_CONN_EXPOSED), an assertion will happen in rxrpc_disconnect_client_call(). af_rxrpc: Assertion failed - 1(0x1) >= 2(0x2) is false [ cut here ] kernel BUG at ../net/rxrpc/conn_client.c:753! This is because it's expecting the conn to have been exposed and to have 2 or more refs - but this isn't necessarily the case. Simply remove the assertion. This allows the conn to be moved into the inactive state and deleted if it isn't resurrected before the final put is called. Signed-off-by: David Howells--- net/rxrpc/conn_client.c |1 - 1 file changed, 1 deletion(-) diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c index 5a675c43cace..226bc910e556 100644 --- a/net/rxrpc/conn_client.c +++ b/net/rxrpc/conn_client.c @@ -721,7 +721,6 @@ void rxrpc_disconnect_client_call(struct rxrpc_call *call) } ASSERTCMP(rcu_access_pointer(chan->call), ==, call); - ASSERTCMP(atomic_read(>usage), >=, 2); /* If a client call was exposed to the world, we save the result for * retransmission.
[PATCH net-next 13/14] rxrpc: Be consistent about switch value in rxrpc_send_call_packet()
rxrpc_send_call_packet() should use type in both its switch-statements rather than using pkt->whdr.type. This might give the compiler an easier job of uninitialised variable checking. Signed-off-by: David Howells--- net/rxrpc/output.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c index aa0507214b31..0b21ed859de7 100644 --- a/net/rxrpc/output.c +++ b/net/rxrpc/output.c @@ -182,7 +182,7 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type) , iov, ioc, len); if (ret < 0 && call->state < RXRPC_CALL_COMPLETE) { - switch (pkt->whdr.type) { + switch (type) { case RXRPC_PACKET_TYPE_ACK: rxrpc_propose_ACK(call, pkt->ack.reason, ntohs(pkt->ack.maxSkew),
[PATCH net-next 07/14] rxrpc: Fix the putting of client connections
In rxrpc_put_one_client_conn(), if a connection has RXRPC_CONN_COUNTED set on it, then it's accounted for in rxrpc_nr_client_conns and may be on various lists - and this is cleaned up correctly. However, if the connection doesn't have RXRPC_CONN_COUNTED set on it, then the put routine returns rather than just skipping the extra bit of cleanup. Fix this by making the extra bit of clean up conditional instead and always killing off the connection. This manifests itself as connections with a zero usage count hanging around in /proc/net/rxrpc_conns because the connection allocated, but discarded, due to a race with another process that set up a parallel connection, which was then shared instead. Signed-off-by: David Howells--- net/rxrpc/conn_client.c | 28 +--- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c index 9344a8416ceb..5a675c43cace 100644 --- a/net/rxrpc/conn_client.c +++ b/net/rxrpc/conn_client.c @@ -818,7 +818,7 @@ idle_connection: static struct rxrpc_connection * rxrpc_put_one_client_conn(struct rxrpc_connection *conn) { - struct rxrpc_connection *next; + struct rxrpc_connection *next = NULL; struct rxrpc_local *local = conn->params.local; unsigned int nr_conns; @@ -834,24 +834,22 @@ rxrpc_put_one_client_conn(struct rxrpc_connection *conn) ASSERTCMP(conn->cache_state, ==, RXRPC_CONN_CLIENT_INACTIVE); - if (!test_bit(RXRPC_CONN_COUNTED, >flags)) - return NULL; - - spin_lock(_client_conn_cache_lock); - nr_conns = --rxrpc_nr_client_conns; + if (test_bit(RXRPC_CONN_COUNTED, >flags)) { + spin_lock(_client_conn_cache_lock); + nr_conns = --rxrpc_nr_client_conns; + + if (nr_conns < rxrpc_max_client_connections && + !list_empty(_waiting_client_conns)) { + next = list_entry(rxrpc_waiting_client_conns.next, + struct rxrpc_connection, cache_link); + rxrpc_get_connection(next); + rxrpc_activate_conn(next); + } - next = NULL; - if (nr_conns < rxrpc_max_client_connections && - 
!list_empty(_waiting_client_conns)) { - next = list_entry(rxrpc_waiting_client_conns.next, - struct rxrpc_connection, cache_link); - rxrpc_get_connection(next); - rxrpc_activate_conn(next); + spin_unlock(_client_conn_cache_lock); } - spin_unlock(_client_conn_cache_lock); rxrpc_kill_connection(conn); - if (next) rxrpc_activate_channels(next);
[PATCH net-next 00/14] rxrpc: Fixes & miscellany
Here are some more AF_RXRPC fix patches with a couple of miscellaneous changes also. Fixes include:

 (1) Make RxRPC IPv6 support conditional on IPv6 being available.

 (2) Move the condition check in rxrpc_locate_data() into the caller and check the error return.

 (3) Fix the detection of the last received packet in recvmsg.

 (4) Account calls that need acceptance and clean up any unaccepted ones if the socket gets closed.

 (5) Fix the cleanup of client connections.

 (6) Fix the soft-ACK parsing and the retransmission of packets based on those ACKs.

 (7) Suppress transmission of an ACK when there's no pending ACK to transmit because another thread stole it.

And some miscellany:

 (8) Whitespace removal.

 (9) Switch-value consistency in rxrpc_send_call_packet().

 (10) Fix the basic transmission packet size to allow for spur-of-the-moment jumbo DATA packet production.

The patches can be found here also (non-terminally on the branch):

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160917-1

David
---
David Howells (14):
      rxrpc: Remove some whitespace.
      rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller
      rxrpc: Check the return value of rxrpc_locate_data()
      rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data()
      rxrpc: Record calls that need to be accepted
      rxrpc: Purge the to_be_accepted queue on socket release
      rxrpc: Fix the putting of client connections
      rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()
      rxrpc: Fix unexposed client conn release
      rxrpc: Fix the parsing of soft-ACKs
      rxrpc: Fix retransmission algorithm
      rxrpc: Don't transmit an ACK if there's no reason set
      rxrpc: Be consistent about switch value in rxrpc_send_call_packet()
      rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes

 net/rxrpc/call_accept.c |    2 ++
 net/rxrpc/call_event.c  |   14
 net/rxrpc/call_object.c |   46 -
 net/rxrpc/conn_client.c |   29 --
 net/rxrpc/input.c       |    6 -
 net/rxrpc/output.c      |    7 +-
 net/rxrpc/recvmsg.c     |   53 ---
 net/rxrpc/sendmsg.c     |    2 +-
 8 files changed, 89 insertions(+), 70 deletions(-)
[PATCH net-next 01/14] rxrpc: Remove some whitespace.
Remove a tab that's on a line that should otherwise be blank. Signed-off-by: David Howells--- net/rxrpc/call_event.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c index 61432049869b..9367c3be31eb 100644 --- a/net/rxrpc/call_event.c +++ b/net/rxrpc/call_event.c @@ -31,7 +31,7 @@ static void rxrpc_set_timer(struct rxrpc_call *call) _enter("{%ld,%ld,%ld:%ld}", call->ack_at - now, call->resend_at - now, call->expire_at - now, call->timer.expires - now); - + read_lock_bh(>state_lock); if (call->state < RXRPC_CALL_COMPLETE) {
[PATCH net-next 03/14] rxrpc: Check the return value of rxrpc_locate_data()
Check the return value of rxrpc_locate_data() in rxrpc_recvmsg_data(). Signed-off-by: David Howells--- net/rxrpc/recvmsg.c |5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c index 0d085f5cf1bf..1edf2cf62cc5 100644 --- a/net/rxrpc/recvmsg.c +++ b/net/rxrpc/recvmsg.c @@ -300,10 +300,13 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call, if (msg) sock_recv_timestamp(msg, sock->sk, skb); - if (rx_pkt_offset == 0) + if (rx_pkt_offset == 0) { ret = rxrpc_locate_data(call, skb, >rxtx_annotations[ix], _pkt_offset, _pkt_len); + if (ret < 0) + goto out; + } _debug("recvmsg %x DATA #%u { %d, %d }", sp->hdr.callNumber, seq, rx_pkt_offset, rx_pkt_len);
[PATCH net-next 1/5] pie: use qdisc_dequeue_head wrapper
Doesn't change generated code. Signed-off-by: Florian Westphal--- net/sched/sch_pie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index a570b0b..d976d74 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -511,7 +511,7 @@ static int pie_dump_stats(struct Qdisc *sch, struct gnet_dump *d) static struct sk_buff *pie_qdisc_dequeue(struct Qdisc *sch) { struct sk_buff *skb; - skb = __qdisc_dequeue_head(sch, >q); + skb = qdisc_dequeue_head(sch); if (!skb) return NULL; -- 2.7.3
[PATCH net-next 0/5] sched: convert queues to single-linked list
During Netfilter Workshop 2016 Eric Dumazet pointed out that qdisc schedulers use doubly-linked lists, even though single-linked list would be enough.

The double-linked skb lists incur one extra write on enqueue/dequeue operations (to change ->prev pointer of next list elem).

This series converts qdiscs to single-linked version, listhead maintains pointers to first (for dequeue) and last skb (for enqueue).

Most qdiscs don't queue at all and instead use a leaf qdisc (typically pfifo_fast) so only a few schedulers needed changes.

I briefly tested netem and htb and they seemed fine.

UDP_STREAM netperf with 64 byte packets via veth+pfifo_fast shows a small (~2%) improvement.

Florian Westphal (5):
      pie: use qdisc_dequeue_head wrapper
      sched: don't use skb queue helpers
      sched: remove qdisc arg from __qdisc_dequeue_head
      sched: replace __skb_dequeue with __qdisc_dequeue_head
      sched: add and use qdisc_skb_head helpers

 include/net/sch_generic.h | 72 +++---
 net/sched/sch_codel.c     |  4 +-
 net/sched/sch_fifo.c      |  4 +-
 net/sched/sch_generic.c   | 30 +++
 net/sched/sch_htb.c       | 24 ---
 net/sched/sch_netem.c     | 20 +---
 net/sched/sch_pie.c       |  4 +-
 7 files changed, 115 insertions(+), 43 deletions(-)
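The list shape the cover letter describes, a singly-linked FIFO whose head structure tracks both the first and the last element, can be sketched as follows. The names struct pkt, struct pkt_queue, pkt_enqueue_tail() and pkt_dequeue_head() are hypothetical and are not the helpers this series adds; the sketch only illustrates why no ->prev write is needed.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;	/* the only link; there is no prev pointer */
	int id;
};

struct pkt_queue {
	struct pkt *head;	/* first element, used by dequeue */
	struct pkt *tail;	/* last element, used by enqueue */
	unsigned int qlen;
};

/* Append at the tail: one write to the old tail's next, one to q->tail. */
static void pkt_enqueue_tail(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;	/* queue was empty */
	q->tail = p;
	q->qlen++;
}

/* Remove from the head: no neighbour needs its back-pointer patched. */
static struct pkt *pkt_dequeue_head(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;	/* queue became empty */
		p->next = NULL;
		q->qlen--;
		assert(q->qlen > 0 || q->head == NULL);
	}
	return p;
}
```

The trade-off is that removal from the middle or the tail is no longer O(1), which is acceptable for a queue that is only ever appended at one end and drained from the other.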
[PATCH net-next 4/5] sched: replace __skb_dequeue with __qdisc_dequeue_head
After the previous patch these functions are identical.

Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

The next patch will then make __qdisc_dequeue_head handle a single-linked
list instead of a struct sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal
---
 net/sched/sch_codel.c | 4 ++--
 net/sched/sch_netem.c | 2 +-
 net/sched/sch_pie.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index 4002df3..5bfa79e 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -69,7 +69,7 @@ struct codel_sched_data {
 static struct sk_buff *dequeue_func(struct codel_vars *vars, void *ctx)
 {
 	struct Qdisc *sch = ctx;
-	struct sk_buff *skb = __skb_dequeue(&sch->q);
+	struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
 
 	if (skb)
 		sch->qstats.backlog -= qdisc_pkt_len(skb);
@@ -172,7 +172,7 @@ static int codel_change(struct Qdisc *sch, struct nlattr *opt)
 
 	qlen = sch->q.qlen;
 	while (sch->q.qlen > sch->limit) {
-		struct sk_buff *skb = __skb_dequeue(&sch->q);
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
 
 		dropped += qdisc_pkt_len(skb);
 		qdisc_qstats_backlog_dec(sch, skb);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 1832d77..0a964b3 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -587,7 +587,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 	struct rb_node *p;
 
 tfifo_dequeue:
-	skb = __skb_dequeue(&sch->q);
+	skb = __qdisc_dequeue_head(&sch->q);
 	if (skb) {
 		qdisc_qstats_backlog_dec(sch, skb);
 deliver:
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index d976d74..5c3a99d 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -231,7 +231,7 @@ static int pie_change(struct Qdisc *sch, struct nlattr *opt)
 	/* Drop excess packets if new limit is lower */
 	qlen = sch->q.qlen;
 	while (sch->q.qlen > sch->limit) {
-		struct sk_buff *skb = __skb_dequeue(&sch->q);
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
 
 		dropped += qdisc_pkt_len(skb);
 		qdisc_qstats_backlog_dec(sch, skb);
-- 
2.7.3
[PATCH net-next 5/5] sched: add and use qdisc_skb_head helpers
This change replaces the sk_buff_head struct in Qdiscs with the new
qdisc_skb_head.

It's similar to the sk_buff_head API, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at the head.
While sk_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of the next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet
Signed-off-by: Florian Westphal
---
 include/net/sch_generic.h | 63 ++-
 net/sched/sch_generic.c   | 21 
 net/sched/sch_htb.c       | 24 +++---
 net/sched/sch_netem.c     | 14 +--
 4 files changed, 94 insertions(+), 28 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 0741ed4..e6aa0a2 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,14 @@ struct qdisc_size_table {
 	u16			data[];
 };
 
+/* similar to sk_buff_head, but skb->prev pointer is undefined. */
+struct qdisc_skb_head {
+	struct sk_buff	*head;
+	struct sk_buff	*tail;
+	__u32		qlen;
+	spinlock_t	lock;
+};
+
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
@@ -76,7 +84,7 @@ struct Qdisc {
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
 	struct sk_buff		*gso_skb ____cacheline_aligned_in_smp;
-	struct sk_buff_head	q;
+	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_packed bstats;
 	seqcount_t		running;
 	struct gnet_stats_queue	qstats;
@@ -600,10 +608,27 @@ static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
 	sch->qstats.overlimits++;
 }
 
+static inline void qdisc_skb_head_init(struct qdisc_skb_head *qh)
+{
+	qh->head = NULL;
+	qh->tail = NULL;
+	qh->qlen = 0;
+}
+
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
-				       struct sk_buff_head *list)
+				       struct qdisc_skb_head *qh)
 {
-	__skb_queue_tail(list, skb);
+	struct sk_buff *last = qh->tail;
+
+	if (last) {
+		skb->next = NULL;
+		last->next = skb;
+		qh->tail = skb;
+	} else {
+		qh->tail = skb;
+		qh->head = skb;
+	}
+	qh->qlen++;
 	qdisc_qstats_backlog_inc(sch, skb);
 
 	return NET_XMIT_SUCCESS;
@@ -614,9 +639,17 @@ static inline int qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch)
 	return __qdisc_enqueue_tail(skb, sch, &sch->q);
 }
 
-static inline struct sk_buff *__qdisc_dequeue_head(struct sk_buff_head *list)
+static inline struct sk_buff *__qdisc_dequeue_head(struct qdisc_skb_head *qh)
 {
-	struct sk_buff *skb = __skb_dequeue(list);
+	struct sk_buff *skb = qh->head;
+
+	if (likely(skb != NULL)) {
+		qh->head = skb->next;
+		qh->qlen--;
+		if (qh->head == NULL)
+			qh->tail = NULL;
+		skb->next = NULL;
+	}
 
 	return skb;
 }
@@ -643,10 +676,10 @@ static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
 }
 
 static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
-						   struct sk_buff_head *list,
+						   struct qdisc_skb_head *qh,
 						   struct sk_buff **to_free)
 {
-	struct sk_buff *skb = __skb_dequeue(list);
+	struct sk_buff *skb = __qdisc_dequeue_head(qh);
 
 	if (likely(skb != NULL)) {
 		unsigned int len = qdisc_pkt_len(skb);
@@ -667,7 +700,9 @@ static inline unsigned int qdisc_queue_drop_head(struct Qdisc *sch,
 
 static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
 {
-	return skb_peek(&sch->q);
+	const struct qdisc_skb_head *qh = &sch->q;
+
+	return qh->head;
 }
 
 /* generic pseudo peek method for non-work-conserving qdisc */
@@ -702,15 +737,19 @@ static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 	return skb;
 }
 
-static inline void __qdisc_reset_queue(struct sk_buff_head *list)
+static inline void __qdisc_reset_queue(struct qdisc_skb_head *qh)
 {
 	/*
 	 * We do not know the backlog in bytes of this list, it
 	 * is up to the caller to correct it
 	 */
-	if (!skb_queue_empty(list)) {
-		rtnl_kfree_skbs(list->next, list->prev);
-		__skb_queue_head_init(list);
+	ASSERT_RTNL();
+	if (qh->qlen) {
+		rtnl_kfree_skbs(qh->head, qh->tail);
+
+		qh->head =
[PATCH net-next 3/5] sched: remove qdisc arg from __qdisc_dequeue_head
Moves qdisc stat accounting to qdisc_dequeue_head.

The only direct caller of the __qdisc_dequeue_head version open-codes
this now.

This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).

Signed-off-by: Florian Westphal
---
 include/net/sch_generic.h | 15 ---
 net/sched/sch_generic.c   |  7 ++-
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 52a2015..0741ed4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -614,11 +614,17 @@ static inline int qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch)
 	return __qdisc_enqueue_tail(skb, sch, &sch->q);
 }
 
-static inline struct sk_buff *__qdisc_dequeue_head(struct Qdisc *sch,
-						   struct sk_buff_head *list)
+static inline struct sk_buff *__qdisc_dequeue_head(struct sk_buff_head *list)
 {
 	struct sk_buff *skb = __skb_dequeue(list);
 
+	return skb;
+}
+
+static inline struct sk_buff *qdisc_dequeue_head(struct Qdisc *sch)
+{
+	struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
 	if (likely(skb != NULL)) {
 		qdisc_qstats_backlog_dec(sch, skb);
 		qdisc_bstats_update(sch, skb);
@@ -627,11 +633,6 @@ static inline struct sk_buff *__qdisc_dequeue_head(struct Qdisc *sch,
 	return skb;
 }
 
-static inline struct sk_buff *qdisc_dequeue_head(struct Qdisc *sch)
-{
-	return __qdisc_dequeue_head(sch, &sch->q);
-}
-
 /* Instead of calling kfree_skb() while root qdisc lock is held,
  * queue the skb for future freeing at end of __dev_xmit_skb()
  */
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 5e63bf6..73877d9 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -506,7 +506,12 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 
 	if (likely(band >= 0)) {
 		struct sk_buff_head *list = band2list(priv, band);
-		struct sk_buff *skb = __qdisc_dequeue_head(qdisc, list);
+		struct sk_buff *skb = __qdisc_dequeue_head(list);
+
+		if (likely(skb != NULL)) {
+			qdisc_qstats_backlog_dec(qdisc, skb);
+			qdisc_bstats_update(qdisc, skb);
+		}
 
 		qdisc->q.qlen--;
 		if (skb_queue_empty(list))
-- 
2.7.3
[PATCH net-next 2/5] sched: don't use skb queue helpers
A followup change will replace the sk_buff_head in the qdisc struct
with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler warnings.

Open-code these accesses in an extra change to ease review.

Signed-off-by: Florian Westphal
---
 net/sched/sch_fifo.c    | 4 ++--
 net/sched/sch_generic.c | 2 +-
 net/sched/sch_netem.c   | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index baeed6a..1e37247 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -31,7 +31,7 @@ static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			 struct sk_buff **to_free)
 {
-	if (likely(skb_queue_len(&sch->q) < sch->limit))
+	if (likely(sch->q.qlen < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
 	return qdisc_drop(skb, sch, to_free);
@@ -42,7 +42,7 @@ static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 {
 	unsigned int prev_backlog;
 
-	if (likely(skb_queue_len(&sch->q) < sch->limit))
+	if (likely(sch->q.qlen < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
 	prev_backlog = sch->qstats.backlog;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0d21b56..5e63bf6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -486,7 +486,7 @@ static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
 static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
 			      struct sk_buff **to_free)
 {
-	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
+	if (qdisc->q.qlen < qdisc_dev(qdisc)->tx_queue_len) {
 		int band = prio2band[skb->priority & TC_PRIO_MAX];
 		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 		struct sk_buff_head *list = band2list(priv, band);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index aaaf021..1832d77 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -502,7 +502,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			1<<(prandom_u32() % 8);
 	}
 
-	if (unlikely(skb_queue_len(&sch->q) >= sch->limit))
+	if (unlikely(sch->q.qlen >= sch->limit))
 		return qdisc_drop(skb, sch, to_free);
 
 	qdisc_qstats_backlog_inc(sch, skb);
@@ -522,7 +522,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (q->rate) {
 		struct sk_buff *last;
 
-		if (!skb_queue_empty(&sch->q))
+		if (sch->q.qlen)
 			last = skb_peek_tail(&sch->q);
 		else
 			last = netem_rb_to_skb(rb_last(&q->t_root));
-- 
2.7.3
Re: [PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
On Sat, 2016-09-17 at 15:27 -0700, Florian Fainelli wrote:
> Gary has not been with Broadcom for some time now, replace his address
> with the internal mailing-list used for other entries.
>
> Signed-off-by: Florian Fainelli
> ---
> Michael,
>
> Since this is an old driver, not sure who could step up as a maintainer
> for b44?
[]
> diff --git a/MAINTAINERS b/MAINTAINERS
[]
> @@ -2500,8 +2500,8 @@ S:	Supported
>  F:	kernel/bpf/
>
>  BROADCOM B44 10/100 ETHERNET DRIVER
> -M:	Gary Zambrano
>  L:	netdev@vger.kernel.org
> +M:	bcm-kernel-feedback-l...@broadcom.com
>  S:	Supported
>  F:	drivers/net/ethernet/broadcom/b44.*

Without an actual maintainer, this should really be
orphan and not supported.

And the M: bcm-kernel-feedback-list@ should be L:

BCM4401 NICs are essentially from 2002.

Does anyone really use these any longer with a current
distribution or kernel version?
[PATCH net] MAINTAINERS: Gary Zambrano's email is bouncing
Gary has not been with Broadcom for some time now, replace his address
with the internal mailing-list used for other entries.

Signed-off-by: Florian Fainelli
---
Michael,

Since this is an old driver, not sure who could step up as a maintainer
for b44?

 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index a5e1270dfbf1..dffc3bca17ee 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2500,8 +2500,8 @@ S:	Supported
 F:	kernel/bpf/
 
 BROADCOM B44 10/100 ETHERNET DRIVER
-M:	Gary Zambrano
 L:	netdev@vger.kernel.org
+M:	bcm-kernel-feedback-l...@broadcom.com
 S:	Supported
 F:	drivers/net/ethernet/broadcom/b44.*
 
-- 
2.7.4
[PATCH 2/2] net: ethernet: broadcom: b44: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/broadcom/b44.c | 98 +++
 1 files changed, 54 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c
index 936f06f..17aa33c 100644
--- a/drivers/net/ethernet/broadcom/b44.c
+++ b/drivers/net/ethernet/broadcom/b44.c
@@ -1832,58 +1832,65 @@ static int b44_nway_reset(struct net_device *dev)
 	return r;
 }
 
-static int b44_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int b44_get_link_ksettings(struct net_device *dev,
+				  struct ethtool_link_ksettings *cmd)
 {
 	struct b44 *bp = netdev_priv(dev);
+	u32 supported, advertising;
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY) {
 		BUG_ON(!dev->phydev);
-		return phy_ethtool_gset(dev->phydev, cmd);
+		return phy_ethtool_ksettings_get(dev->phydev, cmd);
 	}
 
-	cmd->supported = (SUPPORTED_Autoneg);
-	cmd->supported |= (SUPPORTED_100baseT_Half |
-			  SUPPORTED_100baseT_Full |
-			  SUPPORTED_10baseT_Half |
-			  SUPPORTED_10baseT_Full |
-			  SUPPORTED_MII);
+	supported = (SUPPORTED_Autoneg);
+	supported |= (SUPPORTED_100baseT_Half |
+		      SUPPORTED_100baseT_Full |
+		      SUPPORTED_10baseT_Half |
+		      SUPPORTED_10baseT_Full |
+		      SUPPORTED_MII);
 
-	cmd->advertising = 0;
+	advertising = 0;
 	if (bp->flags & B44_FLAG_ADV_10HALF)
-		cmd->advertising |= ADVERTISED_10baseT_Half;
+		advertising |= ADVERTISED_10baseT_Half;
 	if (bp->flags & B44_FLAG_ADV_10FULL)
-		cmd->advertising |= ADVERTISED_10baseT_Full;
+		advertising |= ADVERTISED_10baseT_Full;
 	if (bp->flags & B44_FLAG_ADV_100HALF)
-		cmd->advertising |= ADVERTISED_100baseT_Half;
+		advertising |= ADVERTISED_100baseT_Half;
 	if (bp->flags & B44_FLAG_ADV_100FULL)
-		cmd->advertising |= ADVERTISED_100baseT_Full;
-	cmd->advertising |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
-	ethtool_cmd_speed_set(cmd, ((bp->flags & B44_FLAG_100_BASE_T) ?
-				     SPEED_100 : SPEED_10));
-	cmd->duplex = (bp->flags & B44_FLAG_FULL_DUPLEX) ?
+		advertising |= ADVERTISED_100baseT_Full;
+	advertising |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
+	cmd->base.speed = (bp->flags & B44_FLAG_100_BASE_T) ?
+		SPEED_100 : SPEED_10;
+	cmd->base.duplex = (bp->flags & B44_FLAG_FULL_DUPLEX) ?
 		DUPLEX_FULL : DUPLEX_HALF;
-	cmd->port = 0;
-	cmd->phy_address = bp->phy_addr;
-	cmd->transceiver = (bp->flags & B44_FLAG_EXTERNAL_PHY) ?
-		XCVR_EXTERNAL : XCVR_INTERNAL;
-	cmd->autoneg = (bp->flags & B44_FLAG_FORCE_LINK) ?
+	cmd->base.port = 0;
+	cmd->base.phy_address = bp->phy_addr;
+	cmd->base.autoneg = (bp->flags & B44_FLAG_FORCE_LINK) ?
 		AUTONEG_DISABLE : AUTONEG_ENABLE;
-	if (cmd->autoneg == AUTONEG_ENABLE)
-		cmd->advertising |= ADVERTISED_Autoneg;
+	if (cmd->base.autoneg == AUTONEG_ENABLE)
+		advertising |= ADVERTISED_Autoneg;
+
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+						supported);
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+						advertising);
+
 	if (!netif_running(dev)){
-		ethtool_cmd_speed_set(cmd, 0);
-		cmd->duplex = 0xff;
+		cmd->base.speed = 0;
+		cmd->base.duplex = 0xff;
 	}
-	cmd->maxtxpkt = 0;
-	cmd->maxrxpkt = 0;
+
 	return 0;
 }
 
-static int b44_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int b44_set_link_ksettings(struct net_device *dev,
+				  const struct ethtool_link_ksettings *cmd)
 {
 	struct b44 *bp = netdev_priv(dev);
 	u32 speed;
 	int ret;
+	u32 advertising;
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY) {
 		BUG_ON(!dev->phydev);
@@ -1891,31 +1898,34 @@ static int b44_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 		if (netif_running(dev))
 			b44_setup_phy(bp);
 
-		ret = phy_ethtool_sset(dev->phydev, cmd);
+		ret = phy_ethtool_ksettings_set(dev->phydev, cmd);
 
 		spin_unlock_irq(&bp->lock);
 
 		return ret;
 	}
 
-	speed = ethtool_cmd_speed(cmd);
+	speed = cmd->base.speed;
+
+
[PATCH 1/2] net: ethernet: broadcom: b44: use phydev from struct net_device
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phydev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/broadcom/b44.c | 22 +++---
 drivers/net/ethernet/broadcom/b44.h |  1 -
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c
index 74f0a37..936f06f 100644
--- a/drivers/net/ethernet/broadcom/b44.c
+++ b/drivers/net/ethernet/broadcom/b44.c
@@ -1486,7 +1486,7 @@ static int b44_open(struct net_device *dev)
 	b44_enable_ints(bp);
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY)
-		phy_start(bp->phydev);
+		phy_start(dev->phydev);
 
 	netif_start_queue(dev);
 out:
@@ -1651,7 +1651,7 @@ static int b44_close(struct net_device *dev)
 	netif_stop_queue(dev);
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY)
-		phy_stop(bp->phydev);
+		phy_stop(dev->phydev);
 
 	napi_disable(&bp->napi);
 
@@ -1837,8 +1837,8 @@ static int b44_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	struct b44 *bp = netdev_priv(dev);
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY) {
-		BUG_ON(!bp->phydev);
-		return phy_ethtool_gset(bp->phydev, cmd);
+		BUG_ON(!dev->phydev);
+		return phy_ethtool_gset(dev->phydev, cmd);
 	}
 
 	cmd->supported = (SUPPORTED_Autoneg);
@@ -1886,12 +1886,12 @@ static int b44_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	int ret;
 
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY) {
-		BUG_ON(!bp->phydev);
+		BUG_ON(!dev->phydev);
 		spin_lock_irq(&bp->lock);
 		if (netif_running(dev))
 			b44_setup_phy(bp);
 
-		ret = phy_ethtool_sset(bp->phydev, cmd);
+		ret = phy_ethtool_sset(dev->phydev, cmd);
 
 		spin_unlock_irq(&bp->lock);
 
@@ -2137,8 +2137,8 @@ static int b44_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	spin_lock_irq(&bp->lock);
 	if (bp->flags & B44_FLAG_EXTERNAL_PHY) {
-		BUG_ON(!bp->phydev);
-		err = phy_mii_ioctl(bp->phydev, ifr, cmd);
+		BUG_ON(!dev->phydev);
+		err = phy_mii_ioctl(dev->phydev, ifr, cmd);
 	} else {
 		err = generic_mii_ioctl(&bp->mii_if, if_mii(ifr), cmd, NULL);
 	}
@@ -2206,7 +2206,7 @@ static const struct net_device_ops b44_netdev_ops = {
 static void b44_adjust_link(struct net_device *dev)
 {
 	struct b44 *bp = netdev_priv(dev);
-	struct phy_device *phydev = bp->phydev;
+	struct phy_device *phydev = dev->phydev;
 	bool status_changed = 0;
 
 	BUG_ON(!phydev);
@@ -2303,7 +2303,6 @@ static int b44_register_phy_one(struct b44 *bp)
 			      SUPPORTED_MII);
 	phydev->advertising = phydev->supported;
 
-	bp->phydev = phydev;
 	bp->old_link = 0;
 	bp->phy_addr = phydev->mdio.addr;
 
@@ -2323,9 +2322,10 @@ err_out:
 
 static void b44_unregister_phy_one(struct b44 *bp)
 {
+	struct net_device *dev = bp->dev;
 	struct mii_bus *mii_bus = bp->mii_bus;
 
-	phy_disconnect(bp->phydev);
+	phy_disconnect(dev->phydev);
 	mdiobus_unregister(mii_bus);
 	mdiobus_free(mii_bus);
 }
diff --git a/drivers/net/ethernet/broadcom/b44.h b/drivers/net/ethernet/broadcom/b44.h
index 65d88d7..89d2cf3 100644
--- a/drivers/net/ethernet/broadcom/b44.h
+++ b/drivers/net/ethernet/broadcom/b44.h
@@ -404,7 +404,6 @@ struct b44 {
 	u32			tx_pending;
 	u8			phy_addr;
 	u8			force_copybreak;
-	struct phy_device	*phydev;
 	struct mii_bus		*mii_bus;
 	int			old_link;
 	struct mii_if_info	mii_if;
-- 
1.7.4.4
Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
Hi all,

I have an odroid c2 board which shows this issue. No data is
transmitted or received after a moment of intense tx traffic. Copying
a 1GB file via scp from the board triggers it repeatedly.

The board has a stmmac - user ID: 0x11, Synopsys ID: 0x37.

When switching the network to 100Mb/s the copying does not seem to
trigger the issue.

I've attached the ethtool statistics before and after the problem.

Thanks for your help,
André

> Hi Alexandre,
>
> On Mon, Sep 12, 2016 at 6:37 PM, Alexandre Torgue wrote:
> > Which Synopsys IP version do you use ?
> found this in a dmesg log:
> [1.504784] stmmac - user ID: 0x11, Synopsys ID: 0x37
> [1.509785] Ring mode enabled
> [1.512796] DMA HW capability register supported
> [1.517286] Normal descriptors
> [1.520565] RX Checksum Offload Engine supported
> [1.525219] COE Type 2
> [1.527638] TX Checksum insertion supported
> [1.531862] Wake-Up On Lan supported
> [1.535483] Enable RX Mitigation via HW Watchdog Timer
> [1.543851] libphy: stmmac: probed
> [1.544025] eth0: PHY ID 001cc916 at 0 IRQ POLL (stmmac-0:00) active
> [1.550321] eth0: PHY ID 001cc916 at 7 IRQ POLL (stmmac-0:07)
>
> >> Gbit ethernet on my device is provided by a Realtek RTL8211F RGMII
> >> PHY. Similar issues were reported in #linux-amlogic by a user with
> >> an Odroid C2 board (= similar hardware).
> >>
> >> The symptoms are:
> >> Receiving data is plenty fast (I can max out my internet connection
> >> easily, and with iperf3 I get ~900Mbit/s).
> >> Transmitting data from the device is unfortunately very slow,
> >> traffic sometimes even stalls completely.
> >>
> >> I have attached the iperf results and the output of
> >> /sys/kernel/debug/stmmaceth/eth0/descriptors_status.
> >> Below you can find the ifconfig, netstat and stmmac dma_cap info
> >> (*after* I ran all tests).
> >>
> >> The "involved parties" are:
> >> - Meson GXBB specific network configuration registers (I have
> >> double-checked them with the reference drivers: everything seems
> >> fine here)
> >> - stmmac: it seems that nobody else has reported these kind of
> >> issues so far, however I'd still like to hear where I should
> >> enable some debugging bits to rule out any stmmac bug
> >
> >
> > On my side, I just tested on the same "kind" of system:
> > -SYNOPSYS GMAC 3.7
> > -RTL8211EG as PHY
> >
> > With iperf, I reach:
> > -RX: 932 Mbps
> > -TX: 820Mbps
> >
> > Can you check ethtool -S eth0 (most precisely "MMC"counter and
> > errors) ? Which kernel version do you use ?
> I am using a 4.8.0-rc4 kernel, based on Kevin's "integration" branch: [0]
> Unfortunately I don't have access to my device in the next few
> days, but I'll keep you updated once I have the ethtool output.
>
>
> Thanks for your time
> Regards,
> Martin
>
>
> [0]
> https://git.kernel.org/cgit/linux/kernel/git/khilman/linux-amlogic.git/log/?h=v4.8/integ
>
> ___
> linux-amlogic mailing list
> linux-amlo...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-amlogic
>

ethstats.after
Description: Binary data

ethstats.before
Description: Binary data
Re: [PATCH v2 net-next 07/16] tcp: track data delivery rate for a TCP connection
On Sat, Sep 17, 2016 at 12:04 PM, kbuild test robot wrote:
> Hi Yuchung,
>
> [auto build test ERROR on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160918-014058
> config: x86_64-randconfig-s2-09180225 (attached as .config)
> compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64
>
> All error/warnings (new ones prefixed by >>):
>
>    net/ipv4/tcp_input.c: In function 'tcp_ack':
> >> net/ipv4/tcp_input.c:3559: error: unknown field 'v64' specified in initializer
> >> net/ipv4/tcp_input.c:3559: warning: missing braces around initializer
>    net/ipv4/tcp_input.c:3559: warning: (near initialization for 'rs.prior_mstamp.<anonymous>')
>
> vim +/v64 +3559 net/ipv4/tcp_input.c
>
>   3553  /* This routine deals with incoming acks, but not outgoing ones. */
>   3554  static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>   3555  {
>   3556          struct inet_connection_sock *icsk = inet_csk(sk);
>   3557          struct tcp_sock *tp = tcp_sk(sk);
>   3558          struct tcp_sacktag_state sack_state;
> > 3559          struct rate_sample rs = { .prior_mstamp.v64 = 0, .prior_delivered = 0 };

Arg, silly compilers out there. We can omit prior_mstamp, as the
compiler will zero fields anyway:

struct rate_sample rs = { .prior_delivered = 0 };
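
For readers unfamiliar with the rule being relied on in this reply: in C, members that are not named in a designated initializer are initialized as if they had static storage duration, i.e. to zero. The sketch below shows this with hypothetical stand-in types (struct mstamp_sketch and struct sample_sketch are not the kernel's definitions; the layout is only assumed to resemble a timestamp held in an anonymous union).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamp type wrapping an anonymous union. */
struct mstamp_sketch {
	union {
		uint64_t v64;
		uint32_t v32[2];
	};
};

/* Hypothetical stand-in for the sample struct from the build report. */
struct sample_sketch {
	struct mstamp_sketch prior_mstamp;
	uint32_t prior_delivered;
	int32_t delivered;
};

static struct sample_sketch make_sample(void)
{
	/* Only one member is named; all others are zero-initialized,
	 * so the anonymous-union member never has to be spelled out.
	 */
	struct sample_sketch rs = { .prior_delivered = 0 };

	return rs;
}
```

This avoids naming a field inside the anonymous union at all, which is the construct the old compiler in the report rejects.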
Re: [PATCH v2 net-next 07/16] tcp: track data delivery rate for a TCP connection
Hi Yuchung,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160918-014058
config: x86_64-randconfig-s2-09180225 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All error/warnings (new ones prefixed by >>):

   net/ipv4/tcp_input.c: In function 'tcp_ack':
>> net/ipv4/tcp_input.c:3559: error: unknown field 'v64' specified in initializer
>> net/ipv4/tcp_input.c:3559: warning: missing braces around initializer
   net/ipv4/tcp_input.c:3559: warning: (near initialization for 'rs.prior_mstamp.<anonymous>')

vim +/v64 +3559 net/ipv4/tcp_input.c

  3553  /* This routine deals with incoming acks, but not outgoing ones. */
  3554  static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
  3555  {
  3556          struct inet_connection_sock *icsk = inet_csk(sk);
  3557          struct tcp_sock *tp = tcp_sk(sk);
  3558          struct tcp_sacktag_state sack_state;
> 3559          struct rate_sample rs = { .prior_mstamp.v64 = 0, .prior_delivered = 0 };
  3560          u32 prior_snd_una = tp->snd_una;
  3561          u32 ack_seq = TCP_SKB_CB(skb)->seq;
  3562          u32 ack = TCP_SKB_CB(skb)->ack_seq;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
[PATCH v2 net-next 10/16] tcp: allow congestion control module to request TSO skb segment count
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.

This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h     |  2 ++
 net/ipv4/tcp_output.c | 15 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a69ed7f..f8f581f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -913,6 +913,8 @@ struct tcp_congestion_ops {
 	u32 (*undo_cwnd)(struct sock *sk);
 	/* hook for packet ack accounting (optional) */
 	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
+	/* suggest number of segments for each skb to transmit (optional) */
+	u32 (*tso_segs_goal)(struct sock *sk);
 	/* get info for inet_diag (optional) */
 	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
 			   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e02c8eb..0137956 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1566,6 +1566,17 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
 	return min_t(u32, segs, sk->sk_gso_max_segs);
 }
 
+/* Return the number of segments we want in the skb we are transmitting.
+ * See if congestion control module wants to decide; otherwise, autosize.
+ */
+static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
+{
+	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
+	u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
+
+	return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -2061,7 +2072,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
-	max_segs = tcp_tso_autosize(sk, mss_now);
+	max_segs = tcp_tso_segs(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -2778,7 +2789,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 		last_lost = tp->snd_una;
 	}
 
-	max_segs = tcp_tso_autosize(sk, tcp_current_mss(sk));
+	max_segs = tcp_tso_segs(sk, tcp_current_mss(sk));
 	tcp_for_write_queue_from(skb, sk) {
 		__u8 sacked;
 		int segs;
-- 
2.8.0.rc3.226.g39d4020
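
The "optional hook with fallback" pattern in this patch can be sketched outside the kernel as follows. Everything here is hypothetical scaffolding (struct conn_sketch, struct tso_ops_sketch, autosize_segs and the two example callbacks); only the decision logic mirrors the patch: a missing hook or a return value of 0 means "no preference", and the default sizing is used. The portable conditional operator is used instead of the GNU `?:` shorthand seen in the diff.

```c
#include <stddef.h>
#include <stdint.h>

struct conn_sketch;

/* Hypothetical ops table with one optional callback. */
struct tso_ops_sketch {
	/* Returns a desired segment count, or 0 for "don't care". May be NULL. */
	uint32_t (*tso_segs_goal)(struct conn_sketch *c);
};

struct conn_sketch {
	const struct tso_ops_sketch *ops;
	uint32_t autosize_segs;	/* stand-in for the default autosizing result */
};

static uint32_t autosize(const struct conn_sketch *c)
{
	return c->autosize_segs;
}

/* Ask the module first; fall back to autosizing on NULL hook or 0. */
static uint32_t tso_segs(struct conn_sketch *c)
{
	uint32_t goal = c->ops->tso_segs_goal ? c->ops->tso_segs_goal(c) : 0;

	return goal ? goal : autosize(c);
}

/* Example callbacks: one with a firm preference, one without. */
static uint32_t goal_two(struct conn_sketch *c)
{
	(void)c;
	return 2;
}

static uint32_t goal_dont_care(struct conn_sketch *c)
{
	(void)c;
	return 0;
}
```

Treating 0 as "no preference" keeps the hook optional in two ways: a module may leave the pointer unset, or may implement it and still defer to the default on a per-call basis.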
[PATCH v2 net-next 14/16] tcp: new CC hook to set sending rate with rate_sample in any CA state
From: Yuchung Cheng

This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h    |  4 
 net/ipv4/tcp_cong.c  |  2 +-
 net/ipv4/tcp_input.c | 17 ++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1aa9628..f83b7f2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -919,6 +919,10 @@ struct tcp_congestion_ops {
 	u32 (*tso_segs_goal)(struct sock *sk);
 	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
 	u32 (*sndbuf_expand)(struct sock *sk);
+	/* call when packets are delivered to update cwnd and pacing rate,
+	 * after all the ca_state processing. (optional)
+	 */
+	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
 	/* get info for inet_diag (optional) */
 	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
 			   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 882caa4..1294af4 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -69,7 +69,7 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 	int ret = 0;
 
 	/* all algorithms must implement ssthresh and cong_avoid ops */
-	if (!ca->ssthresh || !ca->cong_avoid) {
+	if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
 		pr_err("%s does not implement required ops\n", ca->name);
 		return -EINVAL;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a134e66..931fe32 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2536,6 +2536,9 @@ static inline void tcp_end_cwnd_reduction(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (inet_csk(sk)->icsk_ca_ops->cong_control)
+		return;
+
 	/* Reset cwnd to ssthresh in CWR or Recovery (unless it's undone) */
 	if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
 	    (tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
@@ -3312,8 +3315,15 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
  * information. All transmission or retransmission are delayed afterwards.
  */
 static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
-			     int flag)
+			     int flag, const struct rate_sample *rs)
 {
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ca_ops->cong_control) {
+		icsk->icsk_ca_ops->cong_control(sk, rs);
+		return;
+	}
+
 	if (tcp_in_cwnd_reduction(sk)) {
 		/* Reduce cwnd if state mandates */
 		tcp_cwnd_reduction(sk, acked_sacked, flag);
@@ -3683,7 +3693,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
 	lost = tp->lost - lost;			/* freshly marked lost */
 	tcp_rate_gen(sk, delivered, lost, &now, &rs);
-	tcp_cong_control(sk, ack, delivered, flag);
+	tcp_cong_control(sk, ack, delivered, flag, &rs);
 	tcp_xmit_recovery(sk, rexmit);
 	return 1;
 
@@ -5981,7 +5991,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else
 			tcp_init_metrics(sk);
 
-		tcp_update_pacing_rate(sk);
+		if (!inet_csk(sk)->icsk_ca_ops->cong_control)
+
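
The relaxed registration rule in the tcp_cong.c hunk (ssthresh is mandatory, and at least one of cong_avoid or cong_control must be present) can be checked in isolation with a small sketch. The struct and function names below are hypothetical and the callback signatures are simplified to take an opaque pointer; only the boolean condition is taken from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified ops table; the real one has many more hooks. */
struct ca_ops_sketch {
	const char *name;
	unsigned int (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk);
	void (*cong_control)(void *sk);
};

/* Returns 0 if the table is acceptable, -EINVAL otherwise. */
static int ca_validate(const struct ca_ops_sketch *ca)
{
	if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control))
		return -EINVAL;
	return 0;
}

/* Stub callbacks used only to populate example tables. */
static unsigned int stub_ssthresh(void *sk)
{
	(void)sk;
	return 2;
}

static void stub_hook(void *sk)
{
	(void)sk;
}
```

A module that supplies only cong_control passes this check, which is what lets a rate-based algorithm skip cong_avoid entirely.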
[PATCH v2 net-next 05/16] tcp: switch back to proper tcp_skb_cb size check in tcp_init()
From: Eric Dumazet

Revert to the tcp_skb_cb size check that tcp_init() had before commit
b4772ef879a8 ("net: use common macro for assering skb->cb[] available
size in protocol families"). As related commit 744d5a3e9fe2 ("net:
move skb->dropcount to skb->cb[]") explains, the
sock_skb_cb_check_size() mechanism was added to ensure that there is
space for dropcount, "for protocol families using it". But TCP is not
a protocol using dropcount, so tcp_init() doesn't need to provision
space for dropcount in the skb->cb[], and thus we can revert to the
older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
more bytes of the skb->cb[] space.

Fixes: b4772ef879a8 ("net: use common macro for assering skb->cb[] available size in protocol families")
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
---
 net/ipv4/tcp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5b0b49c..53798e1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3244,11 +3244,12 @@ static void __init tcp_init_mem(void)
 
 void __init tcp_init(void)
 {
-	unsigned long limit;
 	int max_rshare, max_wshare, cnt;
+	unsigned long limit;
+	struct sk_buff *skb;
 	unsigned int i;
 
-	sock_skb_cb_check_size(sizeof(struct tcp_skb_cb));
+	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
 
 	percpu_counter_init(&tcp_sockets_allocated, 0, GFP_KERNEL);
 	percpu_counter_init(&tcp_orphan_count, 0, GFP_KERNEL);
-- 
2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 07/16] tcp: track data delivery rate for a TCP connection
From: Yuchung Cheng

This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
               delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK

  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and t1 are
times at which packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT.
We approach this by recording the info we need to compute the initial RTT (duration of the "send phase" of the window) when we recorded the associated inflight. Then, for a filter to avoid bandwidth overestimates, we generalize the per-sample bandwidth computation from: bw = delivered / ack_phase_rtt to the following: bw = delivered / max(send_phase_rtt, ack_phase_rtt) In large-scale experiments, this filtering approach incorporating send_phase_rtt is effective at avoiding bandwidth overestimates due to ACK compression or stretched ACKs. Signed-off-by: Van Jacobson Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/linux/tcp.h | 2 + include/net/tcp.h | 35 +++- net/ipv4/Makefile | 2 +- net/ipv4/tcp_input.c | 46 +++- net/ipv4/tcp_output.c | 4 ++ net/ipv4/tcp_rate.c | 149 ++ 6 files changed, 222 insertions(+), 16 deletions(-) create mode 100644 net/ipv4/tcp_rate.c diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 38590fb..c50e6ae 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -268,6 +268,8 @@ struct tcp_sock { u32 prr_out;/* Total number of pkts sent during Recovery. */ u32 delivered; /* Total data packets delivered incl. rexmits */ u32 lost; /* Total data packets lost incl. 
rexmits */ + struct skb_mstamp first_tx_mstamp; /* start of window send phase */ + struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */ u32 rcv_wnd;/* Current receiver window */ u32 write_seq; /* Tail(+1) of data held in tcp send buffer */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 2f1648a..b261c89 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -763,8 +763,14 @@ struct tcp_skb_cb { __u32 ack_seq;/* Sequence number ACK'd*/ union { struct { - /* There is space for up to 20 bytes */ + /* There is space for up to 24 bytes */ __u32 in_flight;/* Bytes in flight when packet sent */ + /* pkts S/ACKed so far upon tx of skb, incl retrans: */ + __u32 delivered; + /* start of send pipeline phase */ + struct skb_mstamp first_tx_mstamp; + /* when we reached the "delivered" count */ + struct skb_mstamp delivered_mstamp; } tx; /* only used for outgoing skbs */ union { struct inet_skb_parmh4; @@ -860,6 +866,26 @@ struct ack_sample {
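The rate-sample arithmetic described in this commit message can be illustrated with a small userspace sketch. This is not the kernel code from the patch: the struct names, the `sent_time_us` field, and the helper functions are hypothetical, and the sketch only assumes that the send phase is measured from `first_tx_mstamp` to the transmit time of the acked packet.

```c
#include <stdint.h>

/* Hypothetical snapshot taken when a packet is transmitted (d0, t0). */
struct tx_snapshot {
	uint32_t delivered;         /* d0: delivered count at transmit time */
	uint64_t delivered_time_us; /* t0: when "delivered" was last updated */
	uint64_t first_tx_time_us;  /* start of the window's send phase */
	uint64_t sent_time_us;      /* assumed: when this packet was sent */
};

/* Hypothetical per-ACK rate sample. */
struct rate_sample_sketch {
	uint32_t delivered;   /* d1 - d0 */
	uint64_t interval_us; /* max(send phase, ACK phase) */
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/* Build a sample as delivered / max(send_phase_rtt, ack_phase_rtt). */
struct rate_sample_sketch rate_sample_gen(const struct tx_snapshot *snap,
					  uint32_t delivered_now,
					  uint64_t now_us)
{
	struct rate_sample_sketch rs;
	uint64_t send_phase_us = snap->sent_time_us - snap->first_tx_time_us;
	uint64_t ack_phase_us = now_us - snap->delivered_time_us;

	rs.delivered = delivered_now - snap->delivered;
	rs.interval_us = max_u64(send_phase_us, ack_phase_us);
	return rs;
}

/* Packets per second; returns 0 for an empty interval. */
uint64_t rate_sample_pps(const struct rate_sample_sketch *rs)
{
	if (rs->interval_us == 0)
		return 0;
	return (uint64_t)rs->delivered * 1000000ULL / rs->interval_us;
}
```

Taking the longer of the two phases as the denominator is what keeps compressed or stretched ACKs from inflating the estimate: a short ACK phase cannot make the sample faster than the rate at which the data was actually sent.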
[PATCH v2 net-next 04/16] net_sched: sch_fq: add low_rate_threshold parameter
From: Eric Dumazet

This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.

This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.

The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/uapi/linux/pkt_sched.h | 2 ++
 net/sched/sch_fq.c | 22 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 2382eed..f8e39db 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -792,6 +792,8 @@ enum {
 
 	TCA_FQ_ORPHAN_MASK,	/* mask applied to orphaned skb hashes */
 
+	TCA_FQ_LOW_RATE_THRESHOLD, /* per packet delay under this rate */
+
 	__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index e5458b9..40ad4fc 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -94,6 +94,7 @@ struct fq_sched_data {
 	u32		flow_max_rate;	/* optional max rate per flow */
 	u32		flow_plimit;	/* max packets per flow */
 	u32		orphan_mask;	/* mask for orphaned skb */
+	u32		low_rate_threshold;
 	struct rb_root	*fq_root;
 	u8		rate_enable;
 	u8		fq_trees_log;
@@ -433,7 +434,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	struct fq_flow_head *head;
 	struct sk_buff *skb;
 	struct fq_flow *f;
-	u32 rate;
+	u32 rate, plen;
 
 	skb = fq_dequeue_head(sch, &q->internal);
 	if (skb)
@@ -482,7 +483,7 @@ begin:
 	prefetch(&skb->end);
 	f->credit -= qdisc_pkt_len(skb);
 
-	if (f->credit > 0 || !q->rate_enable)
+	if (!q->rate_enable)
 		goto out;
 
 	/* Do not pace locally generated ack packets */
@@ -493,8 +494,15 @@ begin:
 	if (skb->sk)
 		rate = min(skb->sk->sk_pacing_rate, rate);
 
+	if (rate <= q->low_rate_threshold) {
+		f->credit = 0;
+		plen = qdisc_pkt_len(skb);
+	} else {
+		plen = max(qdisc_pkt_len(skb), q->quantum);
+		if (f->credit > 0)
+			goto out;
+	}
 	if (rate != ~0U) {
-		u32 plen = max(qdisc_pkt_len(skb), q->quantum);
 		u64 len = (u64)plen * NSEC_PER_SEC;
 
 		if (likely(rate))
@@ -662,6 +670,7 @@ static const struct nla_policy fq_policy[TCA_FQ_MAX + 1] = {
 	[TCA_FQ_FLOW_MAX_RATE]		= { .type = NLA_U32 },
 	[TCA_FQ_BUCKETS_LOG]		= { .type = NLA_U32 },
 	[TCA_FQ_FLOW_REFILL_DELAY]	= { .type = NLA_U32 },
+	[TCA_FQ_LOW_RATE_THRESHOLD]	= { .type = NLA_U32 },
 };
 
 static int fq_change(struct Qdisc *sch, struct nlattr *opt)
@@ -716,6 +725,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
 	if (tb[TCA_FQ_FLOW_MAX_RATE])
 		q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
 
+	if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
+		q->low_rate_threshold =
+			nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
+
 	if (tb[TCA_FQ_RATE_ENABLE]) {
 		u32 enable = nla_get_u32(tb[TCA_FQ_RATE_ENABLE]);
 
@@ -781,6 +794,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->fq_root		= NULL;
 	q->fq_trees_log		= ilog2(1024);
 	q->orphan_mask		= 1024 - 1;
+	q->low_rate_threshold	= 550000 / 8;
 	qdisc_watchdog_init(&q->watchdog, sch);
 
 	if (opt)
@@ -811,6 +825,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
 			jiffies_to_usecs(q->flow_refill_delay)) ||
 	    nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
+	    nla_put_u32(skb, TCA_FQ_LOW_RATE_THRESHOLD,
+			q->low_rate_threshold) ||
 	    nla_put_u32(skb, TCA_FQ_BUCKETS_LOG,
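The effect of the threshold on the per-packet delay can be sketched in userspace as follows. The function name and signature are hypothetical, and the sketch simplifies two things relative to the patch: it returns 0 for a zero rate instead of leaving the delay undivided, and it applies no upper clamp to the delay. Rates are in bytes per second, as in the diff.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: delay in nanoseconds to insert after a packet.
 * A return value of 0 means "no pacing delay for this packet".
 */
uint64_t fq_pacing_delay_ns(uint32_t pkt_len, uint32_t quantum,
			    int64_t *credit, uint32_t rate,
			    uint32_t low_rate_threshold)
{
	uint32_t plen;

	if (rate <= low_rate_threshold) {
		/* Low rate: drop any credit and delay after every packet,
		 * sized by the real packet length.
		 */
		*credit = 0;
		plen = pkt_len;
	} else {
		/* Normal rate: size by at least one quantum, and skip the
		 * delay entirely while the flow still has credit.
		 */
		plen = pkt_len > quantum ? pkt_len : quantum;
		if (*credit > 0)
			return 0;
	}
	if (rate == ~0U || rate == 0)	/* unlimited, or simplified zero case */
		return 0;
	return (uint64_t)plen * NSEC_PER_SEC / rate;
}
```

With the default threshold of 550000 / 8 = 68750 bytes per second, a 1500-byte packet on a flow paced at 62500 bytes per second (500Kbps) is followed by a 24 ms gap regardless of the flow's remaining credit.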
[PATCH v2 net-next 03/16] tcp: use windowed min filter library for TCP min_rtt estimation
Refactor the TCP min_rtt code to reuse the new win_minmax library in lib/win_minmax.c to simplify the TCP code. This is a pure refactor: the functionality is exactly the same. We just moved the windowed min code to make TCP easier to read and maintain, and to allow other parts of the kernel to use the windowed min/max filter code. Signed-off-by: Van JacobsonSigned-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/linux/tcp.h | 5 ++-- include/net/tcp.h| 2 +- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_input.c | 64 net/ipv4/tcp_minisocks.c | 2 +- 5 files changed, 10 insertions(+), 65 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c723a46..6433cc8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -19,6 +19,7 @@ #include +#include #include #include #include @@ -234,9 +235,7 @@ struct tcp_sock { u32 mdev_max_us;/* maximal mdev for the last rtt period */ u32 rttvar_us; /* smoothed mdev_max*/ u32 rtt_seq;/* sequence number to update rttvar */ - struct rtt_meas { - u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */ - } rtt_min[3]; + struct minmax rtt_min; u32 packets_out;/* Packets which are "in flight"*/ u32 retrans_out;/* Retransmitted packets out*/ diff --git a/include/net/tcp.h b/include/net/tcp.h index fdfbedd..2f1648a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -671,7 +671,7 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst) /* Minimum RTT in usec. ~0 means not available. */ static inline u32 tcp_min_rtt(const struct tcp_sock *tp) { - return tp->rtt_min[0].rtt; + return minmax_get(>rtt_min); } /* Compute the actual receive window we are currently advertising. 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index a13fcb3..5b0b49c 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -387,7 +387,7 @@ void tcp_init_sock(struct sock *sk) icsk->icsk_rto = TCP_TIMEOUT_INIT; tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT); - tp->rtt_min[0].rtt = ~0U; + minmax_reset(>rtt_min, tcp_time_stamp, ~0U); /* So many TCP implementations out there (incorrectly) count the * initial SYN frame in their delayed-ACK and congestion control diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 70b892d..ac5b38f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2879,67 +2879,13 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked, *rexmit = REXMIT_LOST; } -/* Kathleen Nichols' algorithm for tracking the minimum value of - * a data stream over some fixed time interval. (E.g., the minimum - * RTT over the past five minutes.) It uses constant space and constant - * time per update yet almost always delivers the same minimum as an - * implementation that has to keep all the data in the window. - * - * The algorithm keeps track of the best, 2nd best & 3rd best min - * values, maintaining an invariant that the measurement time of the - * n'th best >= n-1'th best. It also makes sure that the three values - * are widely separated in the time window since that bounds the worse - * case error when that data is monotonically increasing over the window. - * - * Upon getting a new min, we can forget everything earlier because it - * has no value - the new min is <= everything else in the window by - * definition and it's the most recent. So we restart fresh on every new min - * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd - * best. - */ static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us) { - const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ; - struct rtt_meas *m = tcp_sk(sk)->rtt_min; - struct rtt_meas rttm = { - .rtt = likely(rtt_us) ? 
rtt_us : jiffies_to_usecs(1), - .ts = now, - }; - u32 elapsed; - - /* Check if the new measurement updates the 1st, 2nd, or 3rd choices */ - if (unlikely(rttm.rtt <= m[0].rtt)) - m[0] = m[1] = m[2] = rttm; - else if (rttm.rtt <= m[1].rtt) - m[1] = m[2] = rttm; - else if (rttm.rtt <= m[2].rtt) - m[2] = rttm; - - elapsed = now - m[0].ts; - if (unlikely(elapsed > wlen)) { - /* Passed entire window without a new min so make 2nd choice -* the new min & 3rd choice the new 2nd. So forth and so on. -*/ - m[0] = m[1]; - m[1] = m[2]; - m[2] = rttm; - if
[PATCH v2 net-next 12/16] tcp: export tcp_mss_to_mtu() for congestion control modules
Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bf3d48..7d025a7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1362,6 +1362,7 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 	}
 	return mtu;
 }
+EXPORT_SYMBOL(tcp_mss_to_mtu);
 
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 02/16] lib/win_minmax: windowed min or max estimator
This commit introduces a generic library to estimate either the min or max value of a time-varying variable over a recent time window. This is code originally from Kathleen Nichols. The current form of the code is from Van Jacobson. A single struct minmax_sample will track the estimated windowed-max value of the series if you call minmax_running_max() or the estimated windowed-min value of the series if you call minmax_running_min(). Nearly equivalent code is already in place for minimum RTT estimation in the TCP stack. This commit extracts that code and generalizes it to handle both min and max. Moving the code here reduces the footprint and complexity of the TCP code base and makes the filter generally available for other parts of the codebase, including an upcoming TCP congestion control module. This library works well for time series where the measurements are smoothly increasing or decreasing. Signed-off-by: Van JacobsonSigned-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/linux/win_minmax.h | 37 + lib/Makefile | 2 +- lib/win_minmax.c | 98 ++ 3 files changed, 136 insertions(+), 1 deletion(-) create mode 100644 include/linux/win_minmax.h create mode 100644 lib/win_minmax.c diff --git a/include/linux/win_minmax.h b/include/linux/win_minmax.h new file mode 100644 index 000..5656960 --- /dev/null +++ b/include/linux/win_minmax.h @@ -0,0 +1,37 @@ +/** + * lib/minmax.c: windowed min/max tracker by Kathleen Nichols. 
+ * + */ +#ifndef MINMAX_H +#define MINMAX_H + +#include + +/* A single data point for our parameterized min-max tracker */ +struct minmax_sample { + u32 t; /* time measurement was taken */ + u32 v; /* value measured */ +}; + +/* State for the parameterized min-max tracker */ +struct minmax { + struct minmax_sample s[3]; +}; + +static inline u32 minmax_get(const struct minmax *m) +{ + return m->s[0].v; +} + +static inline u32 minmax_reset(struct minmax *m, u32 t, u32 meas) +{ + struct minmax_sample val = { .t = t, .v = meas }; + + m->s[2] = m->s[1] = m->s[0] = val; + return m->s[0].v; +} + +u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas); +u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas); + +#endif diff --git a/lib/Makefile b/lib/Makefile index 5dc77a8..df747e5 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -22,7 +22,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ sha1.o chacha20.o md5.o irq_regs.o argv_split.o \ flex_proportions.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o kobject_uevent.o \ -earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o +earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/win_minmax.c b/lib/win_minmax.c new file mode 100644 index 000..c8420d4 --- /dev/null +++ b/lib/win_minmax.c @@ -0,0 +1,98 @@ +/** + * lib/minmax.c: windowed min/max tracker + * + * Kathleen Nichols' algorithm for tracking the minimum (or maximum) + * value of a data stream over some fixed time interval. (E.g., + * the minimum RTT over the past five minutes.) It uses constant + * space and constant time per update yet almost always delivers + * the same minimum as an implementation that has to keep all the + * data in the window. + * + * The algorithm keeps track of the best, 2nd best & 3rd best min + * values, maintaining an invariant that the measurement time of + * the n'th best >= n-1'th best. 
It also makes sure that the three + * values are widely separated in the time window since that bounds + * the worse case error when that data is monotonically increasing + * over the window. + * + * Upon getting a new min, we can forget everything earlier because + * it has no value - the new min is <= everything else in the window + * by definition and it's the most recent. So we restart fresh on + * every new min and overwrites 2nd & 3rd choices. The same property + * holds for 2nd & 3rd best. + */ +#include +#include + +/* As time advances, update the 1st, 2nd, and 3rd choices. */ +static u32 minmax_subwin_update(struct minmax *m, u32 win, + const struct minmax_sample *val) +{ + u32 dt = val->t - m->s[0].t; + + if (unlikely(dt > win)) { + /* +* Passed entire window without a new val so make 2nd +* choice the new val & 3rd choice the new 2nd choice. +* we may have to iterate this since our 2nd choice +* may also be outside the window (we checked on entry +* that the third
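For readers who want to experiment with the filter outside the kernel, here is a userspace sketch of the windowed-min variant, written from the algorithm described in the header comment above. It is an adaptation, not a quotation of the patch: the `wmin_*` names are hypothetical, and the sub-window thresholds (a quarter and a half of the window) are an assumption about how the three samples are kept spread out.

```c
#include <stdint.h>

/* One (time, value) measurement. */
struct wmin_sample {
	uint32_t t;
	uint32_t v;
};

/* Best, 2nd best and 3rd best minimum within the window. */
struct wmin {
	struct wmin_sample s[3];
};

/* Forget everything and restart from a single measurement. */
uint32_t wmin_reset(struct wmin *m, uint32_t t, uint32_t meas)
{
	struct wmin_sample val = { t, meas };

	m->s[0] = m->s[1] = m->s[2] = val;
	return m->s[0].v;
}

/* As time advances, age out or spread the three choices. */
static uint32_t wmin_subwin_update(struct wmin *m, uint32_t win,
				   const struct wmin_sample *val)
{
	uint32_t dt = val->t - m->s[0].t;

	if (dt > win) {
		/* Best choice left the window: promote 2nd and 3rd. */
		m->s[0] = m->s[1];
		m->s[1] = m->s[2];
		m->s[2] = *val;
		if (val->t - m->s[0].t > win) {
			/* The promoted choice is stale too: shift again. */
			m->s[0] = m->s[1];
			m->s[1] = m->s[2];
			m->s[2] = *val;
		}
	} else if (m->s[1].t == m->s[0].t && dt > win / 4) {
		/* A quarter window passed without a distinct 2nd choice. */
		m->s[2] = m->s[1] = *val;
	} else if (m->s[2].t == m->s[1].t && dt > win / 2) {
		/* Half a window passed without a distinct 3rd choice. */
		m->s[2] = *val;
	}
	return m->s[0].v;
}

/* Feed one measurement; returns the current windowed minimum. */
uint32_t wmin_update(struct wmin *m, uint32_t win, uint32_t t, uint32_t meas)
{
	struct wmin_sample val = { t, meas };

	if (val.v <= m->s[0].v ||	/* found a new minimum? */
	    val.t - m->s[2].t > win)	/* nothing left in the window? */
		return wmin_reset(m, t, meas);

	if (val.v <= m->s[1].v)
		m->s[2] = m->s[1] = val;
	else if (val.v <= m->s[2].v)
		m->s[2] = val;

	return wmin_subwin_update(m, win, &val);
}
```

A caller would typically start with `wmin_reset(&m, now, UINT32_MAX)` and then feed each measurement through `wmin_update()`. The max variant would be the same code with the comparisons reversed.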
[PATCH v2 net-next 13/16] tcp: allow congestion control to expand send buffer differently
From: Yuchung Cheng

Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specify this callback, TCP continues to
use the previous default of 2.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h    | 2 ++
 net/ipv4/tcp_input.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3492041..1aa9628 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -917,6 +917,8 @@ struct tcp_congestion_ops {
 	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
 	/* suggest number of segments for each skb to transmit (optional) */
 	u32 (*tso_segs_goal)(struct sock *sk);
+	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
+	u32 (*sndbuf_expand)(struct sock *sk);
 	/* get info for inet_diag (optional) */
 	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
 			   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df26af0..a134e66 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -289,6 +289,7 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
 static void tcp_sndbuf_expand(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
+	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
 	int sndmem, per_mss;
 	u32 nr_segs;
 
@@ -309,7 +310,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 	 * Cubic needs 1.7 factor, rounded to 2 to include
 	 * extra cushion (application might react slowly to POLLOUT)
 	 */
-	sndmem = 2 * nr_segs * per_mss;
+	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
+	sndmem *= nr_segs * per_mss;
 
 	if (sk->sk_sndbuf < sndmem)
 		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-- 
2.8.0.rc3.226.g39d4020
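The optional-callback pattern used here can be shown in isolation. The following is a minimal userspace sketch, assuming a simplified ops struct with no socket argument; `cc_ops_sketch`, `sndbuf_target()` and `expand_by_three()` are hypothetical names, and the multiplier 3 is an arbitrary illustration rather than a value taken from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the congestion control ops. */
struct cc_ops_sketch {
	uint32_t (*sndbuf_expand)(void); /* optional; NULL means default */
};

/* Target send buffer size in bytes, capped at wmem_max. */
uint32_t sndbuf_target(const struct cc_ops_sketch *ops, uint32_t nr_segs,
		       uint32_t per_mss, uint32_t wmem_max)
{
	/* Fall back to the historical multiplier of 2 when unset. */
	uint32_t mult = ops->sndbuf_expand ? ops->sndbuf_expand() : 2;
	uint64_t sndmem = (uint64_t)mult * nr_segs * per_mss;

	return sndmem < wmem_max ? (uint32_t)sndmem : wmem_max;
}

/* Example callback for a module that wants more headroom. */
uint32_t expand_by_three(void)
{
	return 3;
}
```

Leaving the pointer NULL preserves the existing behavior, which is why modules that do not care about the multiplier need no changes.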
[PATCH v2 net-next 09/16] tcp: export data delivery rate
From: Yuchung Cheng

This commit exports two new fields in struct tcp_info:

  tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

This delivery rate information can be useful for applications that
want to know the current throughput the TCP connection is seeing,
e.g. adaptive bitrate video streaming. It can also be very useful for
debugging or troubleshooting.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
--- include/linux/tcp.h | 5 - include/uapi/linux/tcp.h | 3 +++ net/ipv4/tcp.c | 11 ++- net/ipv4/tcp_rate.c | 12 +++- 4 files changed, 28 insertions(+), 3 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index fdcd00f..a17ae7b 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -213,7 +213,8 @@ struct tcp_sock { u8 reord;/* reordering detected */ } rack; u16 advmss; /* Advertised MSS */ - u8 unused; + u8 rate_app_limited:1, /* rate_{delivered,interval_us} limited? */ + unused:7; u8 nonagle : 4,/* Disable Nagle algorithm?
*/ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ @@ -271,6 +272,8 @@ struct tcp_sock { u32 app_limited;/* limited until "delivered" reaches this val */ struct skb_mstamp first_tx_mstamp; /* start of window send phase */ struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */ + u32 rate_delivered;/* saved rate sample: packets delivered */ + u32 rate_interval_us; /* saved rate sample: time elapsed */ u32 rcv_wnd;/* Current receiver window */ u32 write_seq; /* Tail(+1) of data held in tcp send buffer */ diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 482898f..73ac0db 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -167,6 +167,7 @@ struct tcp_info { __u8tcpi_backoff; __u8tcpi_options; __u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4; + __u8tcpi_delivery_rate_app_limited:1; __u32 tcpi_rto; __u32 tcpi_ato; @@ -211,6 +212,8 @@ struct tcp_info { __u32 tcpi_min_rtt; __u32 tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */ __u32 tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */ + + __u64 tcpi_delivery_rate; }; /* for TCP_MD5SIG socket option */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0327a44..46b05b2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2695,7 +2695,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) { const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */ const struct inet_connection_sock *icsk = inet_csk(sk); - u32 now = tcp_time_stamp; + u32 now = tcp_time_stamp, intv; unsigned int start; int notsent_bytes; u64 rate64; @@ -2785,6 +2785,15 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) info->tcpi_min_rtt = tcp_min_rtt(tp); info->tcpi_data_segs_in = tp->data_segs_in; info->tcpi_data_segs_out = tp->data_segs_out; + + info->tcpi_delivery_rate_app_limited = tp->rate_app_limited ? 
1 : 0; + rate = READ_ONCE(tp->rate_delivered); + intv = READ_ONCE(tp->rate_interval_us); + if (rate && intv) { + rate64 = (u64)rate * tp->mss_cache * USEC_PER_SEC; + do_div(rate64, intv); + put_unaligned(rate64, >tcpi_delivery_rate); + } } EXPORT_SYMBOL_GPL(tcp_get_info); diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c index 52ff84b..9be1581 100644 --- a/net/ipv4/tcp_rate.c +++ b/net/ipv4/tcp_rate.c @@ -149,12 +149,22 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost, * for connections suffer heavy or prolonged losses. */ if (unlikely(rs->interval_us < tcp_min_rtt(tp))) { - rs->interval_us = -1; if (!rs->is_retrans) pr_debug("tcp rate: %ld %d %u %u %u\n", rs->interval_us, rs->delivered,
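The unit conversion performed in tcp_get_info() above, from a saved (packets delivered, interval in microseconds) sample to bytes per second, can be checked with a standalone sketch. The function name is hypothetical, and the zero result for an empty sample is a choice made for this sketch (the patch simply leaves the field unwritten in that case).

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Convert a saved rate sample to bytes per second. */
uint64_t delivery_rate_bytes_per_sec(uint32_t delivered_pkts,
				     uint32_t interval_us, uint32_t mss)
{
	if (!delivered_pkts || !interval_us)
		return 0;	/* no valid sample yet */
	return (uint64_t)delivered_pkts * mss * USEC_PER_SEC / interval_us;
}
```

For example, 100 packets of 1448 bytes delivered over 10000 us correspond to 14480000 bytes per second.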
[PATCH v2 net-next 15/16] tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
The TCP CUBIC module already uses 64 bytes.
The upcoming TCP BBR module uses 88 bytes.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/inet_connection_sock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 49dcad4..197a30d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -134,8 +134,8 @@ struct inet_connection_sock {
 	} icsk_mtup;
 	u32			  icsk_user_timeout;
 
-	u64			  icsk_ca_priv[64 / sizeof(u64)];
-#define ICSK_CA_PRIV_SIZE      (8 * sizeof(u64))
+	u64			  icsk_ca_priv[88 / sizeof(u64)];
+#define ICSK_CA_PRIV_SIZE      (11 * sizeof(u64))
 };
 
 #define ICSK_TIME_RETRANS	1	/* Retransmit timer */
-- 
2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 11/16] tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:

1) Export tcp_tso_autosize() so that CC modules can use it.

2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h     | 2 ++
 net/ipv4/tcp_output.c | 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f8f581f..3492041 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -533,6 +533,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
 #endif
 /* tcp_output.c */
 
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+		     int min_tso_segs);
 void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 			       int nonagle);
 bool tcp_may_send_now(struct sock *sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0137956..0bf3d48 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1549,7 +1549,8 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 /* Return how many segs we'd like on a TSO packet,
  * to send one TSO packet per ms
  */
-static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+		     int min_tso_segs)
 {
 	u32 bytes, segs;
 
@@ -1561,10 +1562,11 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
 	 * This preserves ACK clocking and is consistent
 	 * with tcp_tso_should_defer() heuristic.
 	 */
-	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+	segs = max_t(u32, bytes / mss_now, min_tso_segs);
 
 	return min_t(u32, segs, sk->sk_gso_max_segs);
 }
+EXPORT_SYMBOL(tcp_tso_autosize);
 
 /* Return the number of segments we want in the skb we are transmitting.
  * See if congestion control module wants to decide; otherwise, autosize.
@@ -1574,7 +1576,8 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
 	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
 	u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
 
-	return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+	return tso_segs ? :
+		tcp_tso_autosize(sk, mss_now, sysctl_tcp_min_tso_segs);
 }
 
 /* Returns the portion of skb which can be sent right away */
-- 
2.8.0.rc3.226.g39d4020
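The clamping that the new min_tso_segs parameter feeds into can be sketched on its own. This is a userspace illustration with a hypothetical function name; how the kernel derives the per-millisecond byte budget is not shown in the diff above, so the sketch takes it as an input, and it assumes mss_now is non-zero.

```c
#include <stdint.h>

/*
 * Segments to put in one TSO packet: the per-ms byte budget divided by
 * the MSS, raised to the caller's floor and capped at the device limit.
 */
uint32_t tso_autosize_sketch(uint32_t bytes_per_ms, uint32_t mss_now,
			     uint32_t min_tso_segs, uint32_t gso_max_segs)
{
	uint32_t segs = bytes_per_ms / mss_now;	/* assumes mss_now > 0 */

	if (segs < min_tso_segs)
		segs = min_tso_segs;
	return segs < gso_max_segs ? segs : gso_max_segs;
}
```

A congestion control module on a low-rate or policed path could pass a smaller floor than the sysctl default, which is the flexibility item 2) describes.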
[PATCH v2 net-next 16/16] tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR (Bottleneck Bandwidth and RTT). A detailed description of BBR will be published in ACM Queue, Vol. 14 No. 5, September-October 2016, as "BBR: Congestion-Based Congestion Control". BBR has significantly increased throughput and reduced latency for connections on Google's internal backbone networks and google.com and YouTube Web servers. BBR requires only changes on the sender side, not in the network or the receiver side. Thus it can be incrementally deployed on today's Internet, or in datacenters. The Internet has predominantly used loss-based congestion control (largely Reno or CUBIC) since the 1980s, relying on packet loss as the signal to slow down. While this worked well for many years, loss-based congestion control is unfortunately out-dated in today's networks. On today's Internet, loss-based congestion control causes the infamous bufferbloat problem, often causing seconds of needless queuing delay, since it fills the bloated buffers in many last-mile links. On today's high-speed long-haul links using commodity switches with shallow buffers, loss-based congestion control has abysmal throughput because it over-reacts to losses caused by transient traffic bursts. In 1981 Kleinrock and Gale showed that the optimal operating point for a network maximizes delivered bandwidth while minimizing delay and loss, not only for single connections but for the network as a whole. Finding that optimal operating point has been elusive, since any single network measurement is ambiguous: network measurements are the result of both bandwidth and propagation delay, and those two cannot be measured simultaneously. While it is impossible to disambiguate any single bandwidth or RTT measurement, a connection's behavior over time tells a clearer story. BBR uses a measurement strategy designed to resolve this ambiguity. 
It combines these measurements with a robust servo loop using recent control systems advances to implement a distributed congestion control algorithm that reacts to actual congestion, not packet loss or transient queue delay, and is designed to converge with high probability to a point near the optimal operating point. In a nutshell, BBR creates an explicit model of the network pipe by sequentially probing the bottleneck bandwidth and RTT. On the arrival of each ACK, BBR derives the current delivery rate of the last round trip, and feeds it through a windowed max-filter to estimate the bottleneck bandwidth. Conversely it uses a windowed min-filter to estimate the round trip propagation delay. The max-filtered bandwidth and min-filtered RTT estimates form BBR's model of the network pipe. Using its model, BBR sets control parameters to govern sending behavior. The primary control is the pacing rate: BBR applies a gain multiplier to transmit faster or slower than the observed bottleneck bandwidth. The conventional congestion window (cwnd) is now the secondary control; the cwnd is set to a small multiple of the estimated BDP (bandwidth-delay product) in order to allow full utilization and bandwidth probing while bounding the potential amount of queue at the bottleneck. When a BBR connection starts, it enters STARTUP mode and applies a high gain to perform an exponential search to quickly probe the bottleneck bandwidth (doubling its sending rate each round trip, like slow start). However, instead of continuing until it fills up the buffer (i.e. a loss), or until delay or ACK spacing reaches some threshold (like Hystart), it uses its model of the pipe to estimate when that pipe is full: it estimates the pipe is full when it notices the estimated bandwidth has stopped growing. At that point it exits STARTUP and enters DRAIN mode, where it reduces its pacing rate to drain the queue it estimates it has created. Then BBR enters steady state. 
In steady state, PROBE_BW mode cycles between first pacing faster to probe for more bandwidth, then pacing slower to drain any queue that was created if no more bandwidth was available, and then cruising at the estimated bandwidth to utilize the pipe without creating excess queue. Occasionally, on an as-needed basis, it sends significantly slower to probe for RTT (PROBE_RTT mode). Our long-term goal is to improve the congestion control algorithms used on the Internet. We are hopeful that BBR can help advance the efforts toward this goal, and motivate the community to do further research. Test results, performance evaluations, feedback, and BBR-related discussions are very welcome in the public e-mail list for BBR: https://groups.google.com/forum/#!forum/bbr-dev Signed-off-by: Van Jacobson Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/uapi/linux/inet_diag.h | 13
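The pipe model described in this commit message (max-filtered bandwidth, min-filtered RTT, and a cwnd set to a small multiple of the BDP) can be sketched in plain user-space C. This is only an illustration under stated assumptions: it uses a naive fixed-size sample buffer rather than the lib/win_minmax estimator from this series, it uses one window length for both filters, and the names `pipe_model`, `WIN`, `model_add_sample` and `apply_gain` are hypothetical and do not come from tcp_bbr.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIN 4	/* hypothetical window length, in samples */

/* Naive windowed filters: keep the last WIN samples and scan them. */
struct pipe_model {
	uint64_t bw[WIN];	/* delivery-rate samples, bytes per second */
	uint32_t rtt_us[WIN];	/* RTT samples, microseconds */
	size_t next;		/* slot the next sample overwrites */
	size_t count;		/* number of valid samples, at most WIN */
};

static void model_add_sample(struct pipe_model *m, uint64_t bw, uint32_t rtt_us)
{
	m->bw[m->next] = bw;
	m->rtt_us[m->next] = rtt_us;
	m->next = (m->next + 1) % WIN;
	if (m->count < WIN)
		m->count++;
}

/* Bottleneck bandwidth estimate: maximum over the window. */
static uint64_t model_max_bw(const struct pipe_model *m)
{
	uint64_t best = 0;

	for (size_t i = 0; i < m->count; i++)
		if (m->bw[i] > best)
			best = m->bw[i];
	return best;
}

/* Propagation delay estimate: minimum over the window. */
static uint32_t model_min_rtt_us(const struct pipe_model *m)
{
	uint32_t best = UINT32_MAX;

	for (size_t i = 0; i < m->count; i++)
		if (m->rtt_us[i] < best)
			best = m->rtt_us[i];
	return best;
}

/* BDP in bytes = bandwidth (bytes/s) * RTT (us) / 1e6. */
static uint64_t model_bdp_bytes(const struct pipe_model *m)
{
	assert(m->count > 0);
	return model_max_bw(m) * model_min_rtt_us(m) / 1000000u;
}

/* Apply a gain expressed in units of 1/256 (so 256 means 1.0). */
static uint64_t apply_gain(uint64_t value, uint32_t gain_x256)
{
	return value * gain_x256 / 256;
}
```

With this sketch, a pacing rate would be `apply_gain(model_max_bw(&m), gain)` and a cwnd target `apply_gain(model_bdp_bytes(&m), gain)`; the gain values themselves are not given in the message above and are left as parameters.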
[PATCH v2 net-next 06/16] tcp: count packets marked lost for a TCP connection
Count the number of packets that a TCP connection marks lost. Congestion control modules can use this loss rate information for more intelligent decisions about how fast to send. Specifically, this is used in TCP BBR policer detection. BBR uses a high packet loss rate as one signal in its policer detection and policer bandwidth estimation algorithm. The BBR policer detection algorithm cannot simply track retransmits, because a retransmit can be (and often is) an indicator of packets lost long, long ago. This is particularly true in a long CA_Loss period that repairs the initial massive losses when a policer kicks in. Signed-off-by: Van JacobsonSigned-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/linux/tcp.h | 1 + net/ipv4/tcp_input.c | 25 - 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 6433cc8..38590fb 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -267,6 +267,7 @@ struct tcp_sock { * receiver in Recovery. */ u32 prr_out;/* Total number of pkts sent during Recovery. */ u32 delivered; /* Total data packets delivered incl. rexmits */ + u32 lost; /* Total data packets lost incl. rexmits */ u32 rcv_wnd;/* Current receiver window */ u32 write_seq; /* Tail(+1) of data held in tcp send buffer */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ac5b38f..024b579 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -899,12 +899,29 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb) tp->retransmit_high = TCP_SKB_CB(skb)->end_seq; } +/* Sum the number of packets on the wire we have marked as lost. + * There are two cases we care about here: + * a) Packet hasn't been marked lost (nor retransmitted), + *and this is the first loss. 
+ * b) Packet has been marked both lost and retransmitted, + *and this means we think it was lost again. + */ +static void tcp_sum_lost(struct tcp_sock *tp, struct sk_buff *skb) +{ + __u8 sacked = TCP_SKB_CB(skb)->sacked; + + if (!(sacked & TCPCB_LOST) || + ((sacked & TCPCB_LOST) && (sacked & TCPCB_SACKED_RETRANS))) + tp->lost += tcp_skb_pcount(skb); +} + static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb) { if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) { tcp_verify_retransmit_hint(tp, skb); tp->lost_out += tcp_skb_pcount(skb); + tcp_sum_lost(tp, skb); TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; } } @@ -913,6 +930,7 @@ void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb) { tcp_verify_retransmit_hint(tp, skb); + tcp_sum_lost(tp, skb); if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) { tp->lost_out += tcp_skb_pcount(skb); TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; @@ -1890,6 +1908,7 @@ void tcp_enter_loss(struct sock *sk) struct sk_buff *skb; bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery; bool is_reneg; /* is receiver reneging on SACKs? */ + bool mark_lost; /* Reduce ssthresh if it has not yet been made inside this window. */ if (icsk->icsk_ca_state <= TCP_CA_Disorder || @@ -1923,8 +1942,12 @@ void tcp_enter_loss(struct sock *sk) if (skb == tcp_send_head(sk)) break; + mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) || +is_reneg); + if (mark_lost) + tcp_sum_lost(tp, skb); TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED; - if (!(TCP_SKB_CB(skb)->sacked_SACKED_ACKED) || is_reneg) { + if (mark_lost) { TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED; TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; tp->lost_out += tcp_skb_pcount(skb); -- 2.8.0.rc3.226.g39d4020
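For readers following the two cases in the tcp_sum_lost() comment, here is a small stand-alone sketch of the same predicate. The flag constants and the `conn` struct are stand-ins chosen for illustration, not the kernel definitions. It also shows that the two-clause test in the patch is logically the same as "not yet marked lost, or already retransmitted".

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the TCPCB_* bits; values chosen for the sketch. */
#define F_SACKED_ACKED   0x01
#define F_SACKED_RETRANS 0x02
#define F_LOST           0x04

struct conn {
	uint32_t lost;	/* running total of packets marked lost */
};

/* The condition as written in the patch: case (a) or case (b). */
static int loss_counts_as_written(uint8_t sacked)
{
	return !(sacked & F_LOST) ||
	       ((sacked & F_LOST) && (sacked & F_SACKED_RETRANS));
}

/* Equivalent shorter form: A || (!A && B) is the same as A || B. */
static int loss_counts_simplified(uint8_t sacked)
{
	return !(sacked & F_LOST) || (sacked & F_SACKED_RETRANS);
}

/* Add pcount packets to the total when the mark represents a new loss. */
static void sum_lost(struct conn *c, uint8_t sacked, uint32_t pcount)
{
	assert(c);
	if (loss_counts_as_written(sacked))
		c->lost += pcount;
}
```

A packet that is already marked lost but has not been retransmitted is the only combination that is not counted, which avoids counting the same loss twice.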
[PATCH v2 net-next 08/16] tcp: track application-limited rate samples
From: Soheil Hassas YeganehThis commit adds code to track whether the delivery rate represented by each rate_sample was limited by the application. Upon each transmit, we store in the is_app_limited field in the skb a boolean bit indicating whether there is a known "bubble in the pipe": a point in the rate sample interval where the sender was application-limited, and did not transmit even though the cwnd and pacing rate allowed it. This logic marks the flow app-limited on a write if *all* of the following are true: 1) There is less than 1 MSS of unsent data in the write queue available to transmit. 2) There is no packet in the sender's queues (e.g. in fq or the NIC tx queue). 3) The connection is not limited by cwnd. 4) There are no lost packets to retransmit. The tcp_rate_check_app_limited() code in tcp_rate.c determines whether the connection is application-limited at the moment. If the flow is application-limited, it sets the tp->app_limited field. If the flow is application-limited then that means there is effectively a "bubble" of silence in the pipe now, and this silence will be reflected in a lower bandwidth sample for any rate samples from now until we get an ACK indicating this bubble has exited the pipe: specifically, until we get an ACK for the next packet we transmit. When we send every skb we record in scb->tx.is_app_limited whether the resulting rate sample will be application-limited. The code in tcp_rate_gen() checks to see when it is safe to mark all known application-limited bubbles of silence as having exited the pipe. It does this by checking to see when the delivered count moves past the tp->app_limited marker. At this point it zeroes the tp->app_limited marker, as all known bubbles are out of the pipe. We make room for the tx.is_app_limited bit in the skb by borrowing a bit from the in_flight field used by NV to record the number of bytes in flight. 
The receive window in the TCP header is 16 bits, and the max receive window scaling shift factor is 14 (RFC 1323). So the max receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we only need 30 bits for the tx.in_flight used by NV. Signed-off-by: Van Jacobson Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Nandita Dukkipati Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/linux/tcp.h | 1 + include/net/tcp.h| 6 +- net/ipv4/tcp.c | 8 net/ipv4/tcp_minisocks.c | 3 +++ net/ipv4/tcp_rate.c | 29 - 5 files changed, 45 insertions(+), 2 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c50e6ae..fdcd00f 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -268,6 +268,7 @@ struct tcp_sock { u32 prr_out;/* Total number of pkts sent during Recovery. */ u32 delivered; /* Total data packets delivered incl. rexmits */ u32 lost; /* Total data packets lost incl. rexmits */ + u32 app_limited;/* limited until "delivered" reaches this val */ struct skb_mstamp first_tx_mstamp; /* start of window send phase */ struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */ diff --git a/include/net/tcp.h b/include/net/tcp.h index b261c89..a69ed7f 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -764,7 +764,9 @@ struct tcp_skb_cb { union { struct { /* There is space for up to 24 bytes */ - __u32 in_flight;/* Bytes in flight when packet sent */ + __u32 in_flight:30,/* Bytes in flight at transmit */ + is_app_limited:1, /* cwnd not fully used? */ + unused:1; /* pkts S/ACKed so far upon tx of skb, incl retrans: */ __u32 delivered; /* start of send pipeline phase */ @@ -883,6 +885,7 @@ struct rate_sample { int losses;/* number of packets marked lost upon ACK */ u32 acked_sacked; /* number of packets newly (S)ACKed upon ACK */ u32 prior_in_flight; /* in flight before this ACK */ + bool is_app_limited;/* is sample from packet with bubble in pipe? 
*/ bool is_retrans;/* is sample from retransmission? */ }; @@ -978,6 +981,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb, struct rate_sample *rs); void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost, struct skb_mstamp *now, struct rate_sample *rs); +void tcp_rate_check_app_limited(struct sock *sk); /* These functions determine how the current flow behaves in respect of SACK * handling. SACK is
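The 30-bit argument in the commit message can be checked with a few lines of user-space C. This is a sketch, not the kernel struct: `tx_info`, `max_scaled_window` and `make_tx_info` are hypothetical names, and the layout only mirrors the idea of splitting one 32-bit word into a 30-bit byte count plus flag bits.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RAW_WINDOW 65535u	/* largest value of the 16-bit window field */
#define MAX_WSCALE     14u	/* largest window scale shift */

/* One 32-bit word: 30 bits of byte count, one flag bit, one spare bit. */
struct tx_info {
	unsigned int in_flight:30;
	unsigned int is_app_limited:1;
	unsigned int unused:1;
};

/* Largest receive window the protocol can advertise, in bytes. */
static uint32_t max_scaled_window(void)
{
	return MAX_RAW_WINDOW << MAX_WSCALE;
}

static int fits_in_30_bits(uint32_t v)
{
	return v < (1u << 30);
}

/* Build a tx_info, refusing byte counts that would be truncated. */
static struct tx_info make_tx_info(uint32_t in_flight, int app_limited)
{
	struct tx_info t;

	assert(fits_in_30_bits(in_flight));
	t.in_flight = in_flight;
	t.is_app_limited = app_limited ? 1 : 0;
	t.unused = 0;
	return t;
}
```

The largest scaled window is 65535 * 2^14 = 1073725440, which is just below 2^30 = 1073741824, so giving up two bits of the field does not truncate any value bounded by the receive window.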
[PATCH v2 net-next 01/16] tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
From: Soheil Hassas YeganehThe upcoming change "lib/win_minmax: windowed min or max estimator" introduces a struct called minmax, which is then included in include/linux/tcp.h in the upcoming change "tcp: use windowed min filter library for TCP min_rtt estimation". This would create a compilation error for tcp_cdg.c, which defines its own minmax struct. To avoid this naming conflict (and potentially others in the future), this commit renames the version used in tcp_cdg.c to cdg_minmax. Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Cc: Kenneth Klette Jonassen --- net/ipv4/tcp_cdg.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c index 03725b2..35b2803 100644 --- a/net/ipv4/tcp_cdg.c +++ b/net/ipv4/tcp_cdg.c @@ -56,7 +56,7 @@ MODULE_PARM_DESC(use_shadow, "use shadow window heuristic"); module_param(use_tolerance, bool, 0644); MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic"); -struct minmax { +struct cdg_minmax { union { struct { s32 min; @@ -74,10 +74,10 @@ enum cdg_state { }; struct cdg { - struct minmax rtt; - struct minmax rtt_prev; - struct minmax *gradients; - struct minmax gsum; + struct cdg_minmax rtt; + struct cdg_minmax rtt_prev; + struct cdg_minmax *gradients; + struct cdg_minmax gsum; bool gfilled; u8 tail; u8 state; @@ -353,7 +353,7 @@ static void tcp_cdg_cwnd_event(struct sock *sk, const enum tcp_ca_event ev) { struct cdg *ca = inet_csk_ca(sk); struct tcp_sock *tp = tcp_sk(sk); - struct minmax *gradients; + struct cdg_minmax *gradients; switch (ev) { case CA_EVENT_CWND_RESTART: -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 00/16] tcp: BBR congestion control algorithm
tcp: BBR congestion control algorithm This patch series implements a new TCP congestion control algorithm: BBR (Bottleneck Bandwidth and RTT). A paper with a detailed description of BBR will be published in ACM Queue, September-October 2016, as "BBR: Congestion-Based Congestion Control". BBR is widely deployed in production at Google. The patch series starts with a set of supporting infrastructure changes, including a few that extend the congestion control framework. The last patch adds BBR as a TCP congestion control module. Please see individual patches for the details. - v1 -> v2: fix issues caught by build bots: - fix "tcp: export data delivery rate" to use rate64 instead of rate, so there is a 64-bit numerator for the do_div call - fix conflicting definitions for minmax caused by "tcp: use windowed min filter library for TCP min_rtt estimation" with a new commit: tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict - fix warning about the use of __packed in "tcp: track data delivery rate for a TCP connection", which involves the addition of a new commit: tcp: switch back to proper tcp_skb_cb size check in tcp_init() Eric Dumazet (2): net_sched: sch_fq: add low_rate_threshold parameter tcp: switch back to proper tcp_skb_cb size check in tcp_init() Neal Cardwell (8): lib/win_minmax: windowed min or max estimator tcp: use windowed min filter library for TCP min_rtt estimation tcp: count packets marked lost for a TCP connection tcp: allow congestion control module to request TSO skb segment count tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments tcp: export tcp_mss_to_mtu() for congestion control modules tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88 tcp_bbr: add BBR congestion control Soheil Hassas Yeganeh (2): tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict tcp: track application-limited rate samples Yuchung Cheng (4): tcp: track data delivery rate for a TCP connection tcp: export data 
delivery rate tcp: allow congestion control to expand send buffer differently tcp: new CC hook to set sending rate with rate_sample in any CA state include/linux/tcp.h| 14 +- include/linux/win_minmax.h | 37 ++ include/net/inet_connection_sock.h | 4 +- include/net/tcp.h | 53 ++- include/uapi/linux/inet_diag.h | 13 + include/uapi/linux/pkt_sched.h | 2 + include/uapi/linux/tcp.h | 3 + lib/Makefile | 2 +- lib/win_minmax.c | 98 + net/ipv4/Kconfig | 18 + net/ipv4/Makefile | 3 +- net/ipv4/tcp.c | 26 +- net/ipv4/tcp_bbr.c | 875 + net/ipv4/tcp_cdg.c | 12 +- net/ipv4/tcp_cong.c| 2 +- net/ipv4/tcp_input.c | 154 +++ net/ipv4/tcp_minisocks.c | 5 +- net/ipv4/tcp_output.c | 27 +- net/ipv4/tcp_rate.c| 186 net/sched/sch_fq.c | 22 +- 20 files changed, 1449 insertions(+), 107 deletions(-) create mode 100644 include/linux/win_minmax.h create mode 100644 lib/win_minmax.c create mode 100644 net/ipv4/tcp_bbr.c create mode 100644 net/ipv4/tcp_rate.c -- 2.8.0.rc3.226.g39d4020
Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support
On 09/14/2016 07:03 AM, LABBE Corentin wrote: > On Mon, Sep 12, 2016 at 10:44:51PM +0200, Maxime Ripard wrote: >>> +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev) >>> +{ >>> + struct net_device *ndev = platform_get_drvdata(pdev); >>> + struct sun8i_emac_priv *priv = netdev_priv(ndev); >>> + >>> + phy_start(ndev->phydev); >>> + >>> + sun8i_emac_start_tx(ndev); >>> + sun8i_emac_start_rx(ndev); >>> + >>> + if (netif_running(ndev)) >>> + netif_device_attach(ndev); >>> + >>> + netif_start_queue(ndev); >>> + >>> + napi_enable(>napi); >>> + >>> + return 0; >>> +} >> >> The main idea behind the runtime PM hooks is that they bring the >> device to a working state and shuts it down when it's not needed >> anymore. >> > > I expect that the first part (all pm_runtime_xxx) of the patch bring that. > When the interface is not opened: > cat /sys/devices/platform/soc/1c3.ethernet/power/runtime_status > suspended If your interface is not open, it should be in a low power state, only when it gets open (which means it is used) should you make it functional, that's pretty much the same thing as the runtime PM reference count usage here. I don't see a lot of value for using runtime_pm_* hooks here except calling into the existing suspend/resume functions that you have defined already, but then again, the code should be modular enough already in the driver. Runtime PM for network devices cannot be used as efficiently as you would with any kind of host-initiated bus/controller because your device needs to be able to receive packets without the host's ability to wake up the device to receive packets, so, with the exception of MDIO (which is host initiated), everything else besides except packet transmission (then again, I would not want to wait N ms to bring the interface in a state where it can now transmit packets, that's terrible for latency) is pretty much impossible to fully suspend due to its asynchronous nature. -- Florian
[PATCH net-next v3 0/3] net: ethernet: mediatek: add HW LRO functions
The series add the large receive offload (LRO) functions by hardware and the ethtool functions to configure RX flows of HW LRO. changes since v3: - Respin the patch by the newer driver - Move the dts description of hwlro to optional properties changes since v2: - Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is programmed - Rephrase the dts property is a capability if the hardware supports LRO changes since v1: - Add HW LRO support - Add ethtool hooks to set LRO RX flows Nelson Chang (3): net: ethernet: mediatek: add HW LRO functions of PDMA RX rings net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO net: ethernet: mediatek: add dts configuration to enable HW LRO .../devicetree/bindings/net/mediatek-net.txt | 2 + drivers/net/ethernet/mediatek/mtk_eth_soc.c| 433 +++-- drivers/net/ethernet/mediatek/mtk_eth_soc.h| 75 +++- 3 files changed, 485 insertions(+), 25 deletions(-) -- 1.9.1
[PATCH -next] cxgb4: Fix return value check in cfg_queues_uld()
From: Wei Yongjun Fix the return value check, which tests the wrong variable in cfg_queues_uld(). Fixes: 94cdb8bb993a ("cxgb4: Add support for dynamic allocation of resources for ULD") Signed-off-by: Wei Yongjun --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c index 5d402ba..4d1de62 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c @@ -245,7 +245,7 @@ int cfg_queues_uld(struct adapter *adap, unsigned int uld_type, } rxq_info->rspq_id = kcalloc(nrxq, sizeof(unsigned short), GFP_KERNEL); - if (!rxq_info->uldrxq) { + if (!rxq_info->rspq_id) { kfree(rxq_info->uldrxq); kfree(rxq_info); return -ENOMEM;
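The pattern being fixed is "allocate a second buffer, then test the pointer that was just assigned, and unwind the earlier allocations on failure". A user-space sketch with calloc()/free() in place of kcalloc()/kfree() follows; the struct and function names are hypothetical and only loosely modeled on the driver.

```c
#include <assert.h>
#include <stdlib.h>

struct rxq_info {
	int *uldrxq;			/* first per-queue array */
	unsigned short *rspq_id;	/* second per-queue array */
};

/* Allocate both arrays; return NULL and free everything on any failure. */
static struct rxq_info *rxq_info_alloc(size_t nrxq)
{
	struct rxq_info *info;

	assert(nrxq > 0);

	info = calloc(1, sizeof(*info));
	if (!info)
		return NULL;

	info->uldrxq = calloc(nrxq, sizeof(*info->uldrxq));
	if (!info->uldrxq) {
		free(info);
		return NULL;
	}

	info->rspq_id = calloc(nrxq, sizeof(*info->rspq_id));
	if (!info->rspq_id) {	/* test the pointer just assigned, not uldrxq */
		free(info->uldrxq);
		free(info);
		return NULL;
	}

	return info;
}

static void rxq_info_free(struct rxq_info *info)
{
	if (!info)
		return;
	free(info->rspq_id);
	free(info->uldrxq);
	free(info);
}
```

Testing the earlier pointer a second time, as the pre-patch code did, can never detect a failure of the second allocation, because that pointer was already known to be non-NULL.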
[PATCH net-next v3 3/3] net: ethernet: mediatek: add the dts property to set if the HW supports LRO
Add the dts property for the capability if the hardware supports LRO. Signed-off-by: Nelson Chang--- Documentation/devicetree/bindings/net/mediatek-net.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt index 32eaaca..6103e55 100644 --- a/Documentation/devicetree/bindings/net/mediatek-net.txt +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt @@ -24,7 +24,7 @@ Required properties: Optional properties: - interrupt-parent: Should be the phandle for the interrupt controller that services interrupts for this device - +- mediatek,hwlro: the capability if the hardware supports LRO functions * Ethernet MAC node @@ -51,6 +51,7 @@ eth: ethernet@1b10 { reset-names = "eth"; mediatek,ethsys = <>; mediatek,pctl = <_pctl_a>; + mediatek,hwlro; #address-cells = <1>; #size-cells = <0>; -- 1.9.1
[PATCH net-next v3 2/3] net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO
The codes add ethtool functions to set RX flows for HW LRO. Because the HW LRO hardware can only recognize the destination IP of TCP/IP RX flows, the ethtool command to add HW LRO flow is as below: ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1] Otherwise, cause the hardware can set total four destination IPs, each GMAC (GMAC1/GMAC2) can set two IPs separately at most. Signed-off-by: Nelson Chang--- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 236 1 file changed, 236 insertions(+) diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 18600cb..481f360 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -1357,6 +1357,182 @@ static void mtk_hwlro_rx_uninit(struct mtk_eth *eth) mtk_w32(eth, 0, MTK_PDMA_LRO_CTRL_DW0); } +static void mtk_hwlro_val_ipaddr(struct mtk_eth *eth, int idx, __be32 ip) +{ + u32 reg_val; + + reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(idx)); + + /* invalidate the IP setting */ + mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx)); + + mtk_w32(eth, ip, MTK_LRO_DIP_DW0_CFG(idx)); + + /* validate the IP setting */ + mtk_w32(eth, (reg_val | MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx)); +} + +static void mtk_hwlro_inval_ipaddr(struct mtk_eth *eth, int idx) +{ + u32 reg_val; + + reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(idx)); + + /* invalidate the IP setting */ + mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx)); + + mtk_w32(eth, 0, MTK_LRO_DIP_DW0_CFG(idx)); +} + +static int mtk_hwlro_get_ip_cnt(struct mtk_mac *mac) +{ + int cnt = 0; + int i; + + for (i = 0; i < MTK_MAX_LRO_IP_CNT; i++) { + if (mac->hwlro_ip[i]) + cnt++; + } + + return cnt; +} + +static int mtk_hwlro_add_ipaddr(struct net_device *dev, + struct ethtool_rxnfc *cmd) +{ + struct ethtool_rx_flow_spec *fsp = + (struct ethtool_rx_flow_spec *)>fs; + struct mtk_mac *mac = netdev_priv(dev); + struct mtk_eth *eth = 
mac->hw; + int hwlro_idx; + + if ((fsp->flow_type != TCP_V4_FLOW) || + (!fsp->h_u.tcp_ip4_spec.ip4dst) || + (fsp->location > 1)) + return -EINVAL; + + mac->hwlro_ip[fsp->location] = htonl(fsp->h_u.tcp_ip4_spec.ip4dst); + hwlro_idx = (mac->id * MTK_MAX_LRO_IP_CNT) + fsp->location; + + mac->hwlro_ip_cnt = mtk_hwlro_get_ip_cnt(mac); + + mtk_hwlro_val_ipaddr(eth, hwlro_idx, mac->hwlro_ip[fsp->location]); + + return 0; +} + +static int mtk_hwlro_del_ipaddr(struct net_device *dev, + struct ethtool_rxnfc *cmd) +{ + struct ethtool_rx_flow_spec *fsp = + (struct ethtool_rx_flow_spec *)>fs; + struct mtk_mac *mac = netdev_priv(dev); + struct mtk_eth *eth = mac->hw; + int hwlro_idx; + + if (fsp->location > 1) + return -EINVAL; + + mac->hwlro_ip[fsp->location] = 0; + hwlro_idx = (mac->id * MTK_MAX_LRO_IP_CNT) + fsp->location; + + mac->hwlro_ip_cnt = mtk_hwlro_get_ip_cnt(mac); + + mtk_hwlro_inval_ipaddr(eth, hwlro_idx); + + return 0; +} + +static void mtk_hwlro_netdev_disable(struct net_device *dev) +{ + struct mtk_mac *mac = netdev_priv(dev); + struct mtk_eth *eth = mac->hw; + int i, hwlro_idx; + + for (i = 0; i < MTK_MAX_LRO_IP_CNT; i++) { + mac->hwlro_ip[i] = 0; + hwlro_idx = (mac->id * MTK_MAX_LRO_IP_CNT) + i; + + mtk_hwlro_inval_ipaddr(eth, hwlro_idx); + } + + mac->hwlro_ip_cnt = 0; +} + +static int mtk_hwlro_get_fdir_entry(struct net_device *dev, + struct ethtool_rxnfc *cmd) +{ + struct mtk_mac *mac = netdev_priv(dev); + struct ethtool_rx_flow_spec *fsp = + (struct ethtool_rx_flow_spec *)>fs; + + /* only tcp dst ipv4 is meaningful, others are meaningless */ + fsp->flow_type = TCP_V4_FLOW; + fsp->h_u.tcp_ip4_spec.ip4dst = ntohl(mac->hwlro_ip[fsp->location]); + fsp->m_u.tcp_ip4_spec.ip4dst = 0; + + fsp->h_u.tcp_ip4_spec.ip4src = 0; + fsp->m_u.tcp_ip4_spec.ip4src = 0x; + fsp->h_u.tcp_ip4_spec.psrc = 0; + fsp->m_u.tcp_ip4_spec.psrc = 0x; + fsp->h_u.tcp_ip4_spec.pdst = 0; + fsp->m_u.tcp_ip4_spec.pdst = 0x; + fsp->h_u.tcp_ip4_spec.tos = 0; + fsp->m_u.tcp_ip4_spec.tos = 0xff; + + 
return 0; +} + +static int mtk_hwlro_get_fdir_all(struct net_device *dev, + struct ethtool_rxnfc *cmd, + u32 *rule_locs) +{ + struct mtk_mac *mac = netdev_priv(dev); + int cnt = 0; + int i; + + for (i = 0; i <
[PATCH net-next v3 1/3] net: ethernet: mediatek: add HW LRO functions of PDMA RX rings
The codes add the large receive offload (LRO) functions by hardware as below: 1) PDMA has total four RX rings that one is the normal ring, and others can be configured as LRO rings. 2) Only TCP/IP RX flows can be offloaded. The hardware can set four IP addresses at most, if the destination IP of the RX flow matches one of them, it has the chance to be offloaded. 3) There three RX flows can be offloaded at most, and one flow is mapped to one RX ring. 4) If there are more than three candidate RX flows, the hardware can choose three of them by throughput comparison results. Signed-off-by: Nelson Chang--- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 215 +--- drivers/net/ethernet/mediatek/mtk_eth_soc.h | 75 +- 2 files changed, 265 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 522fe8d..18600cb 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -820,11 +820,51 @@ drop: return NETDEV_TX_OK; } +static struct mtk_rx_ring *mtk_get_rx_ring(struct mtk_eth *eth) +{ + int i; + struct mtk_rx_ring *ring; + int idx; + + if (!eth->hwlro) + return >rx_ring[0]; + + for (i = 0; i < MTK_MAX_RX_RING_NUM; i++) { + ring = >rx_ring[i]; + idx = NEXT_RX_DESP_IDX(ring->calc_idx, ring->dma_size); + if (ring->dma[idx].rxd2 & RX_DMA_DONE) { + ring->calc_idx_update = true; + return ring; + } + } + + return NULL; +} + +static void mtk_update_rx_cpu_idx(struct mtk_eth *eth) +{ + struct mtk_rx_ring *ring; + int i; + + if (!eth->hwlro) { + ring = >rx_ring[0]; + mtk_w32(eth, ring->calc_idx, ring->crx_idx_reg); + } else { + for (i = 0; i < MTK_MAX_RX_RING_NUM; i++) { + ring = >rx_ring[i]; + if (ring->calc_idx_update) { + ring->calc_idx_update = false; + mtk_w32(eth, ring->calc_idx, ring->crx_idx_reg); + } + } + } +} + static int mtk_poll_rx(struct napi_struct *napi, int budget, struct mtk_eth *eth) { - struct mtk_rx_ring *ring = >rx_ring; - int idx = 
ring->calc_idx; + struct mtk_rx_ring *ring; + int idx; struct sk_buff *skb; u8 *data, *new_data; struct mtk_rx_dma *rxd, trxd; @@ -836,7 +876,11 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget, dma_addr_t dma_addr; int mac = 0; - idx = NEXT_RX_DESP_IDX(idx); + ring = mtk_get_rx_ring(eth); + if (unlikely(!ring)) + goto rx_done; + + idx = NEXT_RX_DESP_IDX(ring->calc_idx, ring->dma_size); rxd = >dma[idx]; data = ring->data[idx]; @@ -907,12 +951,13 @@ release_desc: done++; } +rx_done: if (done) { /* make sure that all changes to the dma ring are flushed before * we continue */ wmb(); - mtk_w32(eth, ring->calc_idx, MTK_PRX_CRX_IDX0); + mtk_update_rx_cpu_idx(eth); } return done; @@ -1135,32 +1180,41 @@ static void mtk_tx_clean(struct mtk_eth *eth) } } -static int mtk_rx_alloc(struct mtk_eth *eth) +static int mtk_rx_alloc(struct mtk_eth *eth, int ring_no, int rx_flag) { - struct mtk_rx_ring *ring = >rx_ring; + struct mtk_rx_ring *ring = >rx_ring[ring_no]; + int rx_data_len, rx_dma_size; int i; - ring->frag_size = mtk_max_frag_size(ETH_DATA_LEN); + if (rx_flag == MTK_RX_FLAGS_HWLRO) { + rx_data_len = MTK_MAX_LRO_RX_LENGTH; + rx_dma_size = MTK_HW_LRO_DMA_SIZE; + } else { + rx_data_len = ETH_DATA_LEN; + rx_dma_size = MTK_DMA_SIZE; + } + + ring->frag_size = mtk_max_frag_size(rx_data_len); ring->buf_size = mtk_max_buf_size(ring->frag_size); - ring->data = kcalloc(MTK_DMA_SIZE, sizeof(*ring->data), + ring->data = kcalloc(rx_dma_size, sizeof(*ring->data), GFP_KERNEL); if (!ring->data) return -ENOMEM; - for (i = 0; i < MTK_DMA_SIZE; i++) { + for (i = 0; i < rx_dma_size; i++) { ring->data[i] = netdev_alloc_frag(ring->frag_size); if (!ring->data[i]) return -ENOMEM; } ring->dma = dma_alloc_coherent(eth->dev, - MTK_DMA_SIZE * sizeof(*ring->dma), + rx_dma_size * sizeof(*ring->dma),
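The ring selection introduced by this patch (use ring 0 when HW LRO is off, otherwise scan the rings for one whose next descriptor is done) can be sketched outside the driver. Everything here is an assumption made for illustration: the `rx_ring` layout, the `DESC_DONE` bit, and the power-of-two wraparound in `next_idx` are not taken from mtk_eth_soc.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RX_RINGS 4		/* one normal ring plus three LRO rings */
#define DESC_DONE    0x1u	/* hypothetical "descriptor filled" bit */

struct rx_ring {
	const uint32_t *desc_flags;	/* one status word per descriptor */
	unsigned int size;		/* descriptor count, a power of two */
	unsigned int calc_idx;		/* last index the CPU has processed */
	int needs_update;		/* CPU index must be written back later */
};

/* Advance an index with wraparound; assumes size is a power of two. */
static unsigned int next_idx(unsigned int idx, unsigned int size)
{
	assert(size && (size & (size - 1)) == 0);
	return (idx + 1) & (size - 1);
}

/* Return a ring with a completed descriptor, or NULL if none is ready. */
static struct rx_ring *pick_ring(struct rx_ring *rings, int hwlro)
{
	if (!hwlro)
		return &rings[0];

	for (int i = 0; i < MAX_RX_RINGS; i++) {
		struct rx_ring *r = &rings[i];

		if (r->desc_flags[next_idx(r->calc_idx, r->size)] & DESC_DONE) {
			r->needs_update = 1;
			return r;
		}
	}
	return NULL;
}
```

Marking the chosen ring with `needs_update` corresponds to the idea in the patch of writing back the CPU index only for rings that were actually serviced during the poll.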
[PATCH net-next v2 0/3] net: ethernet: mediatek: add HW LRO functions
The series add the large receive offload (LRO) functions by hardware and the ethtool functions to configure RX flows of HW LRO. changes since v3: - Respin the patch by the newer driver - Move the dts description of hwlro to optional properties changes since v2: - Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is programmed - Rephrase the dts property is a capability if the hardware supports LRO changes since v1: - Add HW LRO support - Add ethtool hooks to set LRO RX flows Nelson Chang (3): net: ethernet: mediatek: add HW LRO functions of PDMA RX rings net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO net: ethernet: mediatek: add dts configuration to enable HW LRO .../devicetree/bindings/net/mediatek-net.txt | 2 + drivers/net/ethernet/mediatek/mtk_eth_soc.c| 433 +++-- drivers/net/ethernet/mediatek/mtk_eth_soc.h| 75 +++- 3 files changed, 485 insertions(+), 25 deletions(-) -- 1.9.1
RE: [PATCH net-next v2 0/3] net: ethernet: mediatek: add HW LRO functions
Thanks David! I'll respin the patch and submit the newer version. -----Original Message----- From: David Miller [mailto:da...@davemloft.net] Sent: Saturday, September 17, 2016 9:46 PM To: Nelson Chang (張家祥) Cc: j...@phrozen.org; f.faine...@gmail.com; n...@openwrt.org; netdev@vger.kernel.org; linux-media...@lists.infradead.org; nelsonch...@gmail.com Subject: Re: [PATCH net-next v2 0/3] net: ethernet: mediatek: add HW LRO functions From: Nelson Chang Date: Wed, 14 Sep 2016 13:58:56 +0800 > The series add the large receive offload (LRO) functions by hardware > and the ethtool functions to configure RX flows of HW LRO. > > changes since v2: > - Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is > programmed > - Rephrase the dts property is a capability if the hardware supports > LRO > > changes since v1: > - Add HW LRO support > - Add ethtool hooks to set LRO RX flows This doesn't apply cleanly to net-next.
[net PATCH V3] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full
The XDP_TX action can fail to transmit the frame if the TX ring is full or the port is down. On TX failure it should drop the frame, rather than call 'break' as it does now, which is the same as XDP_PASS. Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support") Signed-off-by: Jesper Dangaard Brouer --- Is this goto label inside a switch case too ugly? Note, this fix has nothing to do with the page-refcnt bug I reported. drivers/net/ethernet/mellanox/mlx4/en_rx.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 2040dad8611d..9eadda431965 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -906,11 +906,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud length, tx_index, _pending)) goto consumed; - break; + goto xdp_drop; /* Drop on xmit failure */ default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED: case XDP_DROP: + xdp_drop: if (mlx4_en_rx_recycle(ring, frags)) goto consumed; goto next;
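To make the control-flow change easier to see, here is a stand-alone sketch of a switch where a failed transmit jumps to a label inside the drop case instead of breaking out to the pass path. The enums and `handle_action` are hypothetical and only mirror the shape of the driver code; they are not the mlx4 or XDP definitions.

```c
#include <assert.h>

enum action  { ACT_ABORTED, ACT_DROP, ACT_PASS, ACT_TX };	/* illustrative */
enum outcome { OUT_PASSED_TO_STACK, OUT_TRANSMITTED, OUT_DROPPED };

/* Decide what happens to a frame; tx_ok says whether the transmit succeeded. */
static enum outcome handle_action(int act, int tx_ok)
{
	assert(tx_ok == 0 || tx_ok == 1);

	switch (act) {
	case ACT_PASS:
		break;
	case ACT_TX:
		if (tx_ok)
			return OUT_TRANSMITTED;
		/* A plain 'break' here would leave the switch and pass the frame up. */
		goto drop;
	default:	/* unknown action: treated like a drop */
	case ACT_ABORTED:
	case ACT_DROP:
drop:
		return OUT_DROPPED;
	}

	/* Only reached via 'break', i.e. the pass case. */
	return OUT_PASSED_TO_STACK;
}
```

In this sketch `handle_action(ACT_TX, 0)` yields `OUT_DROPPED`; with a `break` in place of the `goto` it would have fallen through to `OUT_PASSED_TO_STACK`, which is the behavior the patch removes.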
Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full
On Fri, 16 Sep 2016 13:43:50 -0700 Brenden Blanco wrote: > On Fri, Sep 16, 2016 at 10:36:12PM +0200, Jesper Dangaard Brouer wrote: > > The XDP_TX action can fail transmitting the frame in case the TX ring > > is full or port is down. In case of TX failure it should drop the > > frame, and not as now call 'break' which is the same as XDP_PASS. > > > > Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write > > support") > > Signed-off-by: Jesper Dangaard Brouer > > You could in theory have also tried to recycle the page instead of > dropping it, but that's probably not worth optimizing when tx is backed > up, as you'll only save a handful of page_put's. The code to do so > wouldn't have been pretty. Yes, we could (and perhaps should) recycle the page instead. But as you also mention it would not look pretty. I'll send a V3, as XDP's primary concern is performance. > Reviewed-by: Brenden Blanco -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 0/4] ip_tunnel: add collect_md mode to IPv4/IPv6 tunnels
From: Alexei Starovoitov Date: Thu, 15 Sep 2016 13:00:28 -0700 > Similar to geneve, vxlan, gre tunnels implement 'collect metadata' mode > in ipip, ipip6, ip6ip6 tunnels. Series applied, thanks.
Re: [PATCH 0/3] constify net_device_ops structures
From: Julia Lawall Date: Thu, 15 Sep 2016 22:23:23 +0200 > Constify net_device_ops structures. All applied, thanks.
Re: [PATCH net-next] net: vrf: Remove RT_FL_TOS
From: David Ahern Date: Thu, 15 Sep 2016 10:13:47 -0700 > No longer used after d66f6c0a8f3c0 ("net: ipv4: Remove l3mdev_get_saddr") > > Signed-off-by: David Ahern Applied.
Re: [PATCH net-next] net: l3mdev: Remove netif_index_is_l3_master
From: David Ahern Date: Thu, 15 Sep 2016 10:18:45 -0700 > No longer used after e0d56fdd73422 ("net: l3mdev: remove redundant calls") > > Signed-off-by: David Ahern Applied.
Re: [PATCH] llc: switch type to bool as the timeout is only tested versus 0
From: Alan Date: Thu, 15 Sep 2016 18:51:25 +0100 > (As asked by Dave in February) > > Signed-off-by: Alan Cox Applied.
Re: [PATCH net-next] tcp: prepare skbs for better sack shifting
From: Eric Dumazet Date: Thu, 15 Sep 2016 09:33:02 -0700 > From: Eric Dumazet > > With large BDP TCP flows and lossy networks, it is very important > to keep a low number of skbs in the write queue. > > RACK and SACK processing can perform a linear scan of it. > > We should avoid putting any payload in skb->head, so that SACK > shifting can be done if needed. > > With this patch, we allow to pack ~0.5 MB per skb instead of > the 64KB initially cooked at tcp_sendmsg() time. > > This gives a reduction of number of skbs in write queue by eight. > tcp_rack_detect_loss() likes this. > > We still allow payload in skb->head for first skb put in the queue, > to not impact RPC workloads. > > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng Applied.
Re: [PATCH net] sctp: fix SSN comparision
From: Marcelo Ricardo Leitner Date: Thu, 15 Sep 2016 15:02:38 -0300 > This function actually operates on u32 yet its parameters were declared > as u16, causing integer truncation upon calling. > > Note in patch context that ADDIP_SERIAL_SIGN_BIT is already 32 bits. > > Signed-off-by: Marcelo Ricardo Leitner Applied.
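The truncation problem described in the quoted commit message can be illustrated with a small user-space sketch. This is not the SCTP code: the macro SERIAL_SIGN_BIT and both function names are hypothetical, and the comparison shown is the usual sign-bit test used for serial-number ordering, assumed here for illustration. The point is only that declaring the parameters as 16-bit throws away the upper half of a 32-bit serial before the comparison runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SERIAL_SIGN_BIT (UINT32_C(1) << 31)	/* hypothetical name */

/* Serial-number "less than" on full 32-bit values: s precedes t when
 * the wrapped difference has its top bit set. */
bool serial_lt32(uint32_t s, uint32_t t)
{
	return ((s - t) & SERIAL_SIGN_BIT) != 0;
}

/* Same test, but with parameters mistakenly declared 16 bits wide:
 * a caller's 32-bit values are truncated before the subtraction. */
bool serial_lt_truncated(uint16_t s, uint16_t t)
{
	return (((uint32_t)s - (uint32_t)t) & SERIAL_SIGN_BIT) != 0;
}

/* Two serials whose order flips once the upper 16 bits are lost. */
void check_truncation_example(void)
{
	uint32_t s = UINT32_C(0x00020001);
	uint32_t t = UINT32_C(0x00010002);

	assert(!serial_lt32(s, t));		/* s is ahead of t */
	/* Truncated to 0x0001 and 0x0002, s now appears to precede t. */
	assert(serial_lt_truncated((uint16_t)s, (uint16_t)t));
}
```

With parameters declared as 32-bit, the compiler no longer narrows the arguments at the call site, which matches the fix the message describes.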
Re: [PATCH] irda: Free skb on irda_accept error path.
From: Phil Turnbull Date: Thu, 15 Sep 2016 12:41:44 -0400 > skb is not freed if newsk is NULL. Rework the error path so free_skb is > unconditionally called on function exit. > > Fixes: c3ea9fa27413 ("[IrDA] af_irda: IRDA_ASSERT cleanups") > Signed-off-by: Phil Turnbull Applied.
Re: [PATCH net] tcp: fix overflow in __tcp_retransmit_skb()
From: Eric Dumazet Date: Thu, 15 Sep 2016 08:12:33 -0700 > From: Eric Dumazet > > If a TCP socket gets a large write queue, an overflow can happen > in a test in __tcp_retransmit_skb() preventing all retransmits. > > The flow then stalls and resets after timeouts. > > Tested: > > sysctl -w net.core.wmem_max=10 > netperf -H dest -- -s 10 > > Signed-off-by: Eric Dumazet Applied.
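The message does not quote the test that overflows, so the following is a generic sketch rather than the kernel code. It assumes, purely for illustration, a limit of the form "queued bytes plus a quarter" held in a signed 32-bit integer; the function names are hypothetical. It shows at what point such a sum no longer fits in a signed 32-bit value, and that the same sum fits in unsigned 32-bit arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would "queued + queued/4" exceed the range of a signed 32-bit int?
 * The sum is formed in 64 bits so the check itself cannot overflow. */
bool limit_overflows_int32(int32_t queued)
{
	assert(queued >= 0);
	int64_t limit = (int64_t)queued + (queued >> 2);
	return limit > INT32_MAX;
}

/* Same limit computed in unsigned 32-bit arithmetic, which has room
 * for any non-negative int32_t input: at most (2^31 - 1) + (2^29 - 1),
 * which is below 2^32. */
uint32_t limit_u32(int32_t queued)
{
	assert(queued >= 0);
	return (uint32_t)queued + ((uint32_t)queued >> 2);
}
```

In this model a write queue of 1,800,000,000 bytes gives a limit of 2,250,000,000, which is above INT32_MAX; a signed comparison against such a wrapped value would misbehave, while the unsigned form stays well defined.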
Re: [PATCH net] net: avoid sk_forward_alloc overflows
From: Eric Dumazet Date: Thu, 15 Sep 2016 08:48:46 -0700 > From: Eric Dumazet > > A malicious TCP receiver, sending SACK, can force the sender to split > skbs in write queue and increase its memory usage. > > Then, when socket is closed and its write queue purged, we might > overflow sk_forward_alloc (It becomes negative) > > sk_mem_reclaim() does nothing in this case, and more than 2GB > are leaked from TCP perspective (tcp_memory_allocated is not changed) > > Then warnings trigger from inet_sock_destruct() and > sk_stream_kill_queues() seeing a nonzero sk_forward_alloc > > All TCP stack can be stuck because TCP is under memory pressure. > > A simple fix is to preemptively reclaim from sk_mem_uncharge(). > > This makes sure a socket won't have more than 2 MB forward allocated, > after burst and idle period. > > Signed-off-by: Eric Dumazet Applied.
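The idea of reclaiming preemptively on every uncharge can be sketched with a toy counter. This is a user-space model under stated assumptions, not the kernel implementation: struct toy_sock, toy_uncharge() and toy_peak_after() are hypothetical, the 2 MB threshold is taken from the message, and the 1 MB reclaim step is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define RECLAIM_THRESHOLD (2 * 1024 * 1024)	/* 2 MB, per the message */
#define RECLAIM_AMOUNT    (1 * 1024 * 1024)	/* assumed: 1 MB */

struct toy_sock {
	int32_t forward_alloc;	/* bytes reserved but currently unused */
	int64_t returned;	/* bytes handed back to the global pool */
};

/* Release 'size' bytes into the socket's reserve, and hand part of the
 * reserve back to the global pool as soon as it reaches the threshold. */
void toy_uncharge(struct toy_sock *sk, int32_t size)
{
	assert(size >= 0 && size <= RECLAIM_AMOUNT);
	sk->forward_alloc += size;
	if (sk->forward_alloc >= RECLAIM_THRESHOLD) {
		sk->forward_alloc -= RECLAIM_AMOUNT;
		sk->returned += RECLAIM_AMOUNT;
	}
}

/* Uncharge 'calls' buffers of 'size' bytes and report the largest
 * reserve observed after any single call. */
int32_t toy_peak_after(int calls, int32_t size)
{
	struct toy_sock sk = { 0, 0 };
	int32_t peak = 0;

	for (int i = 0; i < calls; i++) {
		toy_uncharge(&sk, size);
		if (sk.forward_alloc > peak)
			peak = sk.forward_alloc;
	}
	/* Nothing is lost: each byte is still reserved or was returned. */
	assert(sk.returned + sk.forward_alloc == (int64_t)calls * size);
	return peak;
}
```

Purging 100,000 buffers of 64 KB each releases about 6.5 GB in total, more than a signed 32-bit counter can hold, yet in this model the reserve never reaches 2 MB because it is trimmed as it grows instead of once at the end.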
Re: [PATCH v2] xen-netback: fix error handling on netback_probe()
From: Filipe Manco Date: Thu, 15 Sep 2016 17:10:46 +0200 > In case of error during netback_probe() (e.g. an entry missing on the > xenstore) netback_remove() is called on the new device, which will set > the device backend state to XenbusStateClosed by calling > set_backend_state(). However, the backend state wasn't initialized by > netback_probe() at this point, which will cause an invalid transition > and set_backend_state() to BUG(). > > Initialize the backend state at the beginning of netback_probe() to > XenbusStateInitialising, and create two new valid state transitions on > set_backend_state(), from XenbusStateInitialising to XenbusStateClosed, > and from XenbusStateInitialising to XenbusStateInitWait. > > Signed-off-by: Filipe Manco Applied, thanks.
Re: pull-request: wireless-drivers-next 2016-09-15
From: Kalle Valo Date: Thu, 15 Sep 2016 18:09:21 +0300 > here's the first pull request for 4.9. The ones I want to point out are > the FIELD_PREP() and FIELD_GET() macros added to bitfield.h, which are > reviewed by Linus, and make it possible to remove util.h from mt7601u. > > Also we have new HW support to various drivers and other smaller > features, the signed tag below contains more information. And I pulled > my ath-current (uses older net tree as the baseline) branch to fix a > conflict in ath10k. > > Once again the diffstat from git request-pull was wrong. I fixed it by > manually copying the diffstat from a test pull against net-next, so > everything should be ok. But please let me know if there are any > problems. Pulled, thanks Kalle.
Re: [PATCH net-next 0/3] mlx5e Order-0 pages for Striding RQ
From: Tariq Toukan Date: Thu, 15 Sep 2016 16:08:35 +0300 > In this series, we refactor our Striding RQ receive-flow to always use > fragmented WQEs (Work Queue Elements) using order-0 pages, omitting the > flow that allocates and splits high-order pages which would fragment > and deplete high-order pages in the system. > > The first patch gives a slight degradation, but opens the opportunity > to using a simple page-cache mechanism of a fair size. > The page-cache, implemented in patch 3, not only closes the performance > gap but even gives a gain. > In patch 2 we re-organize the code to better manage the calls for > alloc/de-alloc pages in the RX flow. > > Series generated against net-next commit: > bed806cb266e "Merge branch 'mlxsw-ethtool'" Series applied, thanks.
Re: [PATCH net-next v2 0/3] net: ethernet: mediatek: add HW LRO functions
From: Nelson Chang Date: Wed, 14 Sep 2016 13:58:56 +0800 > The series add the large receive offload (LRO) functions by hardware and > the ethtool functions to configure RX flows of HW LRO. > > changes since v2: > - Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is programmed > - Rephrase the dts property is a capability if the hardware supports LRO > > changes since v1: > - Add HW LRO support > - Add ethtool hooks to set LRO RX flows This doesn't apply cleanly to net-next.
Re: [PATCH net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap
From: Jamal Hadi Salim Date: Thu, 15 Sep 2016 06:49:54 -0400 > +static int __init ifetc_index_init_module(void) > +{ > + pr_emerg("Loaded IFE tc_index\n"); ... > +static void __exit ifetc_index_cleanup_module(void) > +{ > + pr_emerg("Unloaded IFE tc_index\n"); This looks like some leftover debugging, please remove.
Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support
On Wed, Sep 14, 2016 at 04:03:04PM +0200, LABBE Corentin wrote: > > > +static int __maybe_unused sun8i_emac_suspend(struct platform_device > > > *pdev, pm_message_t state) > > > +{ > > > + struct net_device *ndev = platform_get_drvdata(pdev); > > > + struct sun8i_emac_priv *priv = netdev_priv(ndev); > > > + > > > + napi_disable(&priv->napi); > > > + > > > + if (netif_running(ndev)) > > > + netif_device_detach(ndev); > > > + > > > + sun8i_emac_stop_tx(ndev); > > > + sun8i_emac_stop_rx(ndev); > > > + > > > + sun8i_emac_rx_clean(ndev); > > > + sun8i_emac_tx_clean(ndev); > > > + > > > + phy_stop(ndev->phydev); > > > + > > > + return 0; > > > +} > > > + > > > +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev) > > > +{ > > > + struct net_device *ndev = platform_get_drvdata(pdev); > > > + struct sun8i_emac_priv *priv = netdev_priv(ndev); > > > + > > > + phy_start(ndev->phydev); > > > + > > > + sun8i_emac_start_tx(ndev); > > > + sun8i_emac_start_rx(ndev); > > > + > > > + if (netif_running(ndev)) > > > + netif_device_attach(ndev); > > > + > > > + netif_start_queue(ndev); > > > + > > > + napi_enable(&priv->napi); > > > + > > > + return 0; > > > +} > > > > The main idea behind the runtime PM hooks is that they bring the > > device to a working state and shuts it down when it's not needed > > anymore. Indeed. > I expect that the first part (all pm_runtime_xxx) of the patch bring that. > When the interface is not opened: > cat /sys/devices/platform/soc/1c3.ethernet/power/runtime_status > suspended > > > However, they shouldn't be called when the device is still in used, so > > all the mangling with NAPI, the phy and so on is irrelevant here, but > > the clocks, resets, for example, are. > > > > I do the same as other ethernet driver for suspend/resume. suspend / resume are used when you put the whole system into suspend, and bring it back. runtime_pm is only when the device is not used anymore. It makes sense when you suspend to do whatever you're doing here. It doesn't make any sense when the system is not suspended, but the device is. > > > static const struct of_device_id sun8i_emac_of_match_table[] = { > > > { .compatible = "allwinner,sun8i-a83t-emac", > > > .data = _variant_a83t }, > > > @@ -2246,6 +2302,8 @@ static struct platform_driver sun8i_emac_driver = { > > > .name = "sun8i-emac", > > > .of_match_table = sun8i_emac_of_match_table, > > > }, > > > + .suspend= sun8i_emac_suspend, > > > + .resume = sun8i_emac_resume, > > > > These are not the runtime PM hooks. How did you test that? > > > > Anyway I didnt test suspend/resume so I will remove it until I > successfully found how to hibernate my board. So you submit code you never tested? That's usually a recipe for disaster. Maxime -- Maxime Ripard, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com
Re: [PATCH net-next 05/14] tcp: track data delivery rate for a TCP connection
On Fri, Sep 16, 2016 at 5:38 PM, kbuild test robot wrote: > Hi Yuchung, > > [auto build test WARNING on net-next/master] > All warnings (new ones prefixed by >>): > >In file included from net/ipv4/route.c:103:0: >>> include/net/tcp.h:769:11: warning: 'packed' attribute ignored for field of >>> type 'struct skb_mstamp' [-Wattributes] >struct skb_mstamp first_tx_mstamp __packed; > ^~ >include/net/tcp.h:771:11: warning: 'packed' attribute ignored for field of > type 'struct skb_mstamp' [-Wattributes] >struct skb_mstamp delivered_mstamp __packed; > ^~ We have a fix for this, and we'll post it with the v2 series. thanks, neal
Re: [patch net-next v10 2/3] net: core: Add offload stats to if_stats_msg
> On Sep 16, 2016, at 4:05 PM, Jiri Pirko wrote: > > From: Nogah Frankel > > Add a nested attribute of offload stats to if_stats_msg > named IFLA_STATS_LINK_OFFLOAD_XSTATS. > Under it, add SW stats, meaning stats only per packets that went via > slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT. > > Signed-off-by: Nogah Frankel > Signed-off-by: Jiri Pirko > --- > include/uapi/linux/if_link.h | 9 > net/core/rtnetlink.c | 111 +-- > 2 files changed, 116 insertions(+), 4 deletions(-) > > Acked-by: Nikolay Aleksandrov
Re: [patch net-next v10 1/3] netdevice: Add offload statistics ndo
> On Sep 16, 2016, at 4:05 PM, Jiri Pirko wrote: > > From: Nogah Frankel > > Add a new ndo to return statistics for offloaded operation. > Since there can be many different offloaded operation with many > stats types, the ndo gets an attribute id by which it knows which > stats are wanted. The ndo also gets a void pointer to be cast according > to the attribute id. > > Signed-off-by: Nogah Frankel > Signed-off-by: Jiri Pirko > --- > include/linux/netdevice.h | 12 > 1 file changed, 12 insertions(+) > Reviewed-by: Nikolay Aleksandrov
Re: [PATCH net-next v2 3/5] cxgb4: add parser to translate u32 filters to internal spec
On Thursday, September 09/15/16, 2016 at 07:27:24 -0700, John Fastabend wrote: > On 16-09-13 04:42 AM, Rahul Lakkireddy wrote: > > Parse information sent by u32 into internal filter specification. > > Add support for parsing several fields in IPv4, IPv6, TCP, and UDP. > > > > Signed-off-by: Rahul Lakkireddy > > Signed-off-by: Hariprasad Shenai > > --- > > Looks good to me. Also curious if you would find it worthwhile to > have a cls_u32 mode that starts at L2 instead of the IP header? The > use case would be to use cls_u32 with various encapsulation protocols > in front of the IP header. > > Reviewed-by: John Fastabend Thanks for the review John. Yes, we are also looking at getting u32 to start from the L2 header in order to allow matching encapsulation protocols in front of the IP header. In addition, our hardware also keeps track of per-filter statistics such as the number of times the filter has been hit and the number of bytes that hit the filter. Hence, we are also looking at exposing these per-filter stats to u32. Thanks, Rahul
Re: [PATCH net-next] rxrpc: Make IPv6 support conditional on CONFIG_IPV6
From: David Howells Date: Sat, 17 Sep 2016 07:26:01 +0100 > Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it. > This is then made conditional on CONFIG_IPV6. > > Without this, the following can be seen: > >net/built-in.o: In function `rxrpc_init_peer': >>> peer_object.c:(.text+0x18c3c8): undefined reference to >>> `ip6_route_output_flags' > > Reported-by: kbuild test robot > Signed-off-by: David Howells Applied.
[PATCH net-next] rxrpc: Make IPv6 support conditional on CONFIG_IPV6
Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it. This is then made conditional on CONFIG_IPV6. Without this, the following can be seen: net/built-in.o: In function `rxrpc_init_peer': >> peer_object.c:(.text+0x18c3c8): undefined reference to >> `ip6_route_output_flags' Reported-by: kbuild test robot Signed-off-by: David Howells --- net/rxrpc/Kconfig|7 +++ net/rxrpc/af_rxrpc.c |7 ++- net/rxrpc/conn_object.c |2 ++ net/rxrpc/local_object.c |2 ++ net/rxrpc/output.c |2 ++ net/rxrpc/peer_event.c |4 +++- net/rxrpc/peer_object.c | 10 ++ net/rxrpc/utils.c|2 ++ 8 files changed, 34 insertions(+), 2 deletions(-) diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig index 784c53163b7b..13396c74b5c1 100644 --- a/net/rxrpc/Kconfig +++ b/net/rxrpc/Kconfig @@ -19,6 +19,13 @@ config AF_RXRPC See Documentation/networking/rxrpc.txt. +config AF_RXRPC_IPV6 + bool "IPv6 support for RxRPC" + depends on (IPV6 = m && AF_RXRPC = m) || (IPV6 = y && AF_RXRPC) + help + Say Y here to allow AF_RXRPC to use IPV6 UDP as well as IPV4 UDP as + its network transport. 
+ config AF_RXRPC_DEBUG bool "RxRPC dynamic debugging" diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index f61f7b2d1ca4..09f81befc705 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -109,12 +109,14 @@ static int rxrpc_validate_address(struct rxrpc_sock *rx, tail = offsetof(struct sockaddr_rxrpc, transport.sin.__pad); break; +#ifdef CONFIG_AF_RXRPC_IPV6 case AF_INET6: if (srx->transport_len < sizeof(struct sockaddr_in6)) return -EINVAL; tail = offsetof(struct sockaddr_rxrpc, transport) + sizeof(struct sockaddr_in6); break; +#endif default: return -EAFNOSUPPORT; @@ -413,9 +415,11 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len) case AF_INET: rx->srx.transport_len = sizeof(struct sockaddr_in); break; +#ifdef CONFIG_AF_RXRPC_IPV6 case AF_INET6: rx->srx.transport_len = sizeof(struct sockaddr_in6); break; +#endif default: ret = -EAFNOSUPPORT; goto error_unlock; @@ -570,7 +574,8 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol, return -EAFNOSUPPORT; /* we support transport protocol UDP/UDP6 only */ - if (protocol != PF_INET && protocol != PF_INET6) + if (protocol != PF_INET && + IS_ENABLED(CONFIG_AF_RXRPC_IPV6) && protocol != PF_INET6) return -EPROTONOSUPPORT; if (sock->type != SOCK_DGRAM) diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c index c0ddba787fd4..bb1f29280aea 100644 --- a/net/rxrpc/conn_object.c +++ b/net/rxrpc/conn_object.c @@ -134,6 +134,7 @@ struct rxrpc_connection *rxrpc_find_connection_rcu(struct rxrpc_local *local, srx.transport.sin.sin_addr.s_addr) goto not_found; break; +#ifdef CONFIG_AF_RXRPC_IPV6 case AF_INET6: if (peer->srx.transport.sin6.sin6_port != srx.transport.sin6.sin6_port || @@ -142,6 +143,7 @@ struct rxrpc_connection *rxrpc_find_connection_rcu(struct rxrpc_local *local, sizeof(struct in6_addr)) != 0) goto not_found; break; +#endif default: BUG(); } diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c index 
f5b9bb0d3f98..e3fad80b0795 100644 --- a/net/rxrpc/local_object.c +++ b/net/rxrpc/local_object.c @@ -58,6 +58,7 @@ static long rxrpc_local_cmp_key(const struct rxrpc_local *local, memcmp(>srx.transport.sin.sin_addr, >transport.sin.sin_addr, sizeof(struct in_addr)); +#ifdef CONFIG_AF_RXRPC_IPV6 case AF_INET6: /* If the choice of UDP6 port is left up to the transport, then * the endpoint record doesn't match. @@ -67,6 +68,7 @@ static long rxrpc_local_cmp_key(const struct rxrpc_local *local, memcmp(>srx.transport.sin6.sin6_addr, >transport.sin6.sin6_addr, sizeof(struct in6_addr)); +#endif default: BUG(); } diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c index d7cd87f17f0d..06a9aca739d1 100644 --- a/net/rxrpc/output.c +++ b/net/rxrpc/output.c @@ -259,6 +259,7 @@