nd testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
--
sent RDS datagrams will get
retransmitted after rds_tcp_accept_one() switches sockets.
Patch 3 fixes a race window which would prematurely re-enable
rds_send_xmit() before the rds_tcp_connection setup has been
completed in rds_tcp_accept_one().
Sowmini Varadhan (3):
RDS: TCP: Add/use
(for old and new sockets) when resetting
existing callbacks.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c| 65 +++--
net/rds/tcp.h|1 +
net/rds/tcp_listen.c | 13 +++---
3 files chang
When we switch a connection's sockets in rds_tcp_rest_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassmble the RDS
datagram. Use rds_send_reset() which is designed for this purpose.
Signed-off-by: Sowmini Varadhan
/hypermail/linux/kernel/1602.3/04431.html
to get it to build- did I just miss something in my pull?
--Sowmini
Are you suggesting some type of AAA to vet the hbh
option itself?
--Sowmini
l APIs don't
> >>>>> allow setting EH. I would like to avoid that :-)
> On 21.05.2016 19:46, Sowmini Varadhan wrote:
> > Do you mean this
> > http://www.ietf.org/mail-archive/web/spud/current/msg00365.html
On (05/22/16 03:08), Hannes Frederic Sowa wrote:
> Hmm,
is is something IETF
> people probably know more about what an impact this change could have.
I think it would be ok for some options to be privileged, others to not.
We certainly have precedent for that from the classic socket APIs..
and man pages etc can document any restrictions, as we do with other
ioctls/sockopts etc.
--Sowmini
the networking separation we can't make it simply
> non-priv.
sure, I agree with that point.
--Sowmini
parsing overhead and thus decided to
> block it for ordinary users.
That's probably more likely, esp for hbh options. It may also be
interesting to find out what BSD does in these cases.
--Sowmini
TCP connection resets due to
re-execution of the connection arbitration logic.
Sowmini Varadhan (2):
RDS: TCP: rds_tcp_accept_worker() must exit gracefully when
terminating rds-tcp
RDS: TCP: Avoid rds connection churn from rogue SYNs
net/rds/tcp_listen.c | 13 +
1 files
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.
Signed-off-by: Sowmini
int will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp_listen.c |3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git
x.net>
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/core/skbuff.c |6 --
1 files changed, 0 insertions(+), 6 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5586be9..f2b77e5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -46
's in the skb_shared_info. The len and data_len should be
the sum-total for the whole skb, including skb_frag_t's and ->frag_list.
--Sowmini
cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Santosh Shilimkar
threads in rds_tcp_xmit
have completed (otherwise a null-ptr deref may be encountered).
Patch 2 synchronizes rds_tcp_accept_one() with the rds_tcp*connect()
path.
v2: review comments from Santosh Shilimkar, other spelling corrections
Sowmini Varadhan (2):
RDS:TCP: Synchronize
rbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
v2
> rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
> Like patch 1/2, probably we can leverage return value of above.
:
> You probably don't need the local 'conn_state' and below should work.
> if (!rds_conn_connecting(conn) && !rds_conn_up(conn))
see explanation for comment to 1/2.
--Sowmini
ctice, its not
frequent to have syns crossing each other at "almost the same time".
--Sowmini
any threads in rds_tcp_xmit
have completed (otherwise a null-ptr deref may be encountered).
Patch 2 synchronizes rds_tcp_accept_one() with the rds_tcp*connect()
path.
Sowmini Varadhan (2):
RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when
resetting t_sock
RDS: TCP: Synchrnozize
rbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c |1 +
net/rds/tcp.h |
cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.va
for Rx on the PF_RDS socket.
Patch 1 of this patchset adds a pskb_extract() function that
does all this without the redundant memcpy's in pskb_expand_head()
and __pskb_pull_tail().
v2: Marcelo Leitner review comments
Sowmini Varadhan (2):
Add pskb_extract() helper function
Call
rim() is then invoked to trim clone
down to the requested to_copy bytes.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: Marcelo Leitner review comments
include/linux/skbuff.h |2 +
net/core/skbuff.c | 242
2 files cha
r RDS-TCP, the rx side does show a noticeable
benefit with rds-stress. To an extent, this is also impacted
by the packet size, and the type of test (for our DB workloads,
we use request-response tests, and the packet size tends to
typically be 8K req, 256 byt responses), so I guess ymmv.
--Sowmini
for Rx on the PF_RDS socket.
Patch 1 of this patchset adds a pskb_extract() function that
does all this without the redundant memcpy's in pskb_expand_head()
and __pskb_pull_tail().
Sowmini Varadhan (2):
Add pskb_extract() helper function
Call pskb_extract() helper function
include/linux
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.va
rim() is then invoked to trim clone
down to the requested to_copy bytes.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/linux/skbuff.h |2 +
net/core/skbuff.c | 248
2 files changed, 250 insertions(+), 0 deletions(-)
d
rim() is then invoked to trim clone
down to the requested to_copy bytes.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/linux/skbuff.h |2 +
net/core/skbuff.c | 248
2 files changed, 250 insertions(+), 0 deletions(-)
d
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.va
the needless copy
of trailer frags/pages that will then get trimmed away. I am
deferring that optimization for the next iteration, and would
like to get feedback on this first pass, which by itself gives
a noticeable perf boost.
Sowmini Varadhan (2):
Add pskb_extract() helper function
Call
list cases. I'm
afraid I might end up needing most of the stuff under the "Pure
masohism" (sic) comment in __pskb_pull_tail().
--Sowmini
On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?
That almost looks like what I want, but skb_split modifies both
skb and skb1, and I want to leave skb untouched (otherwise
I will mess up the book-keeping in tcp_read_sock). But skb_split
is a good template- I
oid skb_copy_bits() above, and to
pass delta to pskb_expand_head,
- in pskb_expand_head, only do the memcpy listed above
if delta <= 0
Any other ideas for how to achieve this?
--Sowmini
Yes, this fixes it for me too!
Tested-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
ket that you see the issue. I dont know if
that provides any clues.
--Sowmini
dpage path that i40e does not anticipate.
Other drivers (ixgbe etc) work fine, so my hunch would be that this
is specific to i40e (and not a skb_linearize bug) but I could be
wrong.
--Sowmini
e if that changed info is significant?
--Sowmini
out with rds-stress on my test-pair, unfortunately, I
still see the Tx hang.
Setting up the test is quite easy- for reference, the instructions
are here:
https://sourceforge.net/p/e1000/mailman/message/34936766/
--Sowmini
RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d24f
("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v3: review comments from Santosh Shilimkar
net/rds/tcp.c |2 --
1 files changed, 0
bably doing this very rarely, and for a good
reason, and is fully aware of the cost. So there is some degree
of human control possible.
--Sowmini
of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.
Signed-off-by: Sowmini Varadhan <sowmini.va
e various
lists yesterday, so he could have just told me about this.
But, whatever.
--sowmini
Patch 1 uses sysctl to create tunable socket buffer size parameters.
Patch 2 removes an unuused constant.
v2: use sysctl
v3: review comments from Santosh Shilimkar, Eric Dumazet
v4: review comments from Hannes Sowa
Sowmini Varadhan (2):
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds
se I need to rds_tcp_sysctl_reset()
existing connections to make them use the new tunable.
--Sowmini
do this. I needed to
do the copy_from_user() + kstrtoul, and user_atoi had some
additional checks that seemed useful.
--Sowmini
enting a design extension to the daemon approach
in the future.
By itself, the sysctl support adds value and can co-exist with those
extensions.
--Sowmini
of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.
Signed-off-by: Sowmini Varadhan <sowmini.va
Patch 1 uses sysctl to create tunable socket buffer size parameters.
Patch 2 removes an unuused constant.
Changes since v2: review comments from Santosh Shilimkar, Eric Dumazet
Sowmini Varadhan (2):
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
RDS: TCP: Remove unused
RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d24f
("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v3: review comments from Santosh Shilimkar
net/rds/tcp.c |2 --
1 files changed, 0
uite unpredictable (as you
point out, depends on the kernel version among other things).
> But again if your sysctl allows to set a value below SOCK_MIN_SNDBUF,
> that might be a problem, because stack could have a hidden bug for very
> small values of sndbuf/rcvbuf.
sure, fixing/testing it as I write this.
--Sowmini
amespace are at least as important configuration
> parameters as these two.
Those parameters can be added via sysctl if needed- we have not
seen the need for that yet.
And if something comes up that needs the netlink etc. we can
add it down the road. It's just not needed now.
Thanks
--Sowmini
>
the buffer size to 4608 (as reported by
getsockopt in my env. I think the getsockopt value is impacted by
many factors).
--Sowmini
On (03/15/16 09:38), santosh shilimkar wrote:
> >+if (rtn->sndbuf_size > 0) {
> So value of 1 is allowed as well. There should be some
> minimum default or multiple of it. Of course above check
> can remain as is as long as you validate the user input
> in handlers.
yes, just as user-space
of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.
Signed-off-by: Sowmini Varadhan <sowmini.va
are not likely to correctly find IPVERSION,
or "6", at hdr.ipv4->version, but will instead just needlessly
trigger an unaligned access. (IPv4/IPv6 over LLC is almost never
implemented).
The unaligned access is thus avoidable: bail out quickly after
examining first->protocol.
Signed-off-by:
send out the patches later this week after some more cleanup
and testing.
--Sowmini
However, it would still be nice to know exactly what distribution
issues come out of modparam.
wn
the stack with the NET_IP_ALIGNment and (b) ARP is only sent over
Ethernet II (there is no LLC SAP for ARP, which is a big reason
why ipv4 is not sent over llc, despite rfc 1042).
I figured it would not hurt to pass it down, in case we decide
to do something clever with it in the future.
--Sowmini
are not likely to correctly find IPVERSION,
or "6", at hdr.ipv4->version, but will instead just needlessly
trigger an unaligned access. (IPv4/IPv6 over LLC is almost never
implemented).
The unaligned access is thus avoidable: bail out quickly after
examining skb->protocol.
Signed-off-by:
a problem
to implement at all, if it ever shows up as a requirement for customers.
--Sowmini
far simpler to just tell the cluster-setup scripts to just
refrain from an ifup of the relevant interfaces till all the config
is set up.
Besides, the basic problem remains: for an arbitrary kernel module
that has parameters that need to be customized before module init,
what are the options today?
--Sowmini
hings to get to per-net vars.
I dont know if there is a better alternative than sysctl/module_param
for achieving what I'm trying to do, which is to set up the params
for the kernel sockets before creating them. If yes, some
hints/rtfms would be helpful.
--Sowmini
On (03/11/16 11:09), Stephen Hemminger wrote:
>
> Module parameters are a problem for distributions and should only be used
> as a last resort.
I was not aware of that- out of curiosity, what is the associated problem?
What would be the alternative recommendation in this case?
--Sowmini
Some payload sizes/patterns stand to gain performance benefits by
tuning the size of the TCP socket buffers, so this commit adds
module parameters to customize those values when desired.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c | 16 +
onable constraint, but is it possible to end up
with a deadlocked TCP (recv) socket- one for which the receiver
closed the window (so sender TCP cannot send the remaining
bytes of the kcm message), but cannot be drained because of #3 above?
BTW there are a couple of typos above;
s/skbuf/skbuff
s/incluing/including
--Sowmini
at no one else has noticed this. It did not
take me any effort to reproduce it on an x86.
--Sowmini
what the compiler is flagging.
--Sowmini
struct tc_to_netdev tc;
:
> + tc.type = TC_SETUP_MQPRIO;
The compiler error is for fields within the union which lacks
both a tag and a union-name. So I'm not sure how the above will
help.
--Sowmini
John,
I'm getting a failure when I try to build the latest net-next. I see
net/sched/sch_mqprio.c: In function `mqprio_init':
net/sched/sch_mqprio.c:145: error: unknown field `tc' specified in initializer
net/sched/sch_mqprio.c:145: warning: missing braces around initializer
ck 1216
sizeof request_sock 312
sizeof inet_request_sock 328
offsetof sk_policy 1 520
So it's good to know that crash does not lie.
But then it's odd that the struct sizes (esp of things like
request_sock, which are not config dependant) are not the same.
--Sowmini
Hat 4.4.7-4)"
But my question from the email remains. Unless I am missing
something subtle in the code, a struct request_sock and a
struct sock only have the sock_common part in common. So casting
a request_sock as a struct sock may have issues?
--Sowmini
intk's of the structures
in the case of the v440. And they *are* different, and the numbers
match the values dumped on the console on pnaic. So isnt there actually
a problem here?
--Sowmini
in the
> console.
Maybe you're seeing something else then.
--Sowmini
ds you are looking for?
so how is this supposed to work? (Evidently it worked for Meelis
before, but I dont know if that was before or after commit
9e17f8a475).
--Sowmini
ight, but there is a big difference between a performance degradation
> and a hard failure. It would at least be nice to know what the
> performance hit actually is, if it's acceptable then this would be a
> far simpler and much less invasive fix than the alternatives.
--Sowmini
On (02/03/16 09:07), Tom Herbert wrote:
> > Kernel unaligned access at TPC[9150dc] ipv4_neigh_lookup+0x38/0x170
>
> Sowmini,
>
> This doesn't look like a hard crash to me. Instead of trying to fix
> all the alignment issues for Sparc, can we just take the trap, fix up
>
Use sunvnet perf trace macros to monitor LDC message exchange state.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
drivers/net/ethernet/sun/sunvnet.c | 24 ++--
1 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethern
Add perf event macros for support of tracing and instrumentation
of LDC state machine
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/trace/events/sunvnet.h | 139
1 files changed, 139 insertions(+), 0 deletions(-)
creat
Added some perf tracepoints to help track and debug sunvnet
descriptor state at run-time.
Sowmini Varadhan (2):
sunvnet: Add support for perf LDC event tracing
sunvnet: perf tracepoint invocations to trace LDC state machine
drivers/net/ethernet/sun/sunvnet.c | 24 ++-
include/trace
%g1
I get unaligned access traps at __skb_flow_dissect+500 and
__skb_flow_dissect+512 (corresponding to saddr and daddr), once for
each interface (gretap/eth0 and eth1).
--Sowmini
On (01/31/16 13:37), Tom Herbert wrote:
>
> Call get_unaligned_be32 when we access 32-bit fields in
> __skb_flow_dissect. At the beginning check for unlikely case of
> 1-byte aligned packet.
>
> Note that flow_dissector may be asked to parse packet unaligned
> fields in two instances:
I tested
On (02/01/16 16:20), Nicolas Dichtel wrote:
> There is also the tile architecture, up to 76 cores on the board I've seen. It
> requires an alignment on 8!
> I wonder how this case may be properly handled. A simple ipip scenario fails.
Yes, I'm also observing the same thing. Simply setting up
with LLC/SNAP etc.
And in addition to all this, anything we can do to help align
nested encapsulations (i.e., Tom's concern) is also needed, and
maybe it will be cleaner to do that if we just split off deep packet
inspection for flow extraction away from eth_get_headlen.
--Sowmini
On (01/31/16 16:24), Eric Dumazet wrote:
>
> But this test is absolutely useless, what about testing arches that
> actually care ?
>
Yes, I plan to help test this out tomorrow, Tom suggested setting up
gre-teb between x86 and sparc.
--Sowmini
On (01/30/16 09:43), Tom Herbert wrote:
> That is not the only case, If a GRE TEB packet is ever received and
> flow dissector is called for any reason (like skb_get_hash) there's
> going to be problems-- and this doesn't require GRE to even be
> configured on the host.
>
> I have a patch that
On (01/29/16 19:23), Eric Dumazet wrote:
> BTW, even a memcpy(_addrs->v4addrs, >saddr, 8) could crash, as
> the compiler can certainly assume src and dst are 4 bytes aligned, and
> could use word accesses when inlining memcpy() even on Sparc.
>
> Apparently the compiler used
There is an unaligned access at __skb_flow_dissect when it calls
ip6_flowlabel() with the call stack
__skb_flow_dissect+0x2a8/0x87c
eth_get_headlen+0x5c/0xaxa4
ixgbe_clean_rx_irq+0x5cc/0xb20 [ixgbe]
ixgbe_poll+0x5a4/0x760 [ixgbe]
net_rx_action+0x13c/0x354
:
Essentially,
th
memcpy (as was happening with the saddr ref in skb_flow_dissect
that puzzled me and Eric because it did not generate any traps).
I suppose one could sprinkele a few WARN_ON's for !IS_ALIGNED
but that's not a fool-proof detection method either (in addition
to all the ugly shouting in the code).
--Sowmini
On (01/29/16 15:31), Tom Herbert wrote:
> But even within flow dissector, to be completely correct, we need to
> replace all 32-bit accesses with the mempy (flow_label, mpls label,
> keyid) and be vigilant about new ones coming in. For that matter, ..
well, one question that came to my mind when
ting me,
I dont have any l2 gre invovled (simple p2p Ethernet II + ipv6 + tcp,
no vlans, gre or other exotic stuff).
--Sowmini
On (01/29/16 11:37), Eric Dumazet wrote:
>
> I have no idea why reading iph->saddr or iph->daddr would not hit the
> problem, but accessing the 32bit ipv6 flow label would be an issue.
>
> Something is fishy.
I was wondering about this myself. Even on sparc, I only first
ran into the errors for
plains about unaligned access
in most cases, some sins may pass under the radar, and other
platforms dont even generate traps, so it's easy to not know
that there's a problem, a lot of the time.
--Sowmini
On (01/06/16 16:33), David Miller wrote:
>
> Signed-off-by: David S. Miller <da...@davemloft.net>
Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
>
> As promised back in November, I'm commiting this into net-next
> now. I'd work on the conversions o
esn't have destination mac address equals to FF:FF:FF:FF:FF:FF.
>
> This commit sets RAR0 correctly to default HW mac address.
>
> Signed-off-by: Tushar Dave <tushar.n.d...@oracle.com>
Tested-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
--
To unsubscribe fro
On (12/30/15 15:42), Stas Sergeev wrote:
> 29.12.2015 18:22, Sowmini Varadhan пишет:
> > Do you have admin control over the ubuntu router?
> > If yes, you might want to check the shared_media [#] setting
> > on that router for the interfaces with overlapping subnets.
> &
IPSKB_DOREDIRECT if shared_media is turned off.
--Sowmini
[#] https://www.frozentux.net/ipsysctl-tutorial/chunkyhtml/theconfvariables.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo inf
irect. You would have to check into the configuration
and/or implementation of the router- it should not be sending
back a redirect in the above case (different netmasks) even
if the ingress and egress physical interfaces are the same.
--Sowmini
--
To unsubscribe from this list: send the line "unsubscri
I
suppose it might not hurt if the receiver can do some sanity checking
on the redirect but this might not eliminate every error, since
it might not be possible to detect netmask mismatch in every case.
--Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev"
401 - 500 of 629 matches
Mail list logo