RDS over IB does not use multipath RDS, so the array
of additional rds_conn_path structures is always superfluous
in this case. Reduce the memory footprint of the rds module
by making this a dynamic allocation predicated on whether
the transport is mp_capable.
Signed-off-by: Sowmini Varadhan
rement c_send_gen at any time.
>*/
> - cp->cp_send_gen++;
> - send_gen = cp->cp_send_gen;
> + send_gen = READ_ONCE(cp->cp_send_gen) + 1;
> + WRITE_ONCE(cp->cp_send_gen, send_gen);
>
--Sowmini
is in the middle of an unload)
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/connection.c | 16 ++--
net/rds/rds.h |2 +-
net/rds/tcp.c |2 +-
net/rds/tcp_connect.c |4 ++--
net/rds/tcp_send.c|2 +-
net/rds/threads.c |
ent->sk && parent->sk != sk)
#2 assumes that all the refcount book-keeping is being done
correctly (there is the danger that we end up taking 2 refs on the sk)
--Sowmini
On (07/09/17 11:49), Linus Torvalds wrote:
>
> On Sat, Jul 8, 2017 at 3:36 AM, David Miller <da...@davemloft.net> wrote:
> >
> > 8) Fix socket leak on accept() in RDS, from Sowmini Varadhan. Also
> >add a WARN_ON() to sock_graft() so other protocol stacks don't
h
expects to find a null parent->sk
Avoid these problems by calling sock_create_lite().
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp_listen.c |2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_l
sock_graft() unilaterally sets up parent->sk based on the
assumption that the existing parent->sk is null. If this
condition is not true, then the existing parent->sk would
be leaked, so add a WARN_ON() to alert callers who may fall
in this category.
Signed-off-by: Sowmini Varadhan <s
while the test is running.
Sowmini Varadhan (2):
rds: tcp: use sock_create_lite() to create the accept socket
net/sock: add WARN_ON(parent->sk) in sock_graft()
include/net/sock.h |1 +
net/rds/tcp_listen.c |2 +-
2 files changed, 2 insertions(+), 1 deletions(-)
On (06/27/17 15:59), Sowmini Varadhan wrote:
> > Why does rds-tcp need to call sock_graft() without those invariants
> > met?
>
> It would certainly help to declare "dont use sock_creeate_kern()
> if you are going to accept on this socket"- I dont see that being
&
?
It would certainly help to declare "dont use sock_creeate_kern()
if you are going to accept on this socket"- I dont see that being
mandated anywhere.
It would also help to have a BUG_ON(parent->sk) or at least a
WARN_ON(parent->sk) in sock_graft, before unilaterally assigning
it to the new sk.
--Sowmini
We're seeing a memleak when we run an infinite loop that
loads/unloads rds-tcp, and runs some traffic between each
load/unload.
Analysis shows that this is happening for the following reason:
inet_accept -> sock_graft does
parent->sk = sk
but if the parent->sk was previously pointing
Patch1 is a bug fix for correct reconnect when a connection
is restarted. Patch 2 accelerates cleanup by setting linger
to 1 and sending a RST to the peer.
Sowmini Varadhan (2):
rds: tcp: send handshake ping-probe from passive endpoint
rds: tcp: set linger to 1 when unloading a rds-tcp
net
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Tested-by: Jenny Xu <jenny.x...@oracle.com>
---
net/rds/rds.h|1 +
net/rds/recv.c |6 +++---
net/rds/send.c | 14 ++
net/rds/tcp_listen.c |2 ++
4 files changed, 12 insertions(+),
.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Tested-by: Jenny Xu <jenny.x...@oracle.com>
---
net/rds/connection.c |1 +
net/rds/rds.h |3 ++-
net/rds/tcp.h |1 +
net/rds/tcp_connect.c |2 ++
net/rds/tcp_listen.c |2 +-
5 file
Found when testing between sparc and x86 machines on different
subnets, so the address comparison patterns hit the corner cases and
brought out some bugs fixed by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Tested-by: Imanti Mendez <imanti.men...@oracle.c
This series contains 2 bug fixes (patch2, patch3) and one bit of
code cleanup (patch1) identified during database testing
Sowmini Varadhan (3):
rds: tcp: remove cp_outgoing
rds: tcp: various endian-ness fixes
rds: tcp: Set linger when rejecting an incoming conn in
rds_tcp_accept_one
o not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with a RST.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Tested-by: Imanti Mendez <imanti.men...@oracle.com>
Acked-by: Santosh Shilimkar <san
After commit 1a0e100fb2c9 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Tested-b
hbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Reviewed-by: Julian Anastasov <j...@ssi.bg>
---
v2: kbuild-test-robot compile error
v3: do not export_symbol neigh_remove_one (David Miller comment)
v4: rea
On (06/01/17 22:34), Julian Anastasov wrote:
> > + np = >hash_buckets[hash_val];
> > + while ((n = rcu_dereference_protected(*np,
> > + lockdep_is_held(>lock))) != NULL) {
>
> checkpatch shows some warnings:
>
> scripts/checkpatch.pl --strict /tmp/file.patch
hbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: kbuild-test-robot compile error
v3: do not export_symbol neigh_remove_one (David Miller comment)
v4: rearrange locking for tbl->lock (Julian Anastasov
er us. Thus as you correctly pointed out,
we need the tbl->lock to make sure we sync with all paths that
can pull the neigh out of the table (and release the table's ref
along the way).
Thanks for catching that, patch v4 (with correct smtp timestamp!)
will be sent out shortly.
--Sowmini
s NUD_FAILED - that may potentially remove
more than one entry, but that's probably harmless?
--Sowmini
st sleep when gc_thresh1 is not reached,
> so such solution is not good enough.
Right the other drawback of relying on gc for cleanup is
that it doesn't give higher preference to cleaning up FAILED
entries first- so it can end up cleaning up other useful entries
(as I was pointing out to David Ahern)
--Sowmini
hbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: kbuild-test-robot compile error
v3: do not export_symbol neigh_remove_one (David Miller comment)
include/net/neighbour.h |1 +
net/core/neigh
hbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: fix kbuild-test-robot compile error.
include/net/neighbour.h |1 +
net/core/neighbour.c| 63 ++
re-ARP (or re-ND), whereas
the admin may just want to lose some selective static arp/NCE entries.
--Sowmini
6 Neighbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/net/neighbour.h |1 +
net/core/neighbour.c| 63 ++
net/ipv4/arp.c |1 +
3 f
On (05/30/17 16:20), Stephen Hemminger wrote:
>
> Please don't copy/paste chunks of code. Instead refactor and make this
> into a helper function.
sure, I have no problems with that, and as I pointed out, I've not
tested ipv6 for this yet either. I'll do all of this after getting
some feedback
RFC patch.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/net/neighbour.h |1 +
net/core/neighbour.c| 42 ++
net/ipv4/arp.c |1 +
3 files changed, 44 insertions(+), 0 deletions(-)
diff --git a/include/n
> Do you find information from a Linux allocation failure report sufficient
> for such an use case?
yes, I suppose that would cover the needed cases.
Your change looks good to me.
Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
where? Note that this type of (infrequent) logging noise is useful
in some cases, e.g., with 8ce675ff, when one is trying to do the
post-mortem of where things first went wrong.
--Sowmini
rt
"definitely wrong" about the requirement of having to read entire
packet in a system call (see POSIX man page)
--Sowmini
ux?
struct ip is a BSD-ism, intended to be used if you were porting
some BSD app.
--Sowmini
On (05/02/17 09:58), Paul Wouters wrote:
>I think you want to use Opportunistic IPsec, eg see
>https://libreswan.org/wiki/HOWTO:_Opportunistic_IPsec
I dont think that what I want is opportunistic ipsec..
taking an extreme example, I can set up the ipsec tunnels with
esp-null for *.5001
umentation for
IP_XFRM_POLICY seems scant, afaict). I'd be happy to share my
udp client program, if it can provide more context to my question.
--Sowmini
Looks good to me.
--Sowmini
header.
Fix look good to me, but could you please also add the bpf_asm input
as a comment to the C code, in case we want to to read/extend this
down the road?
--Sowmini
filled up yet,
before the req.tp_retire_blk_tov (1s in your case) expires.
If there are 0 frames pending, we should not be waking up the app,
so everything seems to be behaving as it should?
--Sowmini
error: rds_conn_shutdown() should continue the
shutdown, and there is no need to log noisy messages about this event.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/connection.c | 10 +-
1 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/n
A couple of minor bugfixes that showed up during testing
Sowmini Varadhan (2):
rds: tcp: allow progress of rds_conn_shutdown if the rds_connection
is marked ERROR by an intervening FIN
rds: tcp: canonical connection order for all paths with index > 0
net/rds/connection.c |
this, rds_connection_worker
should check that cp_index > 0.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/threads.c |2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/rds/threads.c b/net/rds/threads.c
index e36e333..3e447d0 100644
---
function.
Also note the comments above rds_conn_destroy.
Thanks
--Sowmini
ant
to reset the new (accept sock) and leave the old socket in place.
How did you test this? Did you test it with network namespaces?
--Sowmini
> net/rds/tcp_listen.c | 1 +
>
> diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> index 5076788..58aa5bc 100644
> --- a/n
On (03/15/17 10:08), Dmitry Vyukov wrote:
> After I've applied the patch these reports stopped to happen, and I
> have not seem any other reports that look relevant.
> However, it there was one, but it looks like a different issue and it
> was probably masked by massive amounts of original
_dump).
netdev_run_todo takes the rtnl_lock and then wants lock_sock()
for the TCP/IPv4 socket.
Why is lockdep seeing a circular dependancy here? Same pattern
seems to be happening for
http://www.spinics.net/lists/netdev/msg423368.html
and maybe also http://www.spinics.net/lists/netdev/msg42
h acceptor workq
sock_release(listen socket)
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c| 15 ++-
net/rds/tcp.h|2 +-
net/rds/tcp_listen.c |9 +++--
3 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/n
+0x1e7/0x370 [rds_tcp]
tcp_read_sock+0x96/0x1c0
rds_tcp_recv_path+0x65/0x80 [rds_tcp]
:
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/tcp.c | 19 +--
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds
-by: Dmitry Vyukov <dvyu...@google.com>
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/connection.c |1 +
net/rds/rds.h|6 +++---
net/rds/tcp.c|4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/net/rds/connection
address a:b:c:d:e:f blue0 type macvlan
ip link set blue0 netns blue
ip netns exec blue ip addr add 12.0.0.18/24 dev blue0
ip netns exec blue ifconfig blue0 up
sleep 3;
done
Sowmini Varadhan (3):
rds: tcp: Take explicit refcounts on struct net
rds: tcp
the crashes and check that the patches fix
> them.
I can try both, but IME reproducing such things is quite challenging.
Even with things like dtrace-chill on other OSes, it can take a loong
time to nail it. Let's give it a week, while I try out (1) at least.
--Sowmini
?
Reported-by: Dmitry Vyukov <dvyu...@google.com>
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/rds/connection.c |1 +
net/rds/rds.h|6 +++---
net/rds/tcp.c|4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/net/rds/co
This is a test patch being supplied for a trial run on syzkaller.
Explicitly cancel the workq before releasing resources that
will allow netns deletion, so that the connect request does
not trip up on a use-after free of the netns afterward.
Signed-off-by: Sowmini Varadhan <sowmini.va
release+0x3a1/0x4d0 net/rds/af_rds.c:89
> sock_release+0x8d/0x1e0 net/socket.c:599
This is all uspace created rds sockets, and while there may be an
unrelated bug here, I'm not sure I see the netns/kernel-socket
connection.. can you please clarify if this was also seen in some netns
context?
--Sowmini
On (03/01/17 00:14), Dmitry Vyukov wrote:
>
> But the other 2 use-after-frees happened on cp->cp_send_w. Shouldn't
> we cancel it as well? And cp_recv_w?
yes, good point, I missed that. let me see if I can refactor the code
to release the netns as the last thing before free..
Just posted an RFC patch, that I'm also testing here..
hopefully we'll se the pr_info light up, and know that the problematic
situation actually happened (I'll remove the pr_info if/when this
gets submitted as a non-RFC patch).. thanks for helping with testing
this..
--Sowmini
On (02/28/17 18:45), Dmitry Vyukov wrote:
>
> Yes, I can now apply custom patches to the bots. However, it fired
> only 3 times, so it will give weak signal. But at least it will test
> that the patch does not cause other bad things.
Ok, let me do my bit of homework on this one and get back to
hat this does not show up on demand, I'm not
sure how we can check that "the fix worked"..
--Sowmini
e racing). I'll try reproducing this at my end too.
--Sowmini
tcp_conn_paths_destroy. This is done
Its not clear to me that the test is doing any of this...
is this reproducible? let me check if there is some race window where
we can restart a connection attempt when rds_tcp_kill_sock assumes
that the connect worker has been quiesced..
--Sowmini
there kernel tcp sockets to/from
the 16385 port? any hints on what else the test was doing (was it
running a userspace RDS application that triggered the kernel TCP
connection attempt in the first place)?
--Sowmini
: Zhu Yanjun <yanjun@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
t8_t in another would seem inconsistent,
at the very least.
--Sowmini
e from packet_release
before the synchronize_net() - I wonder if this could end up kfree'ing f
when there are threads in the middle of dev_queue_xmit_nit().
--Sowmini
to slow down paths..
I'm travelling at the moment but may be able to give this (try
to reproduce it reliably) next week.
--Sowmini
_bh in __dev_queue_xmit (and
rcu_read_lock in dev_queue_xmit_nit?) should make sure that the nit
packet delivery can be done safely, and the synchronize_net in
packet_release() makes sure that the Tx paths are quiesced before freeing
the socket. What is the race-hole here? Does it have to do with the
_bh and softirq context, somehow?
--Sowmini
. but how does
that sync with the Tx path for other af_inet/af_inet6 sockets?
--Sowmini
On (02/08/17 08:37), Willem de Bruijn wrote:
> Date: Wed, 8 Feb 2017 08:37:19 -0800
> From: Willem de Bruijn <willemdebruijn.ker...@gmail.com>
> To: Sowmini Varadhan <sowmini.varad...@oracle.com>
> Cc: Network Development <netdev@vger.kernel.org>, David Miller
and back into the function.
Other than that, the patchset looks good to me.
--Sowmini
e_change(vio, LDC_EVENT_UP);
> + port->rx_event = 0;
> + return 0;
> + }
>
> err = 0;
> tx_wakeup = 0;
IIRC there were timing-related situations where you can get woken up with
both UP and DATA_READY, and if my reading of your patch is
correct, we would ignore the DATA_READY, and return, right?
--Sowmini
b(sk, hlen + tlen, hlen, len, linear,
> msg->msg_flags & MSG_DONTWAIT, );
do we need a similar check in packet_sendsmg_spkt (even if it
is deprecated, would be better to get it to align with packet_snd?)
Also tpacket_fill_skb should ensure that copylen is set up like
the code above?
--Sowmini
On (02/07/17 15:57), Willem de Bruijn wrote:
>
> From: Willem de Bruijn <will...@google.com>
>
> The stack must not pass packets to device drivers that are shorter
> than the minimum link layer header length.
Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
using existing driver-kernel interfaces
for offload easily, so while we may gain some perf improvment by shaving
of the sk_buff overhead, the trade-off needs to be considered.
--Sowmini
rs support some defensive code against bogus skb's in the
Tx path (the drivers will know, for sure, what's the min non-paged
len they need anyway).
--Sowmini
On (01/27/17 15:51), Willem de Bruijn wrote:
:
> - limit capable() check to drivers with with .validate callback
(aka second option below)
:
> - let privileged applications shoot themselves in the foot (change nothing).
> The second will break variable length header protocols unless
>
sting robustness of the header, so let this one
slip past dev_validate_header at all times"?
It would mean the test suites would have to change slightly.
--Sowmini
ld only take the CAP_SYS_RAWIO branch for IFF_VAR_L2HDR, right?
--Sowmini
?)
Would it make sense to only do the CAP_SYS_RAWIO branch if the
driver declares itself to have variable length L2 headers, via, e.g.,
some priv flag?
--Sowmini
BTW the http://comments.gmane.org/gmane.linux.network/401064 referred
to in commit 2793a23 is not accessible any more, not
is not needed?
still havent googled up prior discussions that led
to dev_validate_header- will probably do that tomorrow AM.
--Sowmini
On (01/26/17 15:21), Willem de Bruijn wrote:
> > If the application has provided fewer than hard_header_len bytes,
> > dev_validate_header() will zero out the skb->data as needed. This is
> > acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,
>
> This was added not for datagram
On (01/25/17 15:06), Paul Durrant wrote:
>
> Making netfront cope with a fully non-linear skb looks like it would
> be quite intrusive and probably not worth it so I opted for just doing
> the ETH_HLEN pull-tail if necessary. Can you check it works for you?
I tested it, and it works fine, but
ed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
include/linux/netdevice.h | 11 ++-
net/packet/af_packet.c| 27 +--
2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3868
ok let me work with that and get back (hopefully with an
RFC patch).
--Sowmini
On (01/19/17 13:47), Sowmini Varadhan wrote:
> > Specifically I'm talking about the dev_validate_header() check.
> > That is supposed to protect us from these kinds of situations.
>
> ah, but I run my pf_packet application as root, so I have
> capable(CAP_SYS_RAWI
ations.
ah, but I run my pf_packet application as root, so I have
capable(CAP_SYS_RAWIO), so I slip through the dev_validate_header()
check.
--Sowmini
On (01/19/17 11:31), Paul Durrant wrote:
> Sowmini,
>
> Yeah, it would be useful to verify any change fixes the particular
> issue you're seeing so please share the program. For the non-empty
> non-linear case I'd hope that catching this and doing a pull of some
> sensible amoun
On (01/19/17 09:36), Paul Durrant wrote:
>
> Hi Sowmini,
>
> Sounds like a straightforward bug to me... netfront should be able
> to handle an empty skb and clearly, if it's relying on skb_headlen()
> being non-zero, that's not the case.
>
> Paul
I see. Seems like t
order to create the first_tx? That may not always be
true, how does the code recover for purely non-linear skbs?
--Sowmini
/IPv4 packets that meet the ip_len and test-character
requirements. Include the bpf_asm src as a comment, in case this
needs to be enhanced in the future
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
tools/testing/selftests/net/psock_lib.h
On (01/11/17 12:43), Eric Dumazet wrote:
>
> On Wed, 2017-01-11 at 14:59 -0500, Sowmini Varadhan wrote:
>
> > I think the RFC states somewhere that you should never ever
> > send out a v4 mapped address on the wire.
>
> Can you point the exact RFC ?
>
> ht
tch a merry little packet exchange on
the wire, where we send out an NS for :::13.0.0.1
to the solmcast of the v4 mapped address, the peer politely
responds, and the 2 nodes then have a nice chat.
I think the RFC states somewhere that you should never ever
send out a v4 mapped address on the wire.
--Sowmini
Commit 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3")
now makes it possible to use TX_RING with TPACKET_V3, so make the
the relevant information available via 'ss -e -a --packet'
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
net/packet/diag.c |
packets that will be passed up are
those received on loopback that pass the attached filter.
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: patch reworked based on comments from Willem de Bruijn
v4: dropped patch 1/2: leave it soft;
Send patch 2/2 to the owner of tools/t
clear:
the current v3 version *is* the "hardening" part. Dan's suggestion is
that the hand-crafted version can be replaced by bpf_asm generated code
later. That would be the "cleanup" part, which I was going to do in a
later commit.
Does that help?
--Sowmini
achable here, but since 7f953ab2ba46 ("af_packet:
> TX_RING support for TPACKET_V3") it's not anymore. Fix it by filling
> the timestamp back into the ring slot.
Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
--Sowmini
, the above is conformant with the comment style required for
networking:
https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt
which seems to be used in psock_fanout.c and reuseport_bpf.c as well.
Thanks
--Sowmini
On (01/04/17 15:37), Shuah Khan wrote:
> Looks like you have to do v4 anyway, please make sure your comment
> block is one of the acceptable formats based on coding style:
I'm not sure about that. I can just keep patch 2.
thanks,
--Sowmini
the checks in the existing code.
Would it be ok to leave the extremely subjective
"make this more readable" part for you to tackle later?
Or I can just drop patch1, and you can fix it to your
satisfaction later.
--Sowmini
to parse / less readable with macros
> actually. Rest seems fine, thanks.
You think the earlier code was readable? I had to use
gcc -E, with help from the bpf(4) page, to make sense of it.
--Sowmini
re binding the socket to ETH_P_ALL and lo.
v2: patch 2 reworked based on review comments.
v3: Shuah Khan nit.
Sowmini Varadhan (2):
tools: psock_lib: tighten conditions checked in sock_setfilter
tools: psock_tpacket: block Rx until socket filter has been added and
socket has been bound t
201 - 300 of 629 matches
Mail list logo