Re: [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
On 9/11/18 12:38 PM, Tushar Dave wrote: RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages arrive in form of skb (over TCP) and scatterlist (over IB/RDMA). However, because socket filter only deal with skb (e.g. struct skb as bpf context) we can only use socket filter for rds_tcp and not for rds_rdma. Considering one filtering solution for RDS, it seems that the common denominator between sk_buff and scatterlist is scatterlist. Therefore, this patch converts skb to sgvec and invoke sg_filter_run for rds_tcp and simply invoke sg_filter_run for IB/rds_rdma. Signed-off-by: Tushar Dave Reviewed-by: Sowmini Varadhan --- I remember acking the earlier version. Here it is again.. Acked-by: Santosh Shilimkar
Re: [PATCH 1/1] net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6
On 8/25/18 12:19 AM, Zhu Yanjun wrote: In IPv4, the newly introduced rdma_read_gids is used to read the SGID/DGID for the connection which returns GID correctly for RoCE transport as well. In IPv6, rdma_read_gids is also used. The following are why rdma_read_gids is introduced. rdma_addr_get_dgid() for RoCE for client side connections returns MAC address, instead of DGID. rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and when more than one IP address is assigned to the netdevice. So the transport agnostic rdma_read_gids() API is provided by rdma_cm module. Signed-off-by: Zhu Yanjun --- Looks ok. Acked-by: Santosh Shilimkar
Re: [PATCH v5 net-next 2/3] rds: Enable RDS IPv6 support
On 7/23/18 8:51 PM, Ka-Cheong Poon wrote: This patch enables RDS to use IPv6 addresses. For RDS/TCP, the listener is now an IPv6 endpoint which accepts both IPv4 and IPv6 connection requests. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. v5: Fixed syntax problem (David Miller). v4: Changed port history comments in rds.h (Sowmini Varadhan). v3: Added support to set up IPv4 connection using mapped address (David Miller). Added support to set up connection between link local and non-link addresses. Various review comments from Santosh Shilimkar and Sowmini Varadhan. v2: Fixed bound and peer address scope mismatched issue. Added back rds_connect() IPv6 changes. Signed-off-by: Ka-Cheong Poon --- Acked-by: Santosh Shilimkar
Re: [PATCH v5 net-next 3/3] rds: Extend RDS API for IPv6 support
On 7/23/18 8:51 PM, Ka-Cheong Poon wrote: There are many data structures (RDS socket options) used by RDS apps which use a 32 bit integer to store IP address. To support IPv6, struct in6_addr needs to be used. To ensure backward compatibility, a new data structure is introduced for each of those data structures which use a 32 bit integer to represent an IP address. And new socket options are introduced to use those new structures. This means that existing apps should work without a problem with the new RDS module. For apps which want to use IPv6, those new data structures and socket options can be used. IPv4 mapped address is used to represent IPv4 address in the new data structures. v4: Revert changes to SO_RDS_TRANSPORT Signed-off-by: Ka-Cheong Poon --- Acked-by: Santosh Shilimkar
Re: [PATCH v5 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 7/23/18 8:51 PM, Ka-Cheong Poon wrote: This patch changes the internal representation of an IP address to use struct in6_addr. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. But RDS socket layer is not modified such that it still does not accept IPv6 address from an application. And RDS layer does not accept nor initiate IPv6 connections. v2: Fixed sparse warnings. Signed-off-by: Ka-Cheong Poon --- Acked-by: Santosh Shilimkar
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 7/6/18 2:08 AM, Ka-Cheong Poon wrote: On 07/06/2018 01:58 AM, Santosh Shilimkar wrote: diff --git a/net/rds/connection.c b/net/rds/connection.c index abef75d..ca72563 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -142,9 +151,12 @@ static void __rds_conn_path_init(struct rds_connection *conn, * are torn down as the module is removed, if ever. */ static struct rds_connection *__rds_conn_create(struct net *net, - __be32 laddr, __be32 faddr, - struct rds_transport *trans, gfp_t gfp, - int is_outgoing) + const struct in6_addr *laddr, + const struct in6_addr *faddr, + struct rds_transport *trans, + gfp_t gfp, + int is_outgoing, + int dev_if) Am just wondering if we can handle local link address case differently instead of pushing the interface index everywhere. Did you think about any alternative. This can also avoid bunch of changes again and hence the question. Am trying to see if we can minimize the changes to absolute must have to support IPv6. If link local address is supported, scoped ID must be used. Unless we remove the support of link local address, we need to keep scope ID. Ok. diff --git a/net/rds/ib.h b/net/rds/ib.h index a6f4d7d..12f96b3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h +union rds_ib_conn_priv { + struct rds_ib_connect_private ricp_v4; + struct rds6_ib_connect_private ricp_v6; }; This change was invetiable. Just add a comment saying the priviate data size for v6 vs v4 is different. Also for IPv6 priviate data, I suggest add some resrve fields for any future extensions so that things can be added without breaking wire protocol. With IPv4, we ended up in bit of mess. Will add some comments. Unfortunately the IB private data exchange has only a limited space. I don't think there is any more space left for reserved field. I think we should think of a different way to handle extensions in future. There is enough space. You can send upto 512 bytes iirc. Please add some reserve fields and ping me if you run into issues. diff --git a/net/rds/send.c b/net/rds/send.c index 94c7f74..6ed2e92 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -1081,27 +1085,59 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len) goto out; } - if (msg->msg_namelen) { - /* XXX fail non-unicast destination IPs? */ - if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) { + namelen = msg->msg_namelen; + if (namelen != 0) { + if (namelen < sizeof(*usin)) { + ret = -EINVAL; + goto out; + } + switch (namelen) { + case sizeof(*usin): + if (usin->sin_family != AF_INET || + usin->sin_addr.s_addr == htonl(INADDR_ANY) || + usin->sin_addr.s_addr == htonl(INADDR_BROADCAST) || + IN_MULTICAST(ntohl(usin->sin_addr.s_addr))) { + ret = -EINVAL; The idea was to fail non-unicast IP someday but can you not add this change as part of v6 series. We can look at it later unless you need this change for v6. Again the question is mainly to add only necessary check and leave the existing behavior as is so please re-check all below checks too. This is to match the same IPv6 check. In IPv6 code, it checks if the address is a unicast address. I can remove the IPv4 checks but if an app does send to a multicast address, the connection will be stuck forever. OK. Lets keep it then. diff --git a/net/rds/transport.c b/net/rds/transport.c index 0b188dd..c9788db 100644 --- a/net/rds/transport.c +++ b/net/rds/transport.c + if (ipv6_addr_v4mapped(addr)) { Dave already commented on ipv6_addr_v4mapped(). Apart from some of those comments questions, rest of the patch looks really good and nicely done. Will also look at your subsequent two patches and try to send you comments in next few days. Do you mean using mapped address to create IPv4 connections? I already have it in my work space. Will be in v3. yeah. Thanks !!
Re: general protection fault in rds_ib_get_mr
On 5/13/18 2:10 PM, Eric Biggers wrote: On Wed, Mar 21, 2018 at 09:00:01AM -0700, syzbot wrote: [...] Still reproducible on Linus' tree (commit 66e1c94db3cd4) and linux-next (next-20180511). Here's a simplified reproducer: Thanks for the test case !! Regards, Santosh
Re: [PATCH 1/1] Revert "rds: ib: add error handle"
On 4/23/18 6:39 PM, Zhu Yanjun wrote: This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc. After long time discussion and investigations, it seems that there is no mem leak. So this patch is reverted. Signed-off-by: Zhu Yanjun--- Well your fix was not for any leaks but just proper labels for graceful exits. I don't know which long time discussion you are referring but there is no need to revert this change unless you see any issue with your change. Regards, Santosh
Re: [PATCH net] rds: MP-RDS may use an invalid c_path
On 4/11/18 12:57 AM, Ka-Cheong Poon wrote: rds_sendmsg() calls rds_send_mprds_hash() to find a c_path to use to send a message. Suppose the RDS connection is not yet up. In rds_send_mprds_hash(), it does if (conn->c_npaths == 0) wait_event_interruptible(conn->c_hs_waitq, (conn->c_npaths != 0)); If it is interrupted before the connection is set up, rds_send_mprds_hash() will return a non-zero hash value. Hence rds_sendmsg() will use a non-zero c_path to send the message. But if the RDS connection ends up to be non-MP capable, the message will be lost as only the zero c_path can be used. Signed-off-by: Ka-Cheong Poon <ka-cheong.p...@oracle.com> --- Thanks for posting the fix upstream as well. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 2/2] net: rds: drop VLA in rds_walk_conn_path_info()
On 3/11/18 2:07 PM, Salvatore Mesoraca wrote: Avoid VLA[1] by using an already allocated buffer passed by the caller. [1] https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Salvatore Mesoraca <s.mesorac...@gmail.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 1/2] net: rds: drop VLA in rds_for_each_conn_info()
On 3/11/18 2:07 PM, Salvatore Mesoraca wrote: Avoid VLA[1] by using an already allocated buffer passed by the caller. [1] https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Salvatore Mesoraca <s.mesorac...@gmail.com> --- Thanks for both VLA fixes Salvatore. FWIW, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH][rds-next] rds: make functions rds_info_from_znotifier and rds_message_zcopy_from_user static
On 3/11/18 11:54 PM, santosh.shilim...@oracle.com wrote: On 3/11/18 10:03 AM, Colin King wrote: From: Colin Ian King <colin.k...@canonical.com> Functions rds_info_from_znotifier and rds_message_zcopy_from_user are local to the source and do not need to be in global scope, so make them static. Cleans up sparse warnins: net/rds/message.c:70:27: warning: symbol 'rds_info_from_znotifier' was not declared. Should it be static? net/rds/message.c:358:5: warning: symbol 'rds_message_zcopy_from_user' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.k...@canonical.com> --- net/rds/message.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> This was fixed by kbuild robot and the fix is already in net-next. Regards, Santosh
Re: [PATCH][rds-next] rds: make functions rds_info_from_znotifier and rds_message_zcopy_from_user static
On 3/11/18 10:03 AM, Colin King wrote: From: Colin Ian King <colin.k...@canonical.com> Functions rds_info_from_znotifier and rds_message_zcopy_from_user are local to the source and do not need to be in global scope, so make them static. Cleans up sparse warnins: net/rds/message.c:70:27: warning: symbol 'rds_info_from_znotifier' was not declared. Should it be static? net/rds/message.c:358:5: warning: symbol 'rds_message_zcopy_from_user' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.k...@canonical.com> --- net/rds/message.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH][rds-next] rds: remove redundant variable 'sg_off'
On 3/11/18 9:27 AM, Colin King wrote: From: Colin Ian King <colin.k...@canonical.com> Variable sg_off is assigned a value but it is never read, hence it is redundant and can be removed. Cleans up clang warning: net/rds/message.c:373:2: warning: Value stored to 'sg_off' is never read Signed-off-by: Colin Ian King <colin.k...@canonical.com> --- net/rds/message.c | 2 -- 1 file changed, 2 deletions(-) Thanks Colin !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH v2 net] rds: Incorrect reference counting in TCP socket creation
On 3/1/18 9:07 PM, Ka-Cheong Poon wrote: Commit 0933a578cd55 ("rds: tcp: use sock_create_lite() to create the accept socket") has a reference counting issue in TCP socket creation when accepting a new connection. The code uses sock_create_lite() to create a kernel socket. But it does not do __module_get() on the socket owner. When the connection is shutdown and sock_release() is called to free the socket, the owner's reference count is decremented and becomes incorrect. Note that this bug only shows up when the socket owner is configured as a kernel module. v2: Update comments Versioning comment typically goes below "---" and not part of commit message. Signed-off-by: Ka-Cheong Poon <ka-cheong.p...@oracle.com> --- Patch looks fine. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: BUG: spinlock bad magic (2)
On 1/30/18 2:28 PM, Eric Biggers wrote: On Mon, Dec 18, 2017 at 06:01:30PM +0100, 'Dmitry Vyukov' via syzkaller-bugs wrote: On Mon, Dec 18, 2017 at 5:46 PM, Santosh Shilimkar <santosh.shilim...@oracle.com> wrote: On 12/18/2017 4:36 AM, syzbot wrote: Hello, syzkaller hit the following crash on 6084b576dca2e898f5c101baef151f7bfdbb606d git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. [...] BUG: unable to handle kernel NULL pointer dereference at 0028 IP: rds_send_xmit+0x80/0x930 net/rds/send.c:186 This one seems to be same bug as reported as below. BUG: unable to handle kernel NULL pointer dereference in rds_send_xmit Hi Santosh, The proper syntax to tell syzbot about dups is this (from email footer): See https://goo.gl/tpsmEJ for details. Please credit me with: Reported-by: syzbot <syzkal...@googlegroups.com> syzbot will keep track of this bug report. To mark this as a duplicate of another syzbot report, please reply with: #syz dup: exact-subject-of-another-report Note: all commands must start from beginning of the line in the email body. #syz dup: BUG: unable to handle kernel NULL pointer dereference in rds_send_xmit Ok. Noted.
Re: [PATCH] rds: fix use-after-free read in rds_find_bound
On 12/31/17 4:33 AM, Sowmini Varadhan wrote: On (12/30/17 21:09), santosh.shilim...@oracle.com wrote: Right. This was loop transport in action so xmit will just flip the direction with receive. And rds_recv_incoming() can race with socket_release. rds_find_bound() is suppose to add ref count on socket for rds_recv_incoming() but by that time socket is DEAD & freed by socket release callback. correct, that makes sense. Yea. In fact the earlier point of sk being null and rs not is also possible because socket release explicitly marks its NULL ("sock->sk = NULL"). But it just side effect of the actual race. And rds_release is suppose to be synced with rs_recv_lock. But release callback is marking the sk orphan before syncing up with receive path and updating the bind table. Probably it can pushed down further after the socket clean up buut need to think bit more. However, I'm not sure this seals the race.. according to the bug report rds_recv_incoming->rds_find_bound is being called in rds_send_worker context and the rds_find_bound code is 63 rs = rhashtable_lookup_fast(_hash_table, , ht_parms); 64 if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) 65 rds_sock_addref(rs); 66 else 67 rs = NULL; 68 Since the entire logic of rds_release can interleave between line 63 and 64, (whereas we only addref at line 65), moving the sock_orphan will not help. I see that there was an explicic synchornization via the bucket->lock before 7b5654349e. I think you need something like that, or some type or rcu-based hash list. The rhashtable already has internal bucket lock so those operation like add/remove are synced up. But yes reference addition can still race with receive since receive lock is taken after find bound. Patch below may make race-window smaller, but race window is still there. If the receive lock is taken ahead then with sock_orphan change socket release will get synchronized with receive. But preventing socket release to be getting triggered while in receive path by means ref count is better option. Moving socket_orphan down is anyway a good change but its not enough. Will think bit more about it. Thanks for the good discussion. Regards, Santosh
Re: [PATCH] rds: fix use-after-free read in rds_find_bound
On 12/30/17 2:32 PM, Sowmini Varadhan wrote: On (12/30/17 13:37), santosh.shilim...@oracle.com wrote: [...] Thats what I thought as well initially but since the reported case, the rs seems to be valid where as sk seems to be freed up as part of sock_release callback. I dont understand the statement above- how can "rs be valid, and sk be freed"? rs_sk is embedded in the struct rds_sock, it is not a pointer. I was going with order of evaluation of if () but you made good point. rs_sk isn't a pointer so sk can't be null. let's find and fix the refcount bug. See stack trace in commit comment. The socket release is happening prematurely and existing WARN_ONs are not catching it. Right. This was loop transport in action so xmit will just flip the direction with receive. And rds_recv_incoming() can race with socket_release. rds_find_bound() is suppose to add ref count on socket for rds_recv_incoming() but by that time socket is DEAD & freed by socket release callback. And rds_release is suppose to be synced with rs_recv_lock. But release callback is marking the sk orphan before syncing up with receive path and updating the bind table. Probably it can pushed down further after the socket clean up buut need to think bit more. diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index b405f77..11e1426 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -65,7 +65,6 @@ static int rds_release(struct socket *sock) rs = rds_sk_to_rs(sk); - sock_orphan(sk); /* Note - rds_clear_recv_queue grabs rs_recv_lock, so * that ensures the recv path has completed messing * with the socket. */ @@ -85,6 +84,7 @@ static int rds_release(struct socket *sock) rds_trans_put(rs->rs_transport); + sock_orphan(sk); sock->sk = NULL; sock_put(sk); out:
Re: [PATCH] rds: fix use-after-free read in rds_find_bound
On 12/30/17 12:26 PM, Sowmini Varadhan wrote: On (12/30/17 11:36), Santosh Shilimkar wrote: socket buffer can get freed as part of sock_close callback so before adding reference check underneath socket validity. I'm not sure I understand this fix- struct rds_sock is: struct rds_sock { struct sock rs_sk; : } How can rs be non-null but rds_rs_to_sk() is null? (Note that rds_rs_to_sk just returns >rs_sk) so the changed line is identical to the original line. Well thats what the report says o.w flag test wouldn't have been attempted. - if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) + if (rs && rds_rs_to_sk(rs) && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) I think the real issue is refcount bug somewhere, Thats what I thought as well initially but since the reported case, the rs seems to be valid where as sk seems to be freed up as part of sock_release callback. Was the syzbot test run with http://patchwork.ozlabs.org/patch/852492/ this sounds like that type of bug. That fix scenario, the rs don't get inserted in hash table and in this particular bug, the lookup was successful so am not sure if these two bugs are related. But since bound address fix was still not part of the build reproduced use after free bug, $subject fix can wait for next reproduction. Unfortunately as per the report, there is no reproducer for it to test if other fix fixes this issue. Regards, Santosh
Re: KASAN: use-after-free Read in rds_find_bound
On 12/30/17 1:17 AM, syzbot wrote: Hello, syzkaller hit the following crash on fba961ab29e5ffb055592442808bb0f7962e05da git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+93a5839deb3555374...@syzkaller.appspotmail.com Posted a fix[1] for above issue. Didn't test it but looks straight forward. Regards, Santosh
Re: [PATCH net] RDS: Check cmsg_len before dereferencing CMSG_DATA
On 12/21/17 8:17 PM, Avinash Repaka wrote: RDS currently doesn't check if the length of the control message is large enough to hold the required data, before dereferencing the control message data. This results in following crash: BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013 [inline] BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066 Read of size 8 at addr 8801c928fb70 by task syzkaller455006/3157 CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x25b/0x340 mm/kasan/report.c:409 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 rds_rdma_bytes net/rds/send.c:1013 [inline] rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066 sock_sendmsg_nosec net/socket.c:628 [inline] sock_sendmsg+0xca/0x110 net/socket.c:638 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108 SYSC_sendmmsg net/socket.c:2139 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2134 entry_SYSCALL_64_fastpath+0x1f/0x96 RIP: 0033:0x43fe49 RSP: 002b:7fffbe244ad8 EFLAGS: 0217 ORIG_RAX: 0133 RAX: ffda RBX: 004002c8 RCX: 0043fe49 RDX: 0001 RSI: 2020c000 RDI: 0003 RBP: 006ca018 R08: R09: R10: R11: 0217 R12: 004017b0 R13: 00401840 R14: R15: To fix this, we verify that the cmsg_len is large enough to hold the data to be read, before proceeding further. Reported-by: syzbot <syzkaller-b...@googlegroups.com> Signed-off-by: Avinash Repaka <avinash.rep...@oracle.com> --- Thanks !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH v2 4/5] rds: Add runchecks.cfg for net/rds
On 12/16/17 10:24 AM, Joe Perches wrote: On Sat, 2017-12-16 at 09:45 -0800, Stephen Hemminger wrote: On Sat, 16 Dec 2017 15:42:29 +0100 Knut Omangwrote: +# Code simplification: +# +except ALLOC_WITH_MULTIPLY ib.c +except PREFER_PR_LEVEL ib_cm.c ib_recv.c ib_send.c rdma_transport.c threads.c transport.c +except UNNECESSARY_ELSE ib_fmr.c +except UNNECESSARY_PARENTHESES ib_rdma.c rdma.c recv.c send.c +except PRINTK_RATELIMITED ib_frmr.c +except EMBEDDED_FUNCTION_NAME ib_rdma.c + +# Style and readability: +# +except BRACES ib_cm.c ib_rdma.c ib_recv.c send.c transport.c +except OOM_MESSAGE ib.c tcp.c +except LONG_LINE_STRING ib.c ib_recv.c ib_send.c +except FUNCTION_ARGUMENTS ib.h ib_mr.h rds.h tcp.h +except OPEN_ENDED_LINE recv.c ib_recv.c + +# Candidates to leave as exceptions (don't fix): +except MULTIPLE_ASSIGNMENTS ib_send.c +except LONG_LINE_STRING connection.c +except OPEN_BRACE connection.c + Why start letting subsystems have a free-pass? Also this would mean that new patches to IB would continue the bad habits. And I don't need any free pass for RDS either. I missed V1 of this series but Knut, please don't add any exceptions for RDS and if there is something needs to be fixed, we can address it. Once your infrastructure gets merged, the subsequent fixes can be added. I agree with this comment at least for net/rds. Most of these existing messages from checkpatch should probably be inspected and corrected where possible to minimize the style differences between this subsystem and the rest of the kernel. For instance, here's a trivial patch to substitute pr_ for printks and a couple braces next to these substitutions. Thanks Joe. I actually had a similar patch a while back but since it was lot of churn, and code was already merged, never submitted it and then later forgot about it. Will look into it. btw: in ib_cm, why is one call to ib_modify_qp emitted with a -ret and the other with a positive err? Its oversight and will fix that. Regards, Santosh
Re: [PATCH] RDS: constify rhashtable_params
On 8/30/17 4:49 AM, Arvind Yadav wrote: rhashtable_params are not supposed to change at runtime. All Functions rhashtable_* working with const rhashtable_params provided by . So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- This is already addressed in net-next by [1] Regards, Santosh [1] https://lkml.org/lkml/2017/8/25/482
Re: [PATCH 0/3] ARM: dts: keystone-k2g: Add DCAN instances to 66AK2G
On 8/7/17 2:31 PM, Franklin S Cooper Jr wrote: Hi Santosh, On 08/04/2017 12:07 PM, Santosh Shilimkar wrote: Hi Franklin, On 8/2/2017 1:18 PM, Franklin S Cooper Jr wrote: Add D CAN nodes to 66AK2G based SoC dtsi. Franklin S Cooper Jr (2): dt-bindings: net: c_can: Update binding for clock and power-domains property ARM: configs: keystone: Enable D_CAN driver Lokesh Vutla (1): ARM: dts: k2g: Add DCAN nodes Any DCAN driver dependency with these patchset ? If not, I can queue this up so do let me know. There aren't any dependencies. Applied. Thanks !! Regards, Santosh
Re: [PATCH] RDS: IB: NULL dereference on error in rds_ib_alloc_frmr()
On 6/14/17 3:39 AM, Dan Carpenter wrote: We accidentally return ERR_PTR(0) if ib_alloc_mr() fails. The caller is expecting error pointers so it results in a NULL dereference. Fixes: 1659185fb4d0 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com> Thanks Dan for fix. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> You haven't copied netdev. Can you please resend the patch with my ack on netdev so that Dave can pick it up.
Re: [PATCH net-next 2/2] rds: tcp: canonical connection order for all paths with index > 0
On 3/31/17 3:56 PM, Sowmini Varadhan wrote: The rds_connect_worker() has a bug in the check that enforces the canonical connection order described in the comments of rds_tcp_state_change(). The intention is to make sure that all the multipath connections are always initiated by the smaller IP address via rds_start_mprds. To achieve this, rds_connection_worker should check that cp_index > 0. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net-next 1/2] rds: tcp: allow progress of rds_conn_shutdown if the rds_connection is marked ERROR by an intervening FIN
On 3/31/17 3:56 PM, Sowmini Varadhan wrote: rds_conn_shutdown() runs in workq context, and marks the rds_connection as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O has quiesced, we may still find the rds_connection state to be RDS_CONN_ERROR if an intervening FIN was processed in softirq context. This is not a fatal error: rds_conn_shutdown() should continue the shutdown, and there is no need to log noisy messages about this event. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject
On 3/12/17 12:33 PM, Leon Romanovsky wrote: On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote: When rdma_accept fails, rdma_reject is called in it. As such, it is not necessary to execute rdma_reject again. Cc: Joe Jin <joe@oracle.com> Cc: Junxiao Bi <junxiao...@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> Signed-off-by: Zhu Yanjun <yanjun@oracle.com> --- Change from v1 to v2: Add the acker. net/rds/ib_cm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index ce3775a..eca3d5f 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, event->param.conn.initiator_depth); /* rdma_accept() calls rdma_reject() internally if it fails */ - err = rdma_accept(cm_id, _param); - if (err) + if (rdma_accept(cm_id, _param)) rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err); You omitted initialization of "err" variable which you print here ^. Its inited by rds_ib_setup_qp() but you are right. It will print failed with error = 0. :-) Zhu, please drop that 'err' from the message.
Re: [PATCH 2/5] rds: ib: replace spin_lock_irq with spin_lock_irqsave
On 3/11/17 6:33 PM, Yanjun Zhu wrote: Sorry. I have no test case to show some issue. But from Linux Kernel Development Second Edition by Robert Love. Yes I know the book and what the API does :D Use spin_lock_irq is dangerous since spin_unlock_irq unconditionally enables interrupts. We can assume the following scenario: --->the interrupt is disabled. spin_lock_irq(lock_ptr); <---this will disable interrupt again list_del(>ib_node); spin_unlock_irq(lock_ptr); <---this will enable interrupt >the interrupt is enabled. our code change the state of interrupt. This will make potential risk. But spin_lock_irqsave/spin_unlock_irqrestore will not make potential risk. ic is well protected for any possible race and hence I asked if you had any test case. Please re-post the series again with the subject patch dropper for Dave to pick it up. Regards, Santosh
Re: [PATCH net 0/3] rds: tcp: fix various rds-tcp issues during netns create/delete sequences
On 3/4/17 8:57 AM, Sowmini Varadhan wrote: Dmitry Vyukov reported some syszkaller panics during netns deletion. While I have not been able to reproduce those exact panics, my attempts to do so uncovered a few other problems, which are fixed patch 2 and patch 3 of this patch series. In addition, as mentioned in, https://www.spinics.net/lists/netdev/msg422997.html code-inspection points that the rds_connection needs to take an explicit refcnt on the struct net so that it is held down until all cleanup is completed for netns removal, and this is fixed by patch1. Hopefully Dmitry can try the series and see if it fixes the issue(s). The fixes looks good to me. FWIW, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 1/1] rds: fix memory leak error
On 2/24/17 11:12 AM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Fri, 24 Feb 2017 08:49:19 -0800 On 2/24/2017 1:28 AM, Zhu Yanjun wrote: When the function register_netdevice_notifier fails, the memory allocated by kmem_cache_create should be freed by the function kmem_cache_destroy. Cc: Joe Jin <joe@oracle.com> Cc: Junxiao Bi <junxiao...@oracle.com> Signed-off-by: Zhu Yanjun <yanjun@oracle.com> --- Minor nit with subject. s/rds/RDS: TCP: Please, I definitely prefer people use all lowercase letters for the susbsystem prefix in the subject line. Note for the future. Thanks Dave !! Regards, Santosh
Re: [net-next][PATCH] RDS: keep data type consistent in the user visible header
On 2/21/17 6:29 AM, David Laight wrote: The entire file should use the proper "__uX" kernel types rather than the uint* ones. The uint* ones are part of the C standard :-) Should have been uint*_t :-)
Re: [net-next][PATCH v2] RDS: Make user visible data types consistent with rest of the kernel
On 2/20/17 7:18 PM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Mon, 20 Feb 2017 14:16:53 -0800 Use "__uX/__sX" kernel types rather than current uint*/int* for entire file. Reviewed-by: Sowmini Varadhan <sowmini.varad...@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> --- v1->v2: As suggested by David Miller, cleaned up entire header to use kernel types. Dmitry Levin already submitted a patch the other day which does this, and it is already in net-next. I see that now. Thanks Dave !! Regards, Santosh
Re: [PATCH 7/9] RDS: IB: Remove an unused structure member
On 1/10/17 4:56 PM, Bart Van Assche wrote: Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com> Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH] RDS: Simplify code
On 9/4/16 11:23 AM, Leon Romanovsky wrote: On Sun, Sep 04, 2016 at 05:57:20PM +0200, Christophe JAILLET wrote: Le 04/09/2016 à 14:20, Leon Romanovsky a écrit : On Sat, Sep 03, 2016 at 07:33:29AM +0200, Christophe JAILLET wrote: Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to 'list_splice_init'. It is not 100% accurate list_splice(y, z) INIT_LIST_HEAD(y) ==> if (!list_empty(y)) __list_splice(y, z, z>next); INIT_LIST_HEAD(y) and not if (!list_empty(y)) { __list_splice(y, z, z>next); INIT_LIST_HEAD(y) } as list_splice_init will do. You are right but if you dig further you will see that calling INIT_LIST_HEAD on an empty list is a no-op (AFAIK). And if this list was not already correctly initialized, then you would have some other troubles. Thank you for the suggestion, It looks like the code after that can be skipped in case of loop_conns list is empty, the tmp_list will be empty too. 174 list_for_each_entry_safe(lc, _lc, _list, loop_node) { 175 WARN_ON(lc->conn->c_passive); 176 rds_conn_destroy(lc->conn); 177 } Thanks for trying. As already pointed, your change doesn't simplify much rather change the behavior. The loop cursor already takes care of list empty case. I don't see any reason to change that code. Regards, Santosh
Re: [PATCH] net: rds: fix coding style issues
On 6/18/16 8:46 AM, Joshua Houghton wrote: Fix coding style issues in the following files: ib_cm.c: add space loop.c: convert spaces to tabs sysctl.c: add space tcp.h:convert spaces to tabs tcp_connect.c:remove extra indentation in switch statement tcp_recv.c: convert spaces to tabs tcp_send.c: convert spaces to tabs transport.c: move brace up one line on for statement Signed-off-by: Joshua Houghton <j...@awful.name> --- Thanks for doing it. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [net-next][PATCH 0/2] RDS: couple of fixes for 4.6
On 4/13/16 8:36 PM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Fri, 8 Apr 2016 15:26:38 -0700 Patches are also available at below git tree. git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds-fixes "Bug fixes for 4.6" do not get targetted at the net-next tree, that's for 4.7 development. Sorry. Should have based it against 'net'. Just posted re-based version. Thanks !! Regards, Santosh
Re: [PATCH] RDS: sync congestion map updating
On 4/1/16 6:14 PM, Leon Romanovsky wrote: On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote: (cc-ing netdev) On 3/30/2016 7:59 PM, Wengang Wang wrote: 在 2016年03月31日 09:51, Wengang Wang 写道: 在 2016年03月31日 01:16, santosh shilimkar 写道: Hi Wengang, On 3/30/2016 9:19 AM, Leon Romanovsky wrote: On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote: Problem is found that some among a lot of parallel RDS communications hang. In my test ten or so among 33 communications hang. The send requests got -ENOBUF error meaning the peer socket (port) is congested. But meanwhile, peer socket (port) is not congested. The congestion map updating can happen in two paths: one is in rds_recvmsg path and the other is when it receives packets from the hardware. There is no synchronization when updating the congestion map. So a bit operation (clearing) in the rds_recvmsg path can be skipped by another bit operation (setting) in hardware packet receving path. To be more detailed. Here, the two paths (user calls recvmsg and hardware receives data) are for different rds socks. thus the rds_sock->rs_recv_lock is not helpful to sync the updating on congestion map. For archive purpose, let me try to conclude the thread. I synced with Wengang offlist and came up with below fix. I was under impression that __set_bit_le() was atmoic version. After fixing it like patch(end of the email), the bug gets addressed. I will probably send this as fix for stable as well. From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001 From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Wed, 30 Mar 2016 23:26:47 -0700 Subject: [PATCH] RDS: Fix the atomicity for congestion map update Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Lets use atomics to avoid such an issue. Full credit to Wengang <wen.gang.w...@oracle.com> for finding the issue, analysing it and also pointing out to offending code with spin lock based fix. I'm glad that you solved the issue without spinlocks. Out of curiosity, I see that this patch is needed to be sent to Dave and applied by him. Is it right? Right. I was planning send this one along with one more fix together on netdev for Dave to pick it up. ➜ linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c Santosh Shilimkar <santosh.shilim...@oracle.com> (supporter:RDS - RELIABLE DATAGRAM SOCKETS) "David S. Miller" <da...@davemloft.net> (maintainer:NETWORKING [GENERAL]) netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS) linux-r...@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS) rds-de...@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM SOCKETS) linux-ker...@vger.kernel.org (open list) Signed-off-by: Wengang Wang <wen.gang.w...@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> Reviewed-by: Leon Romanovsky <l...@leon.nu> Thanks for review.
Re: [PATCH] rds: rds-stress show all zeros after few minutes
Hi Shamir, Nice to see this one soon on the list, Just to make $subject more relevant. How about below? RDS: fix congestion map corruption for PAGE_SIZE > 8k On 3/30/16 5:50 PM, shamir rabinovitch wrote: Issue can be seen on platforms that use 8K and above page size while rds fragment size is 4K. On those platforms single page is shared between 2 or more rds fragments. Each fragment has it's own offeset and rds cong map code need to take this offset to account. Not taking this offset to account lead to reading the data fragment as congestion map fragment and hang of the rds transmit due to far cong map corruption. Reviewed-by: Wengang Wang <wen.gang.w...@oracle.com> Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchand...@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> Tested-by: Anand Bibhuti <anand.bibh...@oracle.com> Signed-off-by: shamir rabinovitch <shamir.rabinovi...@oracle.com> --- net/rds/ib_recv.c |2 +- net/rds/iw_recv.c |2 +- net/rds/page.c|5 +++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c index 977fb86..abc8cc8 100644 --- a/net/rds/ib_recv.c +++ b/net/rds/ib_recv.c @@ -796,7 +796,7 @@ static void rds_ib_cong_recv(struct rds_connection *conn, addr = kmap_atomic(sg_page(>f_sg)); - src = addr + frag_off; + src = addr + frag->f_sg.offset + frag_off; dst = (void *)map->m_page_addrs[map_page] + map_off; for (k = 0; k < to_copy; k += 8) { /* Record ports that became uncongested, ie diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c If you refresh the patch against 4.6-rc1, you won't need to patch iw_recv.c :-) diff --git a/net/rds/page.c b/net/rds/page.c index 5a14e6d..715cbaa 100644 --- a/net/rds/page.c +++ b/net/rds/page.c @@ -135,8 +135,9 @@ int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, if (rem->r_offset != 0) rds_stats_inc(s_page_remainder_hit); - rem->r_offset += bytes; - if (rem->r_offset == PAGE_SIZE) { + /* some hw (e.g. sparc) require aligned memory */ + rem->r_offset += ALIGN(bytes, 8); + if (rem->r_offset >= PAGE_SIZE) { __free_page(rem->r_page); rem->r_page = NULL; } This hunk I missed out looks like. This doesn't belong to the $subject patch. Could you please add this in separate patch. I will need more than just "some hw (e.g. sparc) require aligned memory" Once you fix these, please repost the updated version, and I will add them to the 4.7 queue. Thanks !! Regards, Santosh
Re: [net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport
On 2/28/16 11:51 AM, Or Gerlitz wrote: On Sun, Feb 28, 2016 at 4:19 AM, Santosh Shilimkar <santosh.shilim...@oracle.com> wrote: RDS iWarp support code has become stale and non testable. As indicated earlier, am dropping the support for it. If new iWarp user(s) shows up in future, we can adapat the RDS IB transprt for the special RDMA READ sink case. iWarp needs an MR for the RDMA READ sink. Signed-off-by: Santosh Shilimkar <ssant...@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> Hi, just wondered if there's any special reason that all this series carries double S.O.B signature line with your name appearing twice on two different email addresses? Nothing special. I sign of all my patches with k.org id and have to keep Oracle id as well being a payed Oracle employee. ;-) Regards, Santosh
Re: [net-next][PATCH v2 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
On 2/28/16 1:08 AM, Christoph Hellwig wrote: On Sat, Feb 27, 2016 at 06:19:48PM -0800, Santosh Shilimkar wrote: Discovere Fast Memmory Registration support using IB device IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCA might support just FRMR or FMR or both FMR and FRWR. In case both mr type are supported, default FMR is used. Default MR is still kept as FMR against what everyone else is following. Default will be changed to FRMR once the RDS performance with FRMR is comparable with FMR. The work is in progress for the same. Signed-off-by: Santosh Shilimkar <ssant...@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> --- v2: Dropped the module parameter as suggested by David Miller This means we only use the safer method if the HCA doesn't support the other one. All other RDMA ULP that support both methods have a module_param so the veto from Dave is a bit unfortunate. Anyway, let's get the code in for now and figure out what to use later. Indeed. It wasn't really deal breaker for RDS so I agreed to drop it. Just curious: where / how do you see worse peformance using FRs? I wouldn't call it worse but its not comparable. Use case is multi-threaded RDS RDMA perf test(s). Hopefully the follow up series which am working on should minimise the gap. I am leaving the details for later, but one of the main issue I saw was contention on driver post_send() lock from send, MR reg and MR invalidation. Regards, Santosh
Re: [net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport
On 2/28/16 1:05 AM, Christoph Hellwig wrote: On Sat, Feb 27, 2016 at 06:19:38PM -0800, Santosh Shilimkar wrote: RDS iWarp support code has become stale and non testable. As indicated earlier, am dropping the support for it. If new iWarp user(s) shows up in future, we can adapat the RDS IB transprt for the special RDMA READ sink case. iWarp needs an MR for the RDMA READ sink. Please take a look at the RDMA RW API series I posted yesterday - if you can adopt RDS to that you should get iWarp support for free. Will have a look. Thanks for the pointer. But having two different codebases for IB/RoCE vs iWarp was always a bad idea, so great to see the second one retired! Acked-by: Christoph HellwigThanks !!
Re: [net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6
Hi Dave, On 2/26/16 9:43 PM, Santosh Shilimkar wrote: Series is generated against net-next but also applies against Linus's tip cleanly. The diff-stat looks bit scary since almost ~4K lines of code is getting removed. [...] Entire patchset is available below git tree: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds Just noticed that, I accidentally posted the patches from older(v1) folder, instead of updated v2. Sorry about it. Please discard this entire series. Will post intended v2 right after this email. Regards, Santosh
Re: [PATCH] net: rds: don't pretend to use cpu notifiers
Hi Sebastian, On 11/27/15 8:00 AM, Sebastian Andrzej Siewior wrote: It looks like an attempt to use CPU notifier here which was never completed. Nobody tried to wire it up completely since 2k9. So I unwind this code and get rid of everything not required. Oh look! 19 lines were removed while code still does the same thing. Indeed its true that RDS doesn't actually support the hot-plug so am not surprised with the incompleteness of it. Signed-off-by: Sebastian Andrzej Siewior--- net/rds/page.c | 31 ++- 1 file changed, 6 insertions(+), 25 deletions(-) diff --git a/net/rds/page.c b/net/rds/page.c index 9005a2c920ee..5a14e6d6a926 100644 --- a/net/rds/page.c +++ b/net/rds/page.c @@ -179,37 +179,18 @@ int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, } EXPORT_SYMBOL_GPL(rds_page_remainder_alloc); -static int rds_page_remainder_cpu_notify(struct notifier_block *self, -unsigned long action, void *hcpu) +void rds_page_exit(void) { - struct rds_page_remainder *rem; - long cpu = (long)hcpu; + unsigned int cpu; - rem = _cpu(rds_page_remainders, cpu); + for_each_possible_cpu(cpu) { + struct rds_page_remainder *rem; - rdsdebug("cpu %ld action 0x%lx\n", cpu, action); + rem = _cpu(rds_page_remainders, cpu); + rdsdebug("cpu %u\n", cpu); - switch (action) { - case CPU_DEAD: if (rem->r_page) __free_page(rem->r_page); rem->r_page = NULL; - break; } - - return 0; -} - -static struct notifier_block rds_page_remainder_nb = { - .notifier_call = rds_page_remainder_cpu_notify, -}; - -void rds_page_exit(void) -{ - int i; - - for_each_possible_cpu(i) - rds_page_remainder_cpu_notify(_page_remainder_nb, - (unsigned long)CPU_DEAD, - (void *)(long)i); } Thanks for the cleanup. I don't have any objections as such on the change and patch in general looks ok to me from first look. Just want to make sure it works on my setup. If all goes well, I will add this in my RDS queue for 4.5. Thanks !! Regards, Santosh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] RDS: convert bind hash table to re-sizable hashtable
On 10/14/15 2:15 PM, Santosh Shilimkar wrote: From: Santosh Shilimkar <ssant...@kernel.org> To further improve the RDS connection scalabilty on massive systems where number of sockets grows into tens of thousands of sockets, there is a need of larger bind hashtable. Pre-allocated 8K or 16K table is not very flexible in terms of memory utilisation. The rhashtable infrastructure gives us the flexibility to grow the hashtbable based on use and also comes up with inbuilt efficient bucket(chain) handling. Cc: David Laight <david.lai...@aculab.com> Cc: David Miller <da...@davemloft.net> Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> --- As promised in last series review, here is an RFC to conver RDS to make use of re-sizable hash tables. I haven't turned on auto shrinking on by purpose. Ignore the automatic_shrinking remark since patch has it enabled. net/rds/af_rds.c | 10 - net/rds/bind.c | 127 --- net/rds/rds.h| 7 ++- 3 files changed, 58 insertions(+), 86 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index 384ea1e..b5476aeb 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -573,6 +573,7 @@ static void rds_exit(void) rds_threads_exit(); rds_stats_exit(); rds_page_exit(); + rds_bind_lock_destroy(); rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info); rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); } @@ -582,11 +583,14 @@ static int rds_init(void) { int ret; - rds_bind_lock_init(); + ret = rds_bind_lock_init(); + if (ret) + goto out; ret = rds_conn_init(); if (ret) - goto out; + goto out_bind; + ret = rds_threads_init(); if (ret) goto out_conn; @@ -620,6 +624,8 @@ out_conn: rds_conn_exit(); rds_cong_exit(); rds_page_exit(); +out_bind: + rds_bind_lock_destroy(); out: return ret; } diff --git a/net/rds/bind.c b/net/rds/bind.c index bc6b93e..199e4cc 100644 --- a/net/rds/bind.c +++ b/net/rds/bind.c @@ -38,54 +38,18 @@ #include #include "rds.h" -struct bind_bucket { - rwlock_tlock; - struct hlist_head head; +static struct rhashtable bind_hash_table; + +static struct rhashtable_params ht_parms = { + .nelem_hint = 768, + .key_len = sizeof(u64), + .key_offset = offsetof(struct rds_sock, rs_bound_key), + .head_offset = offsetof(struct rds_sock, rs_bound_node), + .max_size = 16384, + .min_size = 1024, + .automatic_shrinking = true, }; -#define BIND_HASH_SIZE 1024 -static struct bind_bucket bind_hash_table[BIND_HASH_SIZE]; - -static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port) -{ - return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) & - (BIND_HASH_SIZE - 1)); -} - -/* must hold either read or write lock (write lock for insert != NULL) */ -static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket, - __be32 addr, __be16 port, - struct rds_sock *insert) -{ - struct rds_sock *rs; - struct hlist_head *head = >head; - u64 cmp; - u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); - - hlist_for_each_entry(rs, head, rs_bound_node) { - cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | - be16_to_cpu(rs->rs_bound_port); - - if (cmp == needle) { - rds_sock_addref(rs); - return rs; - } - } - - if (insert) { - /* -* make sure our addr and port are set before -* we are added to the list. -*/ - insert->rs_bound_addr = addr; - insert->rs_bound_port = port; - rds_sock_addref(insert); - - hlist_add_head(>rs_bound_node, head); - } - return NULL; -} - /* * Return the rds_sock bound at the given local address. * @@ -94,18 +58,14 @@ static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket, */ struct rds_sock *rds_find_bound(__be32 addr, __be16 port) { + u64 key = ((u64)addr << 32) | port; struct rds_sock *rs; - unsigned long flags; - struct bind_bucket *bucket = hash_to_bucket(addr, port); - read_lock_irqsave(>lock, flags); - rs = rds_bind_lookup(bucket, addr, port, NULL); - read_unlock_irqrestore(>lock, flags); - - if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) { - rds_sock_put(rs); + rs = rhashtable_lookup_fast(_hash_table, , ht_parms); + if (rs && !sock_flag(rds_rs_to
Re: [PATCH net-next] RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports.
On 10/11/15 1:46 PM, Sowmini Varadhan wrote: The IP address passed to rds_bind() should be vetted by the transport's ->laddr_check() for a previously bound transport. This needs to be done to avoid cases where, for example, the application has asked for an IB transport, but the IP address passed to bind is only usable on ethernet interfaces. Right. Probably it should go into stable as well. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- net/rds/bind.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/net/rds/bind.c b/net/rds/bind.c index bc6b93e..6192566 100644 --- a/net/rds/bind.c +++ b/net/rds/bind.c @@ -196,7 +196,14 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) goto out; if (rs->rs_transport) { /* previously bound */ - ret = 0; Minor comment. If you retain above line you can drop the below else. + trans = rs->rs_transport; + if (trans->laddr_check(sock_net(sock->sk), + sin->sin_addr.s_addr) != 0) { + ret = -ENOPROTOOPT; + rds_remove_bound(rs); + } else { + ret = 0; + } Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()
On 10/11/15 1:49 PM, Sowmini Varadhan wrote: Consider the following "duelling syn" sequence between two peers A and B: A B SYN1 --> <-- SYN2 SYN2ACK --> Note that the SYN/ACK has already been sent out by TCP before rds_tcp_accept_one() gets invoked as part of callbacks. If the inet_addr(A) is numerically less than inet_addr(B), the arbitration scheme in rds_tcp_accept_one() will prefer the TCP connection triggered by SYN1, and will send a CLOSE for the SYN2 (just after the SYN2ACK was sent). Since B also follows the same arbitration scheme, it will send the SYN-ACK for SYN1 that will set up a healthy ESTABLISHED connection on both sides. B will also get a CLOSE for SYN2, which should result in the cleanup of the TCP state machine for SYN2, but it should not trigger any stale RDS-TCP callbacks (such as ->writespace, ->state_change etc), that would disrupt the progress of the SYN2 based RDS-TCP connection. Thus the arbitration scheme in rds_tcp_accept_one() should restore rds_tcp callbacks for the winner before setting them up for the new accept socket, and also make sure that conn->c_outgoing is set to 0 so that we do not trigger any reconnect attempts on the passive side of the tcp socket in the future, in conformance with commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.") Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Looks fine to me. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/14] RDS: connection scalability and performance improvements
On 10/7/15 2:16 AM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Mon, 5 Oct 2015 10:56:23 -0700 [v3] Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to send worker" as per David Miller's comment [4] to avoid the magic value usage. Patch now makes use of already available but unused send_batch_count module parameter. Rest of the patches are same as earlier version v2 [3] Can you respin this again? Your repost of patch 05/14 was severely whitespace damaged. Sorry about that. I thought mailer won't eat tabs but it did. Just sent the series again. Bumped up the version to v4 from archives perspective. Thanks !! Regards, Santosh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/14] RDS: connection scalability and performance improvements
On 10/1/15 9:19 AM, David Laight wrote: From: Santosh Shilimkar Sent: 30 September 2015 18:24 ... This is being addressed by simply using per bucket rw lock which makes the locking simple and very efficient. The hash table size is still an issue and I plan to address it by using re-sizable hash tables as suggested on the list. If the hash chains are short do you need the expense of a rw lock for each chain? Chains can be really long on larger systems with many databases. A simple spinlock may be faster. If you use the hash chain lock for the reference count on the hashed objects you should be able to release the lock before locking the object itself. Because of the shared socket nature of RDS, the chain needs to be protected for parallel accesses for add/removal/lookup. Hashing is really used to get to the bucket which holds the hlist. Just to give a bit of history, RDS bind code has evolved over few years of time. It started with a rb tree and a global rw lock which wasn't very efficient.Then it was converted to rcu hlist with spin lock to make the look ups faster. But that scheme as well exploded on larger systems with truckloads of sockets with read/write lock failures because of excessive contention. As high as almost ~25% of system load. Per bucket lock actually solved most of those issues. Bucket or chain table increase(1k to 8K) was actually relatively smaller gain though still helped to reduce the contention by almost to a nominal 1 or 2 %. Am still getting my head around with rhashtable plumbing with this usecase. Will CC you when I post the RFC patch for the rhashtable conversion. Thanks for your comments so far. Regards, Santosh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RDS: verify the underlying transport exists before creating a connection
On 9/4/15 12:44 PM, Sasha Levin wrote: On 09/04/2015 01:32 PM, santosh shilimkar wrote: Sasha, On 9/4/2015 9:43 AM, Sasha Levin wrote: There was no verification that an underlying transport exists when creating a connection, this would cause dereferencing a NULL ptr. Signed-off-by: Sasha Levin <sasha.le...@oracle.com> --- net/rds/connection.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/net/rds/connection.c b/net/rds/connection.c index a50e652..0218d81 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -189,6 +189,12 @@ new_conn: } } +if (trans == NULL) { +kmem_cache_free(rds_conn_slab, conn); +conn = ERR_PTR(-ENODEV); +goto out; +} + Did you see the NULL oops in any tests ? The reason am asking this because callers of '__rds_conn_create()' are not passing the trans as null so that leaves with only the loopback case. In that case as well, rds_loop_transport is never going to be null. The check is good but am curious whether we have a case which will hit this scenario. This is the trace I have: Thanks. [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [135546.051270] Modules linked in: [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527 [135546.053217] task: 8800835bc000 ti: 8800bc708000 task.ti: 8800bc708000 [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194) [135546.055666] RSP: 0018:8800bc70fab0 EFLAGS: 00010202 [135546.056457] RAX: dc00 RBX: 0f2c RCX: 8800835bc000 [135546.057494] RDX: 0007 RSI: 8800835bccd8 RDI: 0038 [135546.058530] RBP: 8800bc70fb18 R08: 0001 R09: [135546.059556] R10: ed014d7a3a23 R11: ed014d7a3a21 R12: [135546.060614] R13: 0001 R14: 8801ec3d R15: [135546.061668] FS: 7faad4ffb700() GS:88025200() knlGS: [135546.062836] CS: 0010 DS: ES: CR0: 8005003b [135546.063682] CR2: 846a CR3: 9d137000 CR4: 06a0 [135546.064723] Stack: [135546.065048] afe2055c afe23fc1 ed00493097bf 8801ec3d0008 [135546.066247] 00d0 ac194a24c0586342 [135546.067438] 1100178e1f78 880320581b00 8800bc70fdd0 880320581b00 [135546.068629] Call Trace: [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134) [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298) [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278) [135546.071981] rds_sendmsg (net/rds/send.c:1058) [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38) [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298) [135546.074577] ? rds_send_drop_to (net/rds/send.c:976) [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795) [135546.076349] ? __might_fault (mm/memory.c:3795) [135546.077179] ? rds_send_drop_to (net/rds/send.c:976) [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620) [135546.078856] SYSC_sendto (net/socket.c:1657) [135546.079596] ? SYSC_connect (net/socket.c:1628) [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926) [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674) [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16) [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16) [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1 So rds_ib_laddr_check() failed means either the app isn't doing bind() before sendmsg() or ignoring the error returned by bind() and going ahead and issuing the send. Anyway, we can't control the application so having this hole plugged is good. Please include the back trace in the commit log. With that update, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] RDS: rds_conn_lookup() should factor in the struct net for a match
On 9/3/15 1:24 PM, Sowmini Varadhan wrote: Only return a conn if the rds_conn_net(conn) matches the struct net passed to rds_conn_lookup(). Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.") Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next:master 1267/1290] net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types)
On 8/25/15 3:55 PM, David Miller wrote: From: kbuild test robot fengguang...@intel.com Date: Wed, 26 Aug 2015 06:42:39 +0800 sparse warnings: (new ones prefixed by ) net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types) net/rds/ib_recv.c:382:28:expected int [signed] can_wait net/rds/ib_recv.c:382:28:got restricted gfp_t net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64 Fixed by: [PATCH] rds: Fix improper gfp_t usage. net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types) net/rds/ib_recv.c:382:28:expected int [signed] can_wait net/rds/ib_recv.c:382:28:got restricted gfp_t net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64 Reported-by: kbuild test robot fengguang...@intel.com Signed-off-by: David S. Miller da...@davemloft.net --- Thanks Dave. I was just creating the patch after noticing the error from kbuild on my tree. Regards, Santosh -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html