Re: [ceph-users] Help needed porting Ceph to RSockets
On Thu, 12 Sep 2013 12:20:03 +0200 Gandalf Corvotempesta
<gandalf.corvotempe...@gmail.com> wrote:
> 2013/9/10 Andreas Bluemle <andreas.blue...@itxperts.de>:
> > Since I have added these workarounds to my version of the librdmacm
> > library, I can at least start up ceph using LD_PRELOAD and end up in
> > a healthy ceph cluster state.
>
> Have you seen any performance improvement by using LD_PRELOAD with
> ceph? Which throughput are you able to achieve with rsockets and ceph?

I have not yet done any performance testing. The next step I have to
take is more related to setting up a larger cluster with something like
150 OSDs without hitting any resource limitations.

Regards

Andreas Bluemle

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Andreas Bluemle                     mailto:andreas.blue...@itxperts.de
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

after some more analysis and debugging, I found workarounds for my
problems; I have added these workarounds to the last version of the
patch for the poll problem by Sean; see the attachment to this posting.
The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

With rsockets, when a shutdown is done on side A of a socket connection,
the shutdown will only return after side B has done a close() on its end
of the connection. This is different from TCP/IP sockets: there, a
shutdown causes the other end to terminate the connection at the TCP
level instantly. The socket changes state into CLOSE_WAIT, which
indicates that the application-level close is still outstanding.

In the attached patch, the workaround is in rs_poll_cq(), case
RS_OP_CTRL: for a RS_CTRL_DISCONNECT message, rshutdown() is called on
side B. This causes the termination of the socket connection to be
acknowledged to side A, and the shutdown() there can now terminate.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

When an application does a shutdown() of side A and does a 2nd
shutdown() shortly after (for whatever reason), then the return of the
2nd shutdown() is delayed by 2 seconds. The delay happens in
rdma_disconnect(), when this is called from rshutdown() in the case that
the rsocket state is rs_disconnected.

Even if it could be considered a bug for an application to call
shutdown() twice on the same socket, it still does not make sense to
delay that 2nd call. To work around this, I have
- introduced an additional rsocket state: rs_shutdown
- switched to that new state in rshutdown() at the very end of the
  function. The first call to shutdown() will therefore switch to the
  new rsocket state rs_shutdown
- any further call to rshutdown() will then not do anything any more,
  because every effect of rshutdown() only happens if the rsocket state
  is either rs_connected or rs_disconnected.

Hence it would be better to explicitly check the rsocket state at the
beginning of the function and return immediately if the state is
rs_shutdown.

Since I have added these workarounds to my version of the librdmacm
library, I can at least start up ceph using LD_PRELOAD and end up in a
healthy ceph cluster state. I would not call these workarounds a real
fix, but they should point out the problems which I am trying to solve.

Regards

Andreas Bluemle

On Fri, 23 Aug 2013 00:35:22 +0000 "Hefty, Sean" <sean.he...@intel.com> wrote:
> > I tested out the patch and unfortunately had the same results as
> > Andreas. About 50% of the time the rpoll() thread in Ceph still hangs
> > when rshutdown() is called. I saw a similar behaviour when increasing
> > the poll time on the pre-patched version, if that's of any relevance.
>
> I'm not optimistic, but here's an updated patch. I attempted to handle
> more shutdown conditions, but I can't say that any of those would
> prevent the hang that you see. I have a couple of questions: Is there
> any chance that the code would call rclose() while rpoll() is still
> running? Also, can you verify that the thread is in the real poll()
> call when the hang occurs?
Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 src/rsocket.c |   35 +++++++++++++++++++++++++----------
 1 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..f94ddf3 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
 				rs->state = rs_disconnected;
 				return 0;
 			} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-				rs->state &= ~rs_readable;
+				if (rs->state & rs_writable) {
+					rs->state &= ~rs_readable;
+				} else {
+					rs->state = rs_disconnected;
+					return 0;
+				}
 			}
 			break;
 		case RS_OP_WRITE:
@@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 		rs = idm_lookup(idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Sean,

I will re-check by the end of the week; there is some test scheduling
issue with our test system, which affects my access times.

Thanks

Andreas

On Mon, 19 Aug 2013 17:10:11 +0000 "Hefty, Sean" <sean.he...@intel.com> wrote:
> Can you see if the patch below fixes the hang?
>
> Signed-off-by: Sean Hefty <sean.he...@intel.com>
> ---
>  src/rsocket.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..e45b26d 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 		rs = idm_lookup(idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,

 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);

+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);

--
Andreas Bluemle                     mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88           Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)          Mobil: (+49) 177 522 0151
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

I have added the patch and re-tested: I still encounter hangs of my
application. I am not quite sure whether I hit the same error on the
shutdown, because now I don't hit the error every time, but only every
now and then.

When adding the patch to my code base (git tag v1.0.17) I notice an
offset of -34 lines. Which code base are you using?

Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200 Andreas Bluemle
<andreas.blue...@itxperts.de> wrote:
> Hi Sean,
>
> I will re-check by the end of the week; there is some test scheduling
> issue with our test system, which affects my access times.
>
> Thanks
>
> Andreas
>
> On Mon, 19 Aug 2013 17:10:11 +0000 "Hefty, Sean" <sean.he...@intel.com> wrote:
> > Can you see if the patch below fixes the hang?
> >
> > [...]
Re: [ceph-users] Help needed porting Ceph to RSockets
into consideration what granularity the OS provides) and then call
ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing
power. If the real poll() is actually checking for something (e.g.
checking on the RDMA channel's fd or the IB channel's fd), then you may
not want to spin too much.

The real poll() call is intended to block the application until a
timeout occurs or an event shows up.

Since increasing the spin time works for you, it makes me suspect that
there is a bug in the CQ event handling in rsockets.

What's particularly weird is that the monitor receives a POLLHUP event
when the ceph command shuts down its socket, but the ceph command never
does. When using regular sockets, both sides of the connection receive a
POLLIN | POLLHUP | POLLRDHUP event when the sockets are shut down. It
would seem that there is a bug in rsockets which causes the side that
calls shutdown first not to receive the correct rpoll events.

rsockets does not support POLLRDHUP.

I don't think the issue is POLLRDHUP. I think the issue is POLLHUP
and/or POLLIN. My impression is that on a locally shut down (r)socket, a
poll for a POLLIN event should at least return a POLLIN event, and a
subsequent read should return 0 bytes, indicating EOF. But the POLLIN is
not generated by the layer below rsockets (ib_uverbs.ko?), as far as I
can tell. See also: http://www.greenend.org.uk/rjk/tech/poll.html

Best Regards

Andreas Bluemle
Re: using rsockets via librspreload: poll() support?
Hi Sean,

I begin to believe that this may be a more general problem: it seems to
me that errno is not always initialized to 0 when the librspreload
wrapper for a socket system call, or the corresponding r*() routine from
rsocket.c, is called.

For the poll() I have cleared errno explicitly before polling the socket
- and it is still cleared on return from poll(). Hence, where I used to
encounter an EOPNOTSUPP, I now see errno 0 (i.e. Success).

Best Regards

Andreas Bluemle

On Thu, 8 Aug 2013 17:46:29 +0200 Andreas Bluemle
<andreas.blue...@itxperts.de> wrote:
> Hi Sean,
>
> I am currently testing rsockets in connection with ceph. [...]
using rsockets via librspreload: poll() support?
Hi Sean,

I am currently testing rsockets in connection with ceph. I am using
LD_PRELOAD and librspreload.so to force the application (ceph) to use
rsockets instead of regular TCP/IP sockets. All this works pretty well -
until the point where an established connection is shut down: this seems
to not work and never finishes (unless the application is killed...).

Ceph uses its sockets in nonblocking mode. When reading from a socket,
it polls the socket first with an event mask of POLLIN and POLLRDHUP. On
return from the poll() I see that
- POLLIN and POLLHUP are set in the returned events (POLLRDHUP is *not*
  set)
- errno is 95 (EOPNOTSUPP)

(The POLLHUP makes me believe that in this case the other end has shut
down the socket already.)

The EOPNOTSUPP confuses ceph quite a bit and prevents it from shutting
down its side of the socket connection properly.

Question: is it possible that the POLLRDHUP causes the EOPNOTSUPP to be
set by librspreload::poll() or rpoll()?

Best Regards

Andreas Bluemle