Re: [ceph-users] Help needed porting Ceph to RSockets
2013-10-31 Hefty, Sean sean.he...@intel.com:
> Can you please try the attached patch in place of all previous patches?

Any updates on ceph with rsockets?
RE: [ceph-users] Help needed porting Ceph to RSockets
> I would not call these workarounds a real fix, but they should point
> out the problems which I am trying to solve.

Thanks for the update. I haven't had the time to investigate this, but did want to at least acknowledge that this hasn't gotten lost.

- Sean
Re: [ceph-users] Help needed porting Ceph to RSockets
2013/9/10 Andreas Bluemle andreas.blue...@itxperts.de:
> Since I have added these workarounds to my version of the librdmacm
> library, I can at least start up ceph using LD_PRELOAD and end up in a
> healthy ceph cluster state.

Have you seen any performance improvement by using LD_PRELOAD with ceph? Which throughput are you able to achieve with rsockets and ceph?
Re: [ceph-users] Help needed porting Ceph to RSockets
On Thu, 12 Sep 2013 12:20:03 +0200, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:
> Have you seen any performance improvement by using LD_PRELOAD with
> ceph? Which throughput are you able to achieve with rsockets and ceph?

I have not yet done any performance testing. The next step I have to take is more related to setting up a larger cluster, with something like 150 OSDs, without hitting any resource limitations.

Regards

Andreas Bluemle
ITXperts GmbH, Balanstrasse 73, D-81541 Muenchen (Germany), http://www.itxperts.de
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

after some more analysis and debugging, I found workarounds for my problems; I have added these workarounds to the last version of the patch for the poll problem by Sean; see the attachment to this posting. The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

With rsockets, when a shutdown is done on side A of a socket connection, the shutdown will only return after side B has done a close() on its end of the connection. This is different from TCP/IP sockets: there, a shutdown causes the other end to terminate the connection at the TCP level instantly. The socket changes state into CLOSE_WAIT, which indicates that the application-level close is outstanding.

In the attached patch, the workaround is in rs_poll_cq(), case RS_OP_CTRL: for a RS_CTRL_DISCONNECT, rshutdown() is called on side B; this causes the termination of the socket connection to be acknowledged to side A, and the shutdown() there can now terminate.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

When an application does a shutdown() of side A and does a 2nd shutdown() shortly after (for whatever reason), then the return of the 2nd shutdown() is delayed by 2 seconds. The delay happens in rdma_disconnect(), when this is called from rshutdown() in the case that the rsocket state is rs_disconnected.

Even if it could be considered a bug for an application to call shutdown() twice on the same socket, it still does not make sense to delay that 2nd call to shutdown(). To work around this, I have
- introduced an additional rsocket state: rs_shutdown
- switched to that new state in rshutdown() at the very end of the function; the first call to shutdown() therefore switches to the new rsocket state rs_shutdown
- ensured that any further call to rshutdown() will not do anything any more, because every effect of rshutdown() only happens if the rsocket state is either rs_connected or rs_disconnected. Hence it would be better to explicitly check the rsocket state at the beginning of the function and return immediately if the state is rs_shutdown.

Since I have added these workarounds to my version of the librdmacm library, I can at least start up ceph using LD_PRELOAD and end up in a healthy ceph cluster state. I would not call these workarounds a real fix, but they should point out the problems which I am trying to solve.

Regards

Andreas Bluemle

On Fri, 23 Aug 2013 00:35:22 +0000, Hefty, Sean sean.he...@intel.com wrote:
> > I tested out the patch and unfortunately had the same results as
> > Andreas. About 50% of the time the rpoll() thread in Ceph still hangs
> > when rshutdown() is called. I saw a similar behaviour when increasing
> > the poll time on the pre-patched version, if that's of any relevance.
>
> I'm not optimistic, but here's an updated patch. I attempted to handle
> more shutdown conditions, but I can't say that any of those would
> prevent the hang that you see.
>
> I have a couple of questions: Is there any chance that the code would
> call rclose() while rpoll() is still running? Also, can you verify that
> the thread is in the real poll() call when the hang occurs?
Signed-off-by: Sean Hefty sean.he...@intel.com
---
 src/rsocket.c |   35 +++++++++++++++++++++++++----------
 1 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..f94ddf3 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
 				rs->state = rs_disconnected;
 				return 0;
 			} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-				rs->state &= ~rs_readable;
+				if (rs->state & rs_writable) {
+					rs->state &= ~rs_readable;
+				} else {
+					rs->state = rs_disconnected;
+					return 0;
+				}
 			}
 			break;
 		case RS_OP_WRITE:
@@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 		rs = idm_lookup(idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
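[The rs_shutdown state from workaround 2 is not visible in the diff excerpt above. As illustration only, the idea might look like the minimal sketch below; the state bit value, the lookup, and the error handling are assumptions in the style of rsocket.c, not the actual patch:]

/* Hypothetical sketch (not the actual patch): a terminal rs_shutdown
 * state so that a second rshutdown() returns immediately instead of
 * re-entering rdma_disconnect() and eating the 2-second delay. */
enum { rs_shutdown = 0x4000 };          /* assumed unused state bit */

int rshutdown(int socket, int how)
{
        struct rsocket *rs = idm_lookup(idm, socket);  /* as in rsocket.c */

        if (!rs)
                return ERR(EBADF);
        if (rs->state & rs_shutdown)
                return 0;               /* 2nd shutdown(): nothing left to do */

        /* ... existing graceful-disconnect logic unchanged ... */

        rs->state |= rs_shutdown;       /* switch state at the very end */
        return 0;
}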
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Sean,

I tested out the patch and unfortunately had the same results as Andreas. About 50% of the time the rpoll() thread in Ceph still hangs when rshutdown() is called. I saw a similar behaviour when increasing the poll time on the pre-patched version, if that's of any relevance.

Thanks

On Tue, Aug 20, 2013 at 11:04 PM, Hefty, Sean sean.he...@intel.com wrote:
> > I have added the patch and re-tested: I still encounter hangs of my
> > application. I am not quite sure whether I hit the same error on the
> > shutdown, because now I don't hit the error always, but only every
> > now and then. I guess this is at least some progress... :/
> >
> > When adding the patch to my code base (git tag v1.0.17) I notice an
> > offset of -34 lines. Which code base are you using?
>
> This patch was generated against the tip of the git tree.
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Sean,

I will re-check until the end of the week; there is some test scheduling issue with our test system, which affects my access times.

Thanks

Andreas

On Mon, 19 Aug 2013 17:10:11 +0000, Hefty, Sean sean.he...@intel.com wrote:
> Can you see if the patch below fixes the hang?
>
> [patch snipped - quoted in full in the original posting below]
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

I have added the patch and re-tested: I still encounter hangs of my application. I am not quite sure whether I hit the same error on the shutdown, because now I don't hit the error always, but only every now and then.

When adding the patch to my code base (git tag v1.0.17) I notice an offset of -34 lines. Which code base are you using?

Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200, Andreas Bluemle andreas.blue...@itxperts.de wrote:
> Hi Sean,
>
> I will re-check until the end of the week; there is some test
> scheduling issue with our test system, which affects my access times.
>
> [quoted patch snipped - see the original posting below]
RE: [ceph-users] Help needed porting Ceph to RSockets
> I have added the patch and re-tested: I still encounter hangs of my
> application. I am not quite sure whether I hit the same error on the
> shutdown, because now I don't hit the error always, but only every now
> and then. I guess this is at least some progress... :/
>
> When adding the patch to my code base (git tag v1.0.17) I notice an
> offset of -34 lines. Which code base are you using?

This patch was generated against the tip of the git tree.
RE: [ceph-users] Help needed porting Ceph to RSockets
Can you see if the patch below fixes the hang?

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 src/rsocket.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..e45b26d 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 		rs = idm_lookup(idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
RE: [ceph-users] Help needed porting Ceph to RSockets
> I am looking at a multithreaded application here, and I believe that
> the race is between thread A calling rpoll() for the POLLIN event and
> thread B calling shutdown(SHUT_RDWR) for reading and writing of the
> (r)socket almost immediately afterwards.

I modified a test program, and I can reproduce the hang as you describe -- calling rpoll() then rshutdown() from another thread. These calls end up calling rpoll()->poll() followed by rshutdown()->read(). The read completes, which completes rshutdown(), but the poll continues to wait for an event. In the kernel, poll ends up in uverbs.c::ib_uverbs_event_poll(), and read in uverbs.c::ib_uverbs_event_read(). The behavior of poll/read seems reasonable, so I don't think this is a kernel issue.

I'm still trying to figure out a simple solution to fixing this.

- Sean
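[For reference, a standalone reproducer for this race can be built on the public rsockets API alone. The sketch below is an assumption of what such a test looks like, not the exact program Sean used: the peer address handling is simplified, and the 100 ms sleep merely lines the two threads up. Build with something like gcc race.c -lrdmacm -lpthread and point it at any listening rsockets peer.]

#include <stdio.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <rdma/rsocket.h>

static int fd;

static void *poller(void *arg)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        /* Thread A: blocks in rpoll(); with the bug it never wakes up
         * after thread B shuts the rsocket down. */
        int n = rpoll(&pfd, 1, 30000);
        printf("rpoll returned %d, revents 0x%x\n", n, pfd.revents);
        return NULL;
}

int main(int argc, char **argv)
{
        struct addrinfo *res;
        pthread_t t;

        if (argc < 3 || getaddrinfo(argv[1], argv[2], NULL, &res))
                return 1;
        fd = rsocket(res->ai_family, SOCK_STREAM, 0);
        if (fd < 0 || rconnect(fd, res->ai_addr, res->ai_addrlen))
                return 1;

        pthread_create(&t, NULL, poller, NULL);
        usleep(100 * 1000);             /* let thread A block in rpoll() */
        rshutdown(fd, SHUT_RDWR);       /* thread B: should wake A with POLLHUP */
        pthread_join(t, NULL);
        rclose(fd);
        freeaddrinfo(res);
        return 0;
}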
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

maybe some information about the environment I am working in:
- CentOS 6.4 with custom kernel 3.8.13
- librdmacm / librspreload from git, tag 1.0.17
- application started with librspreload in the LD_PRELOAD environment

Currently, I have increased the value of the spin time by setting the default value for polling_time in the source code. I guess that the correct way to do this is via configuration in /etc/rdma/rsocket/polling_time?

Concerning the rpoll() itself, some more comments/questions embedded below.

On Tue, 13 Aug 2013 21:44:42 +0000, Hefty, Sean sean.he...@intel.com wrote:
> > I found a workaround for my (our) problem: in the librdmacm code,
> > rsocket.c, there is a global constant polling_time, which is set to
> > 10 microseconds at the moment. I raise this to 10000 - and all of a
> > sudden things work nicely.
> >
> > If I understand what you are describing, the caller to rpoll() spins
> > for up to 10 ms (10,000 us) before calling the real poll(). What is
> > the purpose of the real poll() call? Is it simply a means to block
> > the caller and avoid spinning? Or does it actually expect to detect
> > an event?
>
> When the real poll() is called, an event is expected on an fd
> associated with the CQ's completion channel.

The first question I would have is: why is the rpoll() split into these two pieces? There must have been some reason to do a busy loop on some local state information rather than just call the real poll() directly.

I think we are looking at two issues here:

1. The thread structure of the ceph messenger

For a given socket connection, there are 3 threads of interest here: the main messenger thread, the Pipe::reader and the Pipe::writer.

For a ceph client like the ceph admin command, I see the following sequence:
- the connection to the ceph monitor is created by the main messenger thread; the Pipe::reader and Pipe::writer are instantiated
- the requested command is sent to the ceph monitor, the answer is read and printed
- at this point the Pipe::reader has already called tcp_read_wait(), polling for more data or connection termination
- after the response has been printed, the main loop calls the shutdown routines, which in turn shutdown() the socket

There is some time between the last two steps - and this gap is long enough to open a race:

2. rpoll, ibv and poll

The rpoll implementation in rsockets is split into 2 phases:
- a busy loop which checks the state of the underlying ibv queue pair
- the call to the real poll() system call (i.e. the uverbs(?) implementation of poll() inside the kernel)

The busy loop has a maximum duration of polling_time (10 microseconds by default) and is able to detect the local shutdown, returning a POLLHUP. The poll() system call (i.e. the uverbs implementation of poll() in the kernel) does not detect the local shutdown - it only returns after the caller-supplied timeout expires.

> It sounds like there's an issue here either with a message getting
> lost or a race. Given that spinning longer works for you, it sounds
> like an event is getting lost, not being generated correctly, or not
> being configured to generate.

I am looking at a multithreaded application here, and I believe that the race is between thread A calling rpoll() for the POLLIN event and thread B calling shutdown(SHUT_RDWR) for reading and writing of the (r)socket almost immediately afterwards. I think that the shutdown itself does not cause a POLLHUP event to be generated from the kernel to interrupt the real poll().

(BTW: which kernel module implements the poll() for rsockets? Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)

Increasing the rsockets polling_time from 10 to 10000 microseconds results in the rpoll detecting the local shutdown within the busy loop. Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5 seconds serves a similar purpose, but is much coarser.

From my understanding, the real issue is neither at the ceph nor at the rsockets level: it is related to the uverbs kernel module.

An alternative way to address the current problem at the rsockets level would be a re-write of rpoll(): instead of the busy loop at the beginning followed by the real poll() call with the full user-specified timeout value ("ms tcp read timeout" in our case), I would embed the real poll() into a loop, splitting the user-specified timeout into smaller portions and doing the rsockets-specific rs_poll_check() on every timeout of the real poll(). (A sketch of this idea follows below.)

> I have not looked at the rsocket code, so take the following with a
> grain of salt. If the purpose of the real poll() is to simply block
> the user for a specified time, then you can simply make it a short
> duration (taking into consideration what granularity the OS provides)
> and then call ibv_poll_cq(). Keep in mind, polling will prevent your
> CPU from reducing power.
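[A minimal sketch of that proposed restructuring, for illustration only: the slice length is an arbitrary assumption, and rs_poll_check() is the rsocket.c helper named above with an assumed signature - the real helper takes different arguments.]

#include <poll.h>

#define RS_POLL_SLICE_MS 100    /* assumed slice: wakeup latency vs. overhead */

/* assumed prototype for the rsockets-local state check named above */
extern int rs_poll_check(struct pollfd *fds, nfds_t nfds);

static int rpoll_sliced(struct pollfd *fds, nfds_t nfds, int timeout)
{
        int ret, waited = 0;

        do {
                int slice = RS_POLL_SLICE_MS;

                if (timeout >= 0 && timeout - waited < slice)
                        slice = timeout - waited;

                ret = poll(fds, nfds, slice);   /* the real poll() */
                if (ret != 0)
                        return ret;             /* kernel event or error */

                /* re-check rsocket-local state, e.g. a local shutdown
                 * that the kernel-side poll() cannot see */
                ret = rs_poll_check(fds, nfds);
                if (ret)
                        return ret;

                waited += slice;
        } while (timeout < 0 || waited < timeout);

        return 0;       /* timed out without events */
}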
Re: [ceph-users] Help needed porting Ceph to RSockets
On Aug 14, 2013, at 3:21 AM, Andreas Bluemle andreas.blue...@itxperts.de wrote:
> The first question I would have is: why is the rpoll() split into
> these two pieces? There must have been some reason to do a busy loop
> on some local state information rather than just call the real poll()
> directly.

Sean can answer specifically, but this is a typical HPC technique. The worst thing you can do is handle an event and then block when the next event is available. This adds 1-3 us to latency, which is unacceptable in HPC. In HPC, we poll. If we worry about power, we poll until we get no more events and then we poll a little more before blocking. Determining the "little more" is the fun part. ;-)

Scott
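[In code terms, the spin-then-block pattern Scott describes might look like the sketch below. try_get_event() and blocking_wait() are hypothetical stand-ins for a non-blocking completion check and a kernel-level wait - in rsockets terms, roughly ibv_poll_cq() and the real poll() on the completion channel fd.]

#include <stdbool.h>
#include <time.h>

extern bool try_get_event(void);  /* hypothetical non-blocking check */
extern void blocking_wait(void);  /* hypothetical kernel-level block */

static long elapsed_ns(const struct timespec *start)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start->tv_sec) * 1000000000L +
               (now.tv_nsec - start->tv_nsec);
}

static void progress_loop(long spin_ns)
{
        struct timespec start;

        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &start);

                /* spin: events handled here avoid the 1-3 us wakeup cost */
                while (elapsed_ns(&start) < spin_ns) {
                        if (try_get_event())    /* got one: restart window */
                                clock_gettime(CLOCK_MONOTONIC, &start);
                }

                /* no events for spin_ns: give up the CPU until kicked */
                blocking_wait();
        }
}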
RE: [ceph-users] Help needed porting Ceph to RSockets
> The first question I would have is: why is the rpoll() split into
> these two pieces? There must have been some reason to do a busy loop
> on some local state information rather than just call the real poll()
> directly.

As Scott mentioned in his email, this is done for performance reasons. The cost of always dropping into the kernel is too high for HPC.

> I am looking at a multithreaded application here, and I believe that
> the race is between thread A calling rpoll() for the POLLIN event and
> thread B calling shutdown(SHUT_RDWR) for reading and writing of the
> (r)socket almost immediately afterwards.

Ah - this is likely the issue. I did not assume that rshutdown() would be called simultaneously with rpoll(). I need to think about how to solve this, so that rpoll() unblocks.

> I think that the shutdown itself does not cause a POLLHUP event to be
> generated from the kernel to interrupt the real poll(). (BTW: which
> kernel module implements the poll() for rsockets? Is that ib_uverbs.ko
> with ib_uverbs_poll_cq()?)

The POLLHUP event in rsockets is just software indicating that such an 'event' occurred - basically when a call to rpoll() detects that the rsocket state is disconnected. I believe that the real poll() call traps into ib_uverbs_event_poll() in the kernel. The fd associated with the poll call corresponds to a 'completion channel', which is used to report events which occur on a CQ. Connection-related events don't actually go to that fd - only completions for data transfers.

- Sean
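[To make the completion-channel mechanics concrete: a CQ event is normally consumed from that fd with the standard verbs calls below. This is generic libibverbs usage for illustration, assuming the CQ was created against this channel and armed once with ibv_req_notify_cq() - it is not rsockets' actual internal code.]

#include <poll.h>
#include <infiniband/verbs.h>

/* Block until the completion channel fd signals a CQ event, then ack it
 * and re-arm the CQ.  Returns 1 on event, 0 on timeout, -1 on error. */
static int wait_for_cq_event(struct ibv_comp_channel *channel, int timeout_ms)
{
        struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };
        struct ibv_cq *cq;
        void *ctx;
        int n;

        n = poll(&pfd, 1, timeout_ms);  /* the "real poll()" in this thread */
        if (n <= 0)
                return n;               /* timeout (0) or error (-1) */

        if (ibv_get_cq_event(channel, &cq, &ctx))
                return -1;
        ibv_ack_cq_events(cq, 1);       /* events must be acked */
        return ibv_req_notify_cq(cq, 0) ? -1 : 1;  /* re-arm for next event */
}

[As Sean notes, only data-transfer completions arrive on this fd; a local rshutdown() generates no event here, which is exactly why the blocking poll() cannot see it.]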
Re: [ceph-users] Help needed porting Ceph to RSockets
On Aug 13, 2013, at 10:06 AM, Andreas Bluemle andreas.blue...@itxperts.de wrote:
> Hi Matthew,
>
> I found a workaround for my (our) problem: in the librdmacm code,
> rsocket.c, there is a global constant polling_time, which is set to 10
> microseconds at the moment. I raise this to 10000 - and all of a
> sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this.

If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll(). What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

> I think we are looking at two issues here:
>
> [two-issues analysis of the ceph messenger threads and the rpoll busy
> loop snipped - quoted in full earlier in this thread]
>
> An alternative way to address the current problem at the rsockets
> level would be a re-write of rpoll(): instead of the busy loop at the
> beginning followed by the real poll() call with the full
> user-specified timeout value ("ms tcp read timeout" in our case), I
> would embed the real poll() into a loop, splitting the user-specified
> timeout into smaller portions and doing the rsockets-specific
> rs_poll_check() on every timeout of the real poll().

I have not looked at the rsocket code, so take the following with a grain of salt. If the purpose of the real poll() is to simply block the user for a specified time, then you can simply make it a short duration (taking into consideration what granularity the OS provides) and then call ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing power.

If the real poll() is actually checking for something (e.g. checking on the RDMA channel's fd or the IB channel's fd), then you may not want to spin too much.

Scott

> Best Regards
> Andreas Bluemle
>
> On Tue, 13 Aug 2013 07:53:12 +0200, Andreas Bluemle
> andreas.blue...@itxperts.de wrote:
> > Hi Matthew,
> >
> > I can confirm the behaviour which you describe. I too believe that
> > the problem is on the client side (ceph command). My log files show
> > the very same symptom, i.e. the client side not being able to shut
> > down the pipes properly.
> >
> > (Q: I had problems yesterday sending a mail to the ceph-users list
> > with the log files attached, because the size of the attachments
> > exceeded some limit; I hadn't been subscribed to the list at that
> > point. Is the use of pastebin.com the better way to provide such
> > lengthy information in general?)
> >
> > Best Regards
> > Andreas Bluemle
> >
> > On Tue, 13 Aug 2013 11:59:36 +0800, Matthew Anderson
> > manderson8...@gmail.com wrote:
> > > Moving this conversation to ceph-devel where the devs might be
> > > able to shed some light on this. I've added some additional debug
> > > to my code to narrow the issue down a bit, and the reader thread
> > > appears to be getting locked by tcp_read_wait() because rpoll
> > > never returns an event when the socket is shut down. A hack way of
> > > proving this was to lower the timeout in rpoll to 5 seconds. When
> > > a command like 'ceph osd tree' completes you can see it block for
> > > 5 seconds until rpoll times out and returns 0. The reader thread
> > > is then able to join and the [...]
RE: [ceph-users] Help needed porting Ceph to RSockets
> > I found a workaround for my (our) problem: in the librdmacm code,
> > rsocket.c, there is a global constant polling_time, which is set to
> > 10 microseconds at the moment. I raise this to 10000 - and all of a
> > sudden things work nicely.
>
> I am adding the linux-rdma list to CC so Sean might see this.
>
> If I understand what you are describing, the caller to rpoll() spins
> for up to 10 ms (10,000 us) before calling the real poll(). What is
> the purpose of the real poll() call? Is it simply a means to block the
> caller and avoid spinning? Or does it actually expect to detect an
> event?

When the real poll() is called, an event is expected on an fd associated with the CQ's completion channel.

> I think we are looking at two issues here:
>
> [two-issues analysis of the ceph messenger threads and the rpoll busy
> loop snipped - quoted in full earlier in this thread]

It sounds like there's an issue here either with a message getting lost or a race. Given that spinning longer works for you, it sounds like an event is getting lost, not being generated correctly, or not being configured to generate.

> Increasing the rsockets polling_time from 10 to 10000 microseconds
> results in the rpoll detecting the local shutdown within the busy
> loop. Decreasing the ceph "ms tcp read timeout" from the default of
> 900 to 5 seconds serves a similar purpose, but is much coarser.
>
> From my understanding, the real issue is neither at the ceph nor at
> the rsockets level: it is related to the uverbs kernel module.
>
> An alternative way to address the current problem at the rsockets
> level would be a re-write of rpoll(): instead of the busy loop at the
> beginning followed by the real poll() call with the full
> user-specified timeout value ("ms tcp read timeout" in our case), I
> would embed the real poll() into a loop, splitting the user-specified
> timeout into smaller portions and doing the rsockets-specific
> rs_poll_check() on every timeout of the real poll().
>
> I have not looked at the rsocket code, so take the following with a
> grain of salt. If the purpose of the real poll() is to simply block
> the user for a specified time, then you can simply make it a short
> duration (taking into consideration what granularity the OS provides)
> and then call ibv_poll_cq(). Keep in mind, polling will prevent your
> CPU from reducing power.
>
> If the real poll() is actually checking for something (e.g. checking
> on the RDMA channel's fd or the IB channel's fd), then you may not
> want to spin too much.

The real poll() call is intended to block the application until a timeout occurs or an event shows up. Since increasing the spin time works for you, it makes me suspect that there is a bug in the CQ event handling in rsockets.

> What's particularly weird is that the monitor receives a POLLHUP event
> when the ceph command shuts down its socket, but the ceph command
> never does. When using regular sockets, both sides of the connection
> receive a POLLIN | POLLHUP | POLLRDHUP event when the sockets are shut
> down. It would seem like there is a bug in rsockets that causes the
> side that calls shutdown first not to receive the correct rpoll
> events.

rsockets does not support POLLRDHUP.

- Sean