Re: [ceph-users] Help needed porting Ceph to RSockets

2014-02-05 Thread Gandalf Corvotempesta
2013-10-31 Hefty, Sean sean.he...@intel.com:
 Can you please try the attached patch in place of all previous patches?

Any updates on ceph with rsockets?


RE: [ceph-users] Help needed porting Ceph to RSockets

2013-09-20 Thread Hefty, Sean
 I would not call these workarounds a real fix, but they should point
 out the problems which I am trying to solve.

Thanks for the update.  I haven't had the time to investigate this, but did 
want to at least acknowledge that this hasn't gotten lost.

- Sean


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-09-12 Thread Gandalf Corvotempesta
2013/9/10 Andreas Bluemle andreas.blue...@itxperts.de:
 Since I have added these workarounds to my version of the librdmacm
 library, I can at least start up ceph using LD_PRELOAD and end up in
 a healthy ceph cluster state.

Have you seen any performance improvement by using LD_PRELOAD with ceph?
What throughput are you able to achieve with rsockets and ceph?


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-09-12 Thread Andreas Bluemle
On Thu, 12 Sep 2013 12:20:03 +0200
Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote:

 2013/9/10 Andreas Bluemle andreas.blue...@itxperts.de:
  Since I have added these workarounds to my version of the librdmacm
  library, I can at least start up ceph using LD_PRELOAD and end up in
  a healthy ceph cluster state.
 
 Have you seen any performance improvement by using LD_PRELOAD with
 ceph? What throughput are you able to achieve with rsockets and ceph?

I have not yet done any performance testing.

The next step I have to take is more related to setting up
a larger cluster with something like 150 OSDs without hitting any
resource limitations.

Regards

Andreas Bluemle

 
 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
ITXperts GmbH   http://www.itxperts.de
Balanstrasse 73, Geb. 08Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)  Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-09-10 Thread Andreas Bluemle
Hi,

after some more analysis and debugging, I found
workarounds for my problems; I have added these workarounds
to the last version of the patch for the poll problem by Sean;
see the attachment to this posting.

The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

   With rsockets, when a shutdown is done on side A of a socket
   connection, then the shutdown will only return after side B
   has done a close() on its end of the connection.

   This is different from TCP/IP sockets: there a shutdown will cause
   the other end to terminate the connection at the TCP level
   instantly. The socket changes state into CLOSE_WAIT, which indicates
   that the application level close is outstanding.

   In the attached patch, the workaround is in rs_poll_cq(),
   case RS_OP_CTRL, where for a RS_CTRL_DISCONNECT the rshutdown()
   is called on side B; this causes the termination of the
   socket connection to be acknowledged to side A, so the shutdown()
   there can now return.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

   When an application does a shutdown() of side A and does a 2nd
   shutdown() shortly after (for whatever reason) then the
   return of the 2nd shutdown() is delayed by 2 seconds.

   The delay happens in rdma_disconnect(), when this is called
   from rshutdown() in the case that the rsocket state is
   rs_disconnected.

   Even if it could be considered a bug for an application to
   call shutdown() twice on the same socket, it still does not
   make sense to delay that 2nd call to shutdown().

   To work around this, I have
   - introduced an additional rsocket state: rs_shutdown
   - switched to that new state in rshutdown() at the very end
     of the function.

   The first call to shutdown() will therefore switch to the new
   rsocket state rs_shutdown - and any further call to rshutdown()
   will not do anything any more, because every effect of rshutdown()
   only happens if the rsocket state is either rs_connected or
   rs_disconnected. It would be even better to explicitly check
   the rsocket state at the beginning of the function and return
   immediately if the state is rs_shutdown (see the sketch below).
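
A minimal sketch of that early-return guard (simplified; rs_shutdown is the
new state described above, and idm_lookup() plus the struct rsocket fields
are borrowed from rsocket.c only for illustration):

int rshutdown(int socket, int how)
{
        struct rsocket *rs = idm_lookup(&idm, socket);

        /* Workaround 2: a repeated shutdown() must return immediately. */
        if (rs->state == rs_shutdown)
                return 0;

        /* ... existing rshutdown() logic: notify the peer, drain
         * outstanding sends, possibly rdma_disconnect() ... */

        /* Switch to the new state at the very end of the function. */
        rs->state = rs_shutdown;
        return 0;
}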

Since I have added these workarounds to my version of the librdmacm
library, I can at least start up ceph using LD_PRELOAD and end up in
a healthy ceph cluster state.

I would not call these workarounds a real fix, but they should point
out the problems which I am trying to solve.


Regards

Andreas Bluemle




On Fri, 23 Aug 2013 00:35:22 +
Hefty, Sean sean.he...@intel.com wrote:

  I tested out the patch and unfortunately had the same results as
  Andreas. About 50% of the time the rpoll() thread in Ceph still
  hangs when rshutdown() is called. I saw a similar behaviour when
  increasing the poll time on the pre-patched version if that's of
  any relevance.
 
 I'm not optimistic, but here's an updated patch.  I attempted to
 handle more shutdown conditions, but I can't say that any of those
 would prevent the hang that you see.
 
 I have a couple of questions: 
 
 Is there any chance that the code would call rclose while rpoll
 is still running?  Also, can you verify that the thread is in the
 real poll() call when the hang occurs?
 
 Signed-off-by: Sean Hefty sean.he...@intel.com
 ---
  src/rsocket.c |   35 +--
  1 files changed, 25 insertions(+), 10 deletions(-)
 
 diff --git a/src/rsocket.c b/src/rsocket.c
 index d544dd0..f94ddf3 100644
 --- a/src/rsocket.c
 +++ b/src/rsocket.c
 @@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
                  rs->state = rs_disconnected;
                  return 0;
              } else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
 -                rs->state &= ~rs_readable;
 +                if (rs->state & rs_writable) {
 +                    rs->state &= ~rs_readable;
 +                } else {
 +                    rs->state = rs_disconnected;
 +                    return 0;
 +                }
              }
              break;
          case RS_OP_WRITE:
 @@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
          rs = idm_lookup(&idm, fds[i].fd);
          if (rs) {
 +            fastlock_acquire(&rs->cq_wait_lock);
              if (rs->type == SOCK_STREAM)
                  rs_get_cq_event(rs);
              else
                  ds_get_cq_event(rs);
 +            fastlock_release(&rs->cq_wait_lock);
              fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
          } else {
   

Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-21 Thread Matthew Anderson
Hi Sean,

I tested out the patch and unfortunately had the same results as
Andreas. About 50% of the time the rpoll() thread in Ceph still hangs
when rshutdown() is called. I saw a similar behaviour when increasing
the poll time on the pre-patched version if that's of any relevance.

Thanks

On Tue, Aug 20, 2013 at 11:04 PM, Hefty, Sean sean.he...@intel.com wrote:
 I have added the patch and re-tested: I still encounter
 hangs of my application. I am not quite sure whether I hit
 the same error on the shutdown, because now I don't hit the
 error always, but only every now and then.

 I guess this is at least some progress... :/

 When adding the patch to my code base (git tag v1.0.17) I notice
 an offset of -34 lines. Which code base are you using?

 This patch was generated against the tip of the git tree.



Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Andreas Bluemle
Hi Sean,

I will re-check until the end of the week; there is
some test scheduling issue with our test system, which
affects my access times.

Thanks

Andreas


On Mon, 19 Aug 2013 17:10:11 +
Hefty, Sean sean.he...@intel.com wrote:

 Can you see if the patch below fixes the hang?
 
 Signed-off-by: Sean Hefty sean.he...@intel.com
 ---
  src/rsocket.c |   11 ++-
  1 files changed, 10 insertions(+), 1 deletions(-)
 
 diff --git a/src/rsocket.c b/src/rsocket.c
 index d544dd0..e45b26d 100644
 --- a/src/rsocket.c
 +++ b/src/rsocket.c
 @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
          rs = idm_lookup(&idm, fds[i].fd);
          if (rs) {
 +            fastlock_acquire(&rs->cq_wait_lock);
              if (rs->type == SOCK_STREAM)
                  rs_get_cq_event(rs);
              else
                  ds_get_cq_event(rs);
 +            fastlock_release(&rs->cq_wait_lock);
              fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
          } else {
              fds[i].revents = rfds[i].revents;
 @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
  /*
   * For graceful disconnect, notify the remote side that we're
 - * disconnecting and wait until all outstanding sends complete.
 + * disconnecting and wait until all outstanding sends complete, provided
 + * that the remote side has not sent a disconnect message.
   */
  int rshutdown(int socket, int how)
  {
 @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
      if (rs->state & rs_connected)
          rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
 +    if (rs->state & rs_disconnected) {
 +        /* Generate event by flushing receives to unblock rpoll */
 +        ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
 +        rdma_disconnect(rs->cm_id);
 +    }
 +
      if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
          rs_set_nonblocking(rs, rs->fd_flags);
  
 
 
 
 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Andreas Bluemle
Hi,

I have added the patch and re-tested: I still encounter
hangs of my application. I am not quite sure whether I hit
the same error on the shutdown, because now I don't hit the
error always, but only every now and then.

When adding the patch to my code base (git tag v1.0.17) I notice
an offset of -34 lines. Which code base are you using?


Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200
Andreas Bluemle andreas.blue...@itxperts.de wrote:

 Hi Sean,
 
 I will re-check until the end of the week; there is
 some test scheduling issue with our test system, which
 affects my access times.
 
 Thanks
 
 Andreas
 
 
 On Mon, 19 Aug 2013 17:10:11 +
 Hefty, Sean sean.he...@intel.com wrote:
 
  Can you see if the patch below fixes the hang?
  
  Signed-off-by: Sean Hefty sean.he...@intel.com
  ---
   src/rsocket.c |   11 ++-
   1 files changed, 10 insertions(+), 1 deletions(-)
  
  diff --git a/src/rsocket.c b/src/rsocket.c
  index d544dd0..e45b26d 100644
  --- a/src/rsocket.c
  +++ b/src/rsocket.c
  @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
  
           rs = idm_lookup(&idm, fds[i].fd);
           if (rs) {
  +            fastlock_acquire(&rs->cq_wait_lock);
               if (rs->type == SOCK_STREAM)
                   rs_get_cq_event(rs);
               else
                   ds_get_cq_event(rs);
  +            fastlock_release(&rs->cq_wait_lock);
               fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
           } else {
               fds[i].revents = rfds[i].revents;
  @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
  
   /*
    * For graceful disconnect, notify the remote side that we're
  - * disconnecting and wait until all outstanding sends complete.
  + * disconnecting and wait until all outstanding sends complete, provided
  + * that the remote side has not sent a disconnect message.
    */
   int rshutdown(int socket, int how)
   {
  @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
       if (rs->state & rs_connected)
           rs_process_cq(rs, 0, rs_conn_all_sends_done);
  
  +    if (rs->state & rs_disconnected) {
  +        /* Generate event by flushing receives to unblock rpoll */
  +        ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
  +        rdma_disconnect(rs->cm_id);
  +    }
  +
       if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
           rs_set_nonblocking(rs, rs->fd_flags);
  
  
  
  
  
 
 
 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151


RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Hefty, Sean
 I have added the patch and re-tested: I still encounter
 hangs of my application. I am not quite sure whether I hit
 the same error on the shutdown, because now I don't hit the
 error always, but only every now and then.

I guess this is at least some progress... :/
 
 When adding the patch to my code base (git tag v1.0.17) I notice
 an offset of -34 lines. Which code base are you using?

This patch was generated against the tip of the git tree. 



RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-19 Thread Hefty, Sean
Can you see if the patch below fixes the hang?

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 src/rsocket.c |   11 ++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..e45b26d 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
         rs = idm_lookup(&idm, fds[i].fd);
         if (rs) {
+            fastlock_acquire(&rs->cq_wait_lock);
             if (rs->type == SOCK_STREAM)
                 rs_get_cq_event(rs);
             else
                 ds_get_cq_event(rs);
+            fastlock_release(&rs->cq_wait_lock);
             fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
         } else {
             fds[i].revents = rfds[i].revents;
@@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
     if (rs->state & rs_connected)
         rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+    if (rs->state & rs_disconnected) {
+        /* Generate event by flushing receives to unblock rpoll */
+        ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+        rdma_disconnect(rs->cm_id);
+    }
+
     if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
         rs_set_nonblocking(rs, rs->fd_flags);
 




RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-16 Thread Hefty, Sean
 I am looking at a multithreaded application here, and I believe that
 the race is between thread A calling the rpoll() for POLLIN event and
 thread B calling the shutdown(SHUT_RDWR) for reading and writing of
 the (r)socket almost immediately afterwards.

I modified a test program, and I can reproduce the hang as you describe -- 
calling rpoll() then rshutdown() from another thread.

These calls end up calling rpoll->poll followed by rshutdown->read.  The read 
completes, which completes rshutdown, but the poll continues to wait for an 
event.  In the kernel, poll ends up in uverbs.c::ib_uverbs_event_poll(), and read 
in uverbs.c::ib_uverbs_event_read().

The behavior of poll/read seems reasonable, so I don't think this is a kernel 
issue.  I'm still trying to figure out a simple solution to fixing this.
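
For reference, a minimal sketch of that kind of reproducer (not the actual test
program; it assumes an already connected rsocket fd and uses the rsocket API
from <rdma/rsocket.h> plus pthreads):

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <rdma/rsocket.h>

static int rfd;                 /* an already connected rsocket */

static void *poller(void *arg)
{
        struct pollfd pfd = { .fd = rfd, .events = POLLIN };

        /* Thread A: block in rpoll() waiting for data or a hangup. */
        int n = rpoll(&pfd, 1, 30000);
        printf("rpoll returned %d, revents=0x%x\n", n, pfd.revents);
        return NULL;
}

int main(void)
{
        pthread_t t;

        /* ... rsocket()/rconnect() setup of rfd omitted ... */
        pthread_create(&t, NULL, poller, NULL);
        sleep(1);               /* let thread A enter rpoll() */

        /* Thread B: shut the rsocket down while thread A is still polling.
         * With the hang, rpoll() never wakes up here. */
        rshutdown(rfd, SHUT_RDWR);
        pthread_join(t, NULL);
        rclose(rfd);
        return 0;
}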

- Sean


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-14 Thread Andreas Bluemle
Hi,

maybe some information about the environment I am
working in:

- CentOS 6.4 with custom kernel 3.8.13
- librdmacm / librspreload from git, tag 1.0.17
- application started with librspreload in LD_PRELOAD environment

Currently, I have increased the value of the spin time by setting the
default value for polling_time in the source code.

I guess that the correct way to do this is via
configuration in /etc/rdma/rsocket/polling_time?

Concerning the rpoll() itself, some more comments/questions
embedded below.

On Tue, 13 Aug 2013 21:44:42 +
Hefty, Sean sean.he...@intel.com wrote:

   I found a workaround for my (our) problem: in the librdmacm
   code, rsocket.c, there is a global constant polling_time, which
   is set to 10 microseconds at the moment.
  
   I raise this to 10000 - and all of a sudden things work nicely.
  
  I am adding the linux-rdma list to CC so Sean might see this.
  
  If I understand what you are describing, the caller to rpoll()
  spins for up to 10 ms (10,000 us) before calling the real poll().
  
  What is the purpose of the real poll() call? Is it simply a means
  to block the caller and avoid spinning? Or does it actually expect
  to detect an event?
 
 When the real poll() is called, an event is expected on an fd
 associated with the CQ's completion channel. 

The first question I would have is: why is the rpoll() split into
these two pieces? There must have been some reason to do a busy
loop on some local state information rather than just call the
real poll() directly.
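
To make sure we are talking about the same structure, here is a much
simplified sketch of the split I mean (rs_poll_check() and polling_time are
the names from rsocket.c; rs_poll_block() and rs_time_us() are only
placeholder names for this sketch):

#include <poll.h>
#include <stdint.h>

extern int rs_poll_check(struct pollfd *fds, nfds_t nfds);   /* rsocket.c helper */
extern int rs_poll_block(struct pollfd *fds, nfds_t nfds, int timeout); /* placeholder */
extern uint64_t rs_time_us(void);                             /* placeholder clock */
extern int polling_time;                                      /* rsocket.c tunable (us) */

/* Phase 1: spin on local/CQ state; Phase 2: block in the real poll(). */
int rpoll_sketch(struct pollfd *fds, nfds_t nfds, int timeout)
{
        uint64_t start = rs_time_us();
        int ret;

        do {
                ret = rs_poll_check(fds, nfds); /* no syscall, checks rsocket state */
                if (ret || !timeout)
                        return ret;
        } while (rs_time_us() - start < (uint64_t)polling_time);

        /* Arm CQ notification, then block in poll() on the completion
         * channel fds for the remaining timeout. */
        return rs_poll_block(fds, nfds, timeout);
}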

   I think we are looking at two issues here:
   1. the thread structure of ceph messenger
 For a given socket connection, there are 3 threads of interest
 here: the main messenger thread, the Pipe::reader and the
 Pipe::writer.
  
 For a ceph client like the ceph admin command, I see the
   following sequence
   - the connection to the ceph monitor is created by the
 main messenger  thread, the Pipe::reader and
 Pipe::writer are instantiated.
   - the requested command is sent to the ceph monitor, the
 answer is read and printed
   - at this point the Pipe::reader already has called
 tcp_read_wait(), polling for more data or connection
   termination
   - after the response had been printed, the main loop calls the
  shutdown routines which in turn shutdown() the socket
  
  There is some time between the last two steps - and this gap is
  long enough to open a race:
  
   2. rpoll, ibv and poll
 the rpoll implementation in rsockets is split in 2 phases:
 - a busy loop which checks the state of the underlying ibv
   queue pair
 - the call to real poll() system call (i.e. the uverbs(?)
   implementation of poll() inside the kernel)
  
 The busy loop has a maximum duration of polling_time (10
    microseconds by default) - and is able to detect the local shutdown
   and returns a POLLHUP.
  
 The poll() system call (i.e. the uverbs implementation of poll()
 in the kernel) does not detect the local shutdown - and only
   returns after the caller supplied timeout expires.
 
 It sounds like there's an issue here either with a message getting
 lost or a race.  Given that spinning longer works for you, it sounds
 like an event is getting lost, not being generated correctly, or not
 being configured to generate.
 

I am looking at a multithreaded application here, and I believe that
the race is between thread A calling the rpoll() for POLLIN event and
thread B calling the shutdown(SHUT_RDWR) for reading and writing of
the (r)socket almost immediately afterwards.

I think that the shutdown itself does not cause a POLLHUP event to be
generated from the kernel to interrupt the real poll().
(BTW: which kernel module implements the poll() for rsockets?
Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)

   Increasing the rsockets polling_time from 10 to 10000
   microseconds results in rpoll() detecting the local shutdown
   within the busy loop.
  
   Decreasing the ceph ms tcp read timeout from the default of 900
   to 5 seconds serves a similar purpose, but is much coarser.
  
   From my understanding, the real issue is neither at the ceph nor
   at the rsockets level: it is related to the uverbs kernel module.
  
   An alternative way to address the current problem at the rsockets
   level would be a re-write of the rpoll(): instead of the busy
   loop at the beginning followed by the real poll() call with the
   full user specified timeout value (ms tcp read timeout in our
   case), I would embed the real poll() into a loop, splitting the
   user specified timeout into smaller portions and doing the
   rsockets specific rs_poll_check() on every timeout of the real
   poll().
  
  I have not looked at the rsocket code, so take the following with a
  grain of salt. If the purpose of the real poll() is to simply block
  the user for a specified time, then you can simply make it a short
  duration (taking 

Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-14 Thread Atchley, Scott
On Aug 14, 2013, at 3:21 AM, Andreas Bluemle andreas.blue...@itxperts.de 
wrote:

 Hi,
 
 maybe some information about the environment I am
 working in:
 
 - CentOS 6.4 with custom kernel 3.8.13
 - librdmacm / librspreload from git, tag 1.0.17
 - application started with librspreload in LD_PRELOAD environment
 
 Currently, I have increased the value of the spin time by setting the
 default value for polling_time in the source code.
 
 I guess that the correct way to do this is via
 configuration in /etc/rdma/rsocket/polling_time?
 
 Concerning the rpoll() itself, some more comments/questions
 embedded below.
 
 On Tue, 13 Aug 2013 21:44:42 +
 Hefty, Sean sean.he...@intel.com wrote:
 
 I found a workaround for my (our) problem: in the librdmacm
 code, rsocket.c, there is a global constant polling_time, which
 is set to 10 microseconds at the moment.
 
 I raise this to 10000 - and all of a sudden things work nicely.
 
 I am adding the linux-rdma list to CC so Sean might see this.
 
 If I understand what you are describing, the caller to rpoll()
 spins for up to 10 ms (10,000 us) before calling the real poll().
 
 What is the purpose of the real poll() call? Is it simply a means
 to block the caller and avoid spinning? Or does it actually expect
 to detect an event?
 
 When the real poll() is called, an event is expected on an fd
 associated with the CQ's completion channel. 
 
 The first question I would have is: why is the rpoll() split into
 these two pieces? There must have been some reason to do a busy
 loop on some local state information rather than just call the
 real poll() directly.

Sean can answer specifically, but this is a typical HPC technique. The worst 
thing you can do is handle an event and then block when the next event is 
already available. This adds 1-3 us of latency, which is unacceptable in HPC. In 
HPC, we poll. If we worry about power, we poll until we get no more events and 
then we poll a little more before blocking. Determining the "little more" is the 
fun part. ;-) 
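
Something like this generic spin-then-block skeleton (purely illustrative;
check_for_event() and wait_for_event() stand in for whatever the transport
provides, e.g. ibv_poll_cq() and a blocking wait on the completion channel fd):

#include <stdint.h>
#include <time.h>

/* Placeholders for the transport-specific primitives. */
extern int check_for_event(void);           /* non-blocking, e.g. ibv_poll_cq() */
extern int wait_for_event(int timeout_ms);  /* blocking, e.g. poll() on a channel fd */

static uint64_t now_us(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Spin for up to spin_us, then fall back to a blocking wait. */
int get_event(int spin_us, int timeout_ms)
{
        uint64_t start = now_us();
        int ret;

        do {
                ret = check_for_event();
                if (ret)
                        return ret;         /* got one without ever blocking */
        } while (now_us() - start < (uint64_t)spin_us);

        return wait_for_event(timeout_ms);  /* give up the CPU */
}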

Scott


RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-14 Thread Hefty, Sean
 The first question I would have is: why is the rpoll() split into
 these two pieces? There must have been some reason to do a busy
 loop on some local state information rather than just call the
 real poll() directly.

As Scott mentioned in his email, this is done for performance reasons.  The 
cost of always dropping into the kernel is too high for HPC.

 I am looking at a multithreaded application here, and I believe that
 the race is between thread A calling the rpoll() for POLLIN event and
 thread B calling the shutdown(SHUT_RDWR) for reading and writing of
 the (r)socket almost immediately afterwards.

Ah - this is likely the issue.  I did not assume that rshutdown() would be 
called simultaneously with rpoll().  I need to think about how to solve this, 
so that rpoll() unblocks.
 
 I think that the shutdown itself does not cause a POLLHUP event to be
 generated from the kernel to interrupt the real poll().
 (BTW: which kernel module implements the poll() for rsockets?
 Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)

The POLLHUP event in rsockets is just software indicating that such an 'event' 
occurred - basically when a call to rpoll() detects that the rsocket state is 
disconnected.

I believe that the real poll() call traps into ib_uverbs_event_poll() in the 
kernel.  The fd associated with the poll call corresponds to a 'completion 
channel', which is used to report events which occur on a CQ.  Connection 
related events don't actually go to that fd - only completions for data 
transfers.
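
For context, the usual libibverbs pattern for waiting on that completion
channel fd looks roughly like this (a generic sketch of the standard verbs
calls, not the rsockets internals):

#include <poll.h>
#include <infiniband/verbs.h>

/* Block until a completion event arrives on the CQ's completion channel or
 * the timeout expires.  Returns 1 on event, 0 on timeout, -1 on error. */
int wait_cq_event(struct ibv_comp_channel *channel, struct ibv_cq *cq,
                  int timeout_ms)
{
        struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        int ret;

        /* Arm the CQ so the next completion generates a channel event. */
        if (ibv_req_notify_cq(cq, 0))
                return -1;

        ret = poll(&pfd, 1, timeout_ms);        /* the "real" poll() */
        if (ret <= 0)
                return ret;

        /* Consume and acknowledge the event before reaping completions. */
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
                return -1;
        ibv_ack_cq_events(ev_cq, 1);
        return 1;
}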

- Sean


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-13 Thread Atchley, Scott
On Aug 13, 2013, at 10:06 AM, Andreas Bluemle andreas.blue...@itxperts.de 
wrote:

 Hi Matthew,
 
 I found a workaround for my (our) problem: in the librdmacm
 code, rsocket.c, there is a global constant polling_time, which
 is set to 10 microseconds at the moment.
 
 I raise this to 10000 - and all of a sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this.

If I understand what you are describing, the caller to rpoll() spins for up to 
10 ms (10,000 us) before calling the real poll().

What is the purpose of the real poll() call? Is it simply a means to block the 
caller and avoid spinning? Or does it actually expect to detect an event?

 I think we are looking at two issues here:
 1. the thread structure of ceph messenger
   For a given socket connection, there are 3 threads of interest
   here: the main messenger thread, the Pipe::reader and the
   Pipe::writer.
 
   For a ceph client like the ceph admin command, I see the following
   sequence
 - the connection to the ceph monitor is created by the
   main messenger  thread, the Pipe::reader and
   Pipe::writer are instantiated.
 - the requested command is sent to the ceph monitor, the
   answer is read and printed
 - at this point the Pipe::reader already has called
   tcp_read_wait(), polling for more data or connection termination
 - after the response had been printed, the main loop calls the
   shutdown routines which in turn shutdown() the socket
 
There is some time between the last two steps - and this gap is
long enough to open a race:
 
 2. rpoll, ibv and poll
   the rpoll implementation in rsockets is split in 2 phases:
   - a busy loop which checks the state of the underlying ibv queue pair
   - the call to real poll() system call (i.e. the uverbs(?)
 implementation of poll() inside the kernel)
 
   The busy loop has a maximum duration of polling_time (10 microseconds
   by default) - and is able to detect the local shutdown and returns a
   POLLHUP.
 
   The poll() system call (i.e. the uverbs implementation of poll() 
   in the kernel) does not detect the local shutdown - and only returns
   after the caller supplied timeout expires.
 
 Increasing the rsockets polling_time from 10 to 10000 microseconds
 results in rpoll() detecting the local shutdown within the busy loop.
 
 Decreasing the ceph ms tcp read timeout from the default of 900 to 5
 seconds serves a similar purpose, but is much coarser.
 
 From my understanding, the real issue is neither at the ceph nor at the
 rsockets level: it is related to the uverbs kernel module.
 
 An alternative way to address the current problem at the rsockets level
 would be a re-write of the rpoll(): instead of the busy loop at the
 beginning followed by the real poll() call with the full user
 specified timeout value (ms tcp read timeout in our case), I would
 embed the real poll() into a loop, splitting the user specified timeout
 into smaller portions and doing the rsockets specific rs_poll_check()
 on every timeout of the real poll().

I have not looked at the rsocket code, so take the following with a grain of 
salt. If the purpose of the real poll() is to simply block the user for a 
specified time, then you can simply make it a short duration (taking into 
consideration what granularity the OS provides) and then call ibv_poll_cq(). 
Keep in mind, polling will prevent your CPU from reducing power.

If the real poll() is actually checking for something (e.g. checking on the 
RDMA channel's fd or the IB channel's fd), then you may not want to spin too 
much.
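
As a rough illustration of the timeout-slicing idea quoted above (purely a
sketch; rs_poll_check() is the rsocket.c helper mentioned in this thread, the
rest are made-up names):

#include <poll.h>

extern int rs_poll_check(struct pollfd *fds, nfds_t nfds);  /* rsockets state check */

#define SLICE_MS 50     /* arbitrary slice length for the sketch */

/* Block for up to timeout_ms, but re-check the rsocket state every SLICE_MS
 * so a local shutdown is noticed without waiting out the full timeout. */
int rpoll_sliced(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds,
                 int timeout_ms)
{
        int remaining = timeout_ms;
        int ret;

        do {
                int slice = (timeout_ms < 0 || remaining > SLICE_MS) ?
                            SLICE_MS : remaining;

                ret = poll(rfds, nfds, slice);          /* the real poll() */
                if (ret != 0)
                        return ret;                     /* event or error */

                ret = rs_poll_check(fds, nfds);         /* e.g. notices shutdown */
                if (ret)
                        return ret;

                if (timeout_ms >= 0)
                        remaining -= slice;
        } while (timeout_ms < 0 || remaining > 0);

        return 0;
}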

Scott

 Best Regards
 
 Andreas Bluemle
 
 
 On Tue, 13 Aug 2013 07:53:12 +0200
 Andreas Bluemle andreas.blue...@itxperts.de wrote:
 
 Hi Matthew,
 
 I can confirm the behaviour which you describe.
 I too believe that the problem is on the client side (ceph command).
 My log files show the very same symptom, i.e. the client side
 not being able to shut down the pipes properly.
 
 (Q: I had problems yesterday sending a mail to the ceph-users list
 with the log files attached, because the size of the attachments
 exceeded some limit; I hadn't been subscribed to the list at that
 point. Is the use of pastebin.com the better way to provide such
 lengthy information in general?)
 
 
 Best Regards
 
 Andreas Bluemle
 
 On Tue, 13 Aug 2013 11:59:36 +0800
 Matthew Anderson manderson8...@gmail.com wrote:
 
 Moving this conversation to ceph-devel where the dev's might be able
 to shed some light on this.
 
 I've added some additional debug to my code to narrow the issue down
 a bit, and the reader thread appears to be getting locked by
 tcp_read_wait() because rpoll never returns an event when the socket
 is shut down. A hack way of proving this was to lower the timeout in
 rpoll to 5 seconds. When a command like 'ceph osd tree' completes you
 can see it block for 5 seconds until rpoll times out and returns 0.
 The reader thread is then able to join and the 

RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-13 Thread Hefty, Sean
  I found a workaround for my (our) problem: in the librdmacm
  code, rsocket.c, there is a global constant polling_time, which
  is set to 10 microseconds at the moment.
 
  I raise this to 10000 - and all of a sudden things work nicely.
 
 I am adding the linux-rdma list to CC so Sean might see this.
 
 If I understand what you are describing, the caller to rpoll() spins for up to
 10 ms (10,000 us) before calling the real poll().
 
 What is the purpose of the real poll() call? Is it simply a means to block the
 caller and avoid spinning? Or does it actually expect to detect an event?

When the real poll() is called, an event is expected on an fd associated with 
the CQ's completion channel.
 
  I think we are looking at two issues here:
  1. the thread structure of ceph messenger
For a given socket connection, there are 3 threads of interest
here: the main messenger thread, the Pipe::reader and the
Pipe::writer.
 
For a ceph client like the ceph admin command, I see the following
sequence
  - the connection to the ceph monitor is created by the
main messenger  thread, the Pipe::reader and
Pipe::writer are instantiated.
  - the requested command is sent to the ceph monitor, the
answer is read and printed
  - at this point the Pipe::reader already has called
tcp_read_wait(), polling for more data or connection termination
  - after the response had been printed, the main loop calls the
shutdown routines which in turn shutdown() the socket
 
 There is some time between the last two steps - and this gap is
 long enough to open a race:
 
  2. rpoll, ibv and poll
the rpoll implementation in rsockets is split in 2 phases:
- a busy loop which checks the state of the underlying ibv queue pair
- the call to real poll() system call (i.e. the uverbs(?)
  implementation of poll() inside the kernel)
 
The busy loop has a maximum duration of polling_time (10 microseconds
by default) - and is able to detect the local shutdown and returns a
POLLHUP.
 
The poll() system call (i.e. the uverbs implementation of poll()
in the kernel) does not detect the local shutdown - and only returns
after the caller supplied timeout expires.

It sounds like there's an issue here either with a message getting lost or a 
race.  Given that spinning longer works for you, it sounds like an event is 
getting lost, not being generated correctly, or not being configured to 
generate.

  Increasing the rsockets polling_time from 10 to 10000 microseconds
  results in rpoll() detecting the local shutdown within the busy loop.
 
  Decreasing the ceph ms tcp read timeout from the default of 900 to 5
  seconds serves a similar purpose, but is much coarser.
 
  From my understanding, the real issue is neither at the ceph nor at the
  rsockets level: it is related to the uverbs kernel module.
 
  An alternative way to address the current problem at the rsockets level
  would be a re-write of the rpoll(): instead of the busy loop at the
  beginning followed by the real poll() call with the full user
  specified timeout value (ms tcp read timeout in our case), I would
  embed the real poll() into a loop, splitting the user specified timeout
  into smaller portions and doing the rsockets specific rs_poll_check()
  on every timeout of the real poll().
 
 I have not looked at the rsocket code, so take the following with a grain of
 salt. If the purpose of the real poll() is to simply block the user for a
 specified time, then you can simply make it a short duration (taking into
 consideration what granularity the OS provides) and then call ibv_poll_cq().
 Keep in mind, polling will prevent your CPU from reducing power.
 
 If the real poll() is actually checking for something (e.g. checking on the
 RDMA channel's fd or the IB channel's fd), then you may not want to spin too
 much.

The real poll() call is intended to block the application until a timeout 
occurs or an event shows up.  Since increasing the spin time works for you, it 
makes me suspect that there is a bug in the CQ event handling in rsockets.
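
For what it's worth, the usual pattern for not losing a completion around
re-arming looks like this (a generic verbs sketch, not a claim about where the
rsockets bug actually is):

#include <infiniband/verbs.h>

/* Drain the CQ; returns the number of completions handled, or -1 on error. */
static int drain_cq(struct ibv_cq *cq)
{
        struct ibv_wc wc;
        int n, total = 0;

        while ((n = ibv_poll_cq(cq, 1, &wc)) > 0)
                total++;                        /* handle the completion here */
        return n < 0 ? -1 : total;
}

/* Drain, arm notification, then drain once more so a completion that raced
 * with the arming is not missed before blocking on the channel fd. */
int drain_and_rearm(struct ibv_cq *cq)
{
        int first, again;

        first = drain_cq(cq);
        if (first < 0 || ibv_req_notify_cq(cq, 0))
                return -1;
        again = drain_cq(cq);                   /* catch anything that slipped in */
        return again < 0 ? -1 : first + again;
}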
 
  What's particularly weird is that the monitor receives a POLLHUP
  event when the ceph command shuts down its socket but the ceph
  command never does. When using regular sockets both sides of the
  connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
  sockets are shut down. It would seem like there is a bug in rsockets
  that causes the side that calls shutdown first not to receive the
  correct rpoll events.

rsockets does not support POLLRDHUP.

- Sean