Re: IB SR-IOV

2013-08-13 Thread Or Gerlitz

On 12/08/2013 10:01, Ernst Gunnar Gran wrote:

I’m about to start experimenting with SR-IOV and IB (ConnectX-2 G2, HP DL360p 
Gen8, E5-2609). Is it OK to just download the latest OFED release and start 
playing, or are there any known issues that I should know of when it comes to 
SR-IOV with the current release?


I would recommend using the IB/RDMA stack from the upstream kernel, or 
the one provided with one of the distros that picked up the SR-IOV support 
from upstream into the bits they ship. Note that IB and Ethernet are 
supported for ConnectX VFs, but RoCE is not yet.


Also note that you need to do some preparation at the node level (BIOS, 
kernel boot command line) and at the card level (re-burn the firmware with 
SR-IOV support enabled; see the web or your Mellanox rep).

Or.



Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries

2013-08-13 Thread Or Gerlitz

On 09/08/2013 03:44, Jim Foraker wrote:

In several places, this snippet is used when removing neigh entries:

list_del(&neigh->list);
ipoib_neigh_free(neigh);

The list_del() removes neigh from the associated struct ipoib_path, while
ipoib_neigh_free() removes neigh from the device's neigh entry lookup
table.  Both of these operations are protected by the priv->lock
spinlock.  The table however is also protected via RCU, and so naturally
the lock is not held when doing reads.

This leads to a race condition, in which a thread may successfully look
up a neigh entry that has already been deleted from neigh->list.  Since
the previous deletion will have marked the entry with poison, a second
list_del() on the object will cause a panic:

   #5 [8802338c3c70] general_protection at 815108c5
  [exception RIP: list_del+16]
  RIP: 81289020  RSP: 8802338c3d20  RFLAGS: 00010082
  RAX: dead00200200  RBX: 880433e60c88  RCX: 9e6c
  RDX: 0246  RSI: 8806012ca298  RDI: 880433e60c88
  RBP: 8802338c3d30   R8: 8806012ca2e8   R9: 
  R10: 0001  R11:   R12: 8804346b2020
  R13: 88032a3e7540  R14: 8804346b26e0  R15: 0246
  ORIG_RAX:   CS: 0010  SS: 
   #6 [8802338c3d38] ipoib_cm_tx_handler at a066fe0a [ib_ipoib]
   #7 [8802338c3d98] cm_process_work at a05149a7 [ib_cm]
   #8 [8802338c3de8] cm_work_handler at a05161aa [ib_cm]
   #9 [8802338c3e38] worker_thread at 81090e10
  #10 [8802338c3ee8] kthread at 81096c66
  #11 [8802338c3f48] kernel_thread at 8100c0ca

We move the list_del() into ipoib_neigh_free(), so that deletion happens
only once, after the entry has been successfully removed from the lookup
table.  This same behavior is already used in ipoib_del_neighs_by_gid()
and __ipoib_reap_neigh().

Signed-off-by: Jim Foraker <forak...@llnl.gov>
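
For illustration, a minimal sketch of the pattern the changelog describes (not
the upstream diff; the hash-table walk under priv->lock and the call_rcu() free
are elided):

/*
 * Illustrative sketch only, not the upstream diff.  The real
 * ipoib_neigh_free() also walks the RCU hash table under priv->lock and
 * frees the entry with call_rcu(); those details are elided here.
 *
 * Callers previously did both steps themselves:
 *
 *	list_del(&neigh->list);		unlink from the ipoib_path list
 *	ipoib_neigh_free(neigh);	unlink from the lookup table
 *
 * so two racing deleters could each run list_del() on the same,
 * already-poisoned entry.
 */
static void ipoib_neigh_free(struct ipoib_neigh *neigh)
{
	/* ... find and remove neigh from the device's neigh hash table,
	 * protected by priv->lock ... */

	/* Unlink from the parent ipoib_path list here, after the hash-table
	 * removal has succeeded, so the list_del() can only happen once per
	 * entry no matter how many threads race to delete it. */
	list_del(&neigh->list);

	/* ... call_rcu() to free the entry once RCU readers are done ... */
}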


Hi Jim,

I have reviewed the patch with Shlomo, who did the neighboring changes in 
IPoIB -- impressive analysis and debugging, and a good and proper fix to the 
issue presented e.g. here: marc.info/?l=linux-rdma&m=136881939301117&w=2



Or.




Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries

2013-08-13 Thread Jack Wang
On 08/13/2013 09:54 AM, Or Gerlitz wrote:
 On 09/08/2013 03:44, Jim Foraker wrote:
 In several places, this snippet is used when removing neigh entries:

 list_del(&neigh->list);
 ipoib_neigh_free(neigh);

 The list_del() removes neigh from the associated struct ipoib_path, while
 ipoib_neigh_free() removes neigh from the device's neigh entry lookup
 table.  Both of these operations are protected by the priv->lock
 spinlock.  The table however is also protected via RCU, and so naturally
 the lock is not held when doing reads.

 This leads to a race condition, in which a thread may successfully look
 up a neigh entry that has already been deleted from neigh->list.  Since
 the previous deletion will have marked the entry with poison, a second
 list_del() on the object will cause a panic:

#5 [8802338c3c70] general_protection at 815108c5
   [exception RIP: list_del+16]
   RIP: 81289020  RSP: 8802338c3d20  RFLAGS: 00010082
   RAX: dead00200200  RBX: 880433e60c88  RCX: 9e6c
   RDX: 0246  RSI: 8806012ca298  RDI: 880433e60c88
   RBP: 8802338c3d30   R8: 8806012ca2e8   R9: 
   R10: 0001  R11:   R12: 8804346b2020
   R13: 88032a3e7540  R14: 8804346b26e0  R15: 0246
   ORIG_RAX:   CS: 0010  SS: 
#6 [8802338c3d38] ipoib_cm_tx_handler at a066fe0a
 [ib_ipoib]
#7 [8802338c3d98] cm_process_work at a05149a7 [ib_cm]
#8 [8802338c3de8] cm_work_handler at a05161aa [ib_cm]
#9 [8802338c3e38] worker_thread at 81090e10
   #10 [8802338c3ee8] kthread at 81096c66
   #11 [8802338c3f48] kernel_thread at 8100c0ca

 We move the list_del() into ipoib_neigh_free(), so that deletion happens
 only once, after the entry has been successfully removed from the lookup
 table.  This same behavior is already used in ipoib_del_neighs_by_gid()
 and __ipoib_reap_neigh().

 Signed-off-by: Jim Foraker <forak...@llnl.gov>
 
 Hi Jim,
 
 I have reviewed the patch with Shlomo, who did the neighboring changes in
 IPoIB -- impressive analysis and debugging, and a good and proper fix to the
 issue presented e.g. here: marc.info/?l=linux-rdma&m=136881939301117&w=2
 
 
 Or.
 
 
Looks nice for me too.

Jack


Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries

2013-08-13 Thread Shlomo Pongratz

On 8/13/2013 11:11 AM, Jack Wang wrote:

On 08/13/2013 09:54 AM, Or Gerlitz wrote:

On 09/08/2013 03:44, Jim Foraker wrote:

In several places, this snippet is used when removing neigh entries:

 list_del(&neigh->list);
 ipoib_neigh_free(neigh);

The list_del() removes neigh from the associated struct ipoib_path, while
ipoib_neigh_free() removes neigh from the device's neigh entry lookup
table.  Both of these operations are protected by the priv->lock
spinlock.  The table however is also protected via RCU, and so naturally
the lock is not held when doing reads.

This leads to a race condition, in which a thread may successfully look
up a neigh entry that has already been deleted from neigh->list.  Since
the previous deletion will have marked the entry with poison, a second
list_del() on the object will cause a panic:

#5 [8802338c3c70] general_protection at 815108c5
   [exception RIP: list_del+16]
   RIP: 81289020  RSP: 8802338c3d20  RFLAGS: 00010082
   RAX: dead00200200  RBX: 880433e60c88  RCX: 9e6c
   RDX: 0246  RSI: 8806012ca298  RDI: 880433e60c88
   RBP: 8802338c3d30   R8: 8806012ca2e8   R9: 
   R10: 0001  R11:   R12: 8804346b2020
   R13: 88032a3e7540  R14: 8804346b26e0  R15: 0246
   ORIG_RAX:   CS: 0010  SS: 
#6 [8802338c3d38] ipoib_cm_tx_handler at a066fe0a
[ib_ipoib]
#7 [8802338c3d98] cm_process_work at a05149a7 [ib_cm]
#8 [8802338c3de8] cm_work_handler at a05161aa [ib_cm]
#9 [8802338c3e38] worker_thread at 81090e10
   #10 [8802338c3ee8] kthread at 81096c66
   #11 [8802338c3f48] kernel_thread at 8100c0ca

We move the list_del() into ipoib_neigh_free(), so that deletion happens
only once, after the entry has been successfully removed from the lookup
table.  This same behavior is already used in ipoib_del_neighs_by_gid()
and __ipoib_reap_neigh().

Signed-off-by: Jim Foraker <forak...@llnl.gov>

Hi Jim,

I have reviewed the patch with Shlomo, who did the neighboring changes in
IPoIB -- impressive analysis and debugging, and a good and proper fix to the
issue presented e.g. here: marc.info/?l=linux-rdma&m=136881939301117&w=2


Or.



Looks nice for me too.

Jack

Looks good to me, thanks.
S.P.



Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-13 Thread Atchley, Scott
On Aug 13, 2013, at 10:06 AM, Andreas Bluemle andreas.blue...@itxperts.de 
wrote:

 Hi Matthew,
 
 I found a workaround for my (our) problem: in the librdmacm
 code, rsocket.c, there is a global constant polling_time, which
 is set to 10 microseconds at the moment.
 
 I raise this to 10000 - and all of a sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this.

If I understand what you are describing, the caller to rpoll() spins for up to 
10 ms (10,000 us) before calling the real poll().

What is the purpose of the real poll() call? Is it simply a means to block the 
caller and avoid spinning? Or does it actually expect to detect an event?

 I think we are looking at two issues here:
 1. the thread structure of ceph messenger
   For a given socket connection, there are 3 threads of interest
   here: the main messenger thread, the Pipe::reader and the
   Pipe::writer.
 
   For a ceph client like the ceph admin command, I see the following
   sequence
 - the connection to the ceph monitor is created by the
   main messenger  thread, the Pipe::reader and
   Pipe::writer are instantiated.
 - the requested command is sent to the ceph monitor, the
   answer is read and printed
 - at this point the Pipe::reader already has called
   tcp_read_wait(), polling for more data or connection termination
 - after the response had been printed, the main loop calls the
   shutdown routines which in turn shutdown() the socket
 
There is some time between the last two steps - and this gap is
long enough to open a race:
 
 2. rpoll, ibv and poll
   the rpoll implementation in rsockets is split in 2 phases:
   - a busy loop which checks the state of the underlying ibv queue pair
   - the call to real poll() system call (i.e. the uverbs(?)
 implementation of poll() inside the kernel)
 
   The busy loop has a maximum duration of polling_time (10 microseconds
   by default) - and is able to detect the local shutdown and returns a
   POLLHUP.
 
   The poll() system call (i.e. the uverbs implementation of poll() 
   in the kernel) does not detect the local shutdown - and only returns
   after the caller supplied timeout expires.
 
 Increasing the rsockets polling_time from 10 to 10000 microseconds
 results in rpoll detecting the local shutdown within the busy loop.
 
 Decreasing the ceph ms tcp read timeout from the default of 900 to 5
 seconds serves a similar purpose, but is much coarser.
 
 From my understanding, the real issue is neither at the ceph nor at the
 rsockets level: it is related to the uverbs kernel module.
 
 An alternative way to address the current problem at the rsockets level
 would be a re-write of the rpoll(): instead of the busy loop at the
 beginning followed by the real poll() call with the full user-specified
 timeout value (ms tcp read timeout in our case), I would embed the real
 poll() into a loop, splitting the user-specified timeout into smaller
 portions and doing the rsockets-specific rs_poll_check() on every timeout
 of the real poll().
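
As a rough sketch of that proposal (hypothetical code; rs_poll_check() is named
as in the mail above, and its real signature in rsocket.c may differ):

#include <poll.h>

/* Assumed helper from rsocket.c; only the shape of the loop is the point:
 * slice the caller's timeout so that a local shutdown noticed by the
 * rsockets state check is reported without waiting out the full timeout. */
static int rs_poll_check(struct pollfd *fds, nfds_t nfds);

static int rpoll_sliced(struct pollfd *fds, nfds_t nfds, int timeout)
{
	const int slice = 100;	/* ms per real poll() call, tunable */
	int left = timeout;

	do {
		int t = (timeout < 0 || left > slice) ? slice : left;
		int ret = poll(fds, nfds, t);	/* the "real" poll() */

		if (ret != 0)
			return ret;		/* events ready, or error */

		/* rsockets-specific recheck, e.g. local shutdown -> POLLHUP */
		ret = rs_poll_check(fds, nfds);
		if (ret)
			return ret;

		if (timeout >= 0)
			left -= t;
	} while (timeout < 0 || left > 0);

	return 0;	/* timed out with no events */
}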

I have not looked at the rsocket code, so take the following with a grain of 
salt. If the purpose of the real poll() is to simply block the user for a 
specified time, then you can simply make it a short duration (taking into 
consideration what granularity the OS provides) and then call ibv_poll_cq(). 
Keep in mind, polling will prevent your CPU from reducing power.

If the real poll() is actually checking for something (e.g. checking on the 
RDMA channel's fd or the IB channel's fd), then you may not want to spin too 
much.

Scott

 Best Regards
 
 Andreas Bluemle
 
 
 On Tue, 13 Aug 2013 07:53:12 +0200
 Andreas Bluemle andreas.blue...@itxperts.de wrote:
 
 Hi Matthew,
 
 I can confirm the behaviour which you describe.
 I too believe that the problem is on the client side (ceph command).
 My log files show the very same symptom, i.e. the client side
 not being able to shutdown the pipes properly.
 
 (Q: I had problems yesterday sending a mail to the ceph-users list
 with the log files attached because the size of the attachments
 exceeded some limit; I hadn't been subscribed to the list at that
 point. Is the use of pastebin.com the better way to provide such
 lengthy information in general?)
 
 
 Best Regards
 
 Andreas Bluemle
 
 On Tue, 13 Aug 2013 11:59:36 +0800
 Matthew Anderson manderson8...@gmail.com wrote:
 
 Moving this conversation to ceph-devel where the dev's might be able
 to shed some light on this.
 
 I've added some additional debug to my code to narrow the issue down
 a bit and the reader thread appears to be getting locked by
 tcp_read_wait() because rpoll never returns an event when the socket
 is shut down. A hacky way of proving this was to lower the timeout in
 rpoll to 5 seconds. When a command like 'ceph osd tree' completes you
 can see it block for 5 seconds until rpoll times out and returns 0.
 The reader thread is then able to join and the 

RE: [PATCH 13/16] qib: clean up unnecessary MSI/MSI-X capability find

2013-08-13 Thread Marciniszyn, Mike
 Subject: [PATCH 13/16] qib: clean up unnecessary MSI/MSI-X capability find
 

Thanks for the patch!

Acked-by: Mike Marciniszyn mike.marcinis...@intel.com


Re: [PATCH V4 for-next 2/4] IB/core: Infra-structure to support verbs extensions through uverbs

2013-08-13 Thread Roland Dreier
 Add Infra-structure to support extended uverbs capabilities in a 
 forward/backward
 manner. Uverbs command opcodes which are based on the verbs extensions 
 approach should
 be greater or equal to IB_USER_VERBS_CMD_THRESHOLD. They have new header 
 format
 and processed a bit differently.

Previously I wrote the following about an earlier version of the
patch. (Quite a while before you sent the new version).  You seem to
have ignored it (and also ignored my correction about how to write
infrastructure):

  I think you missed the feedback I gave to the previous version of this patch:

   This patch at least doesn't have a sufficient changelog.  I don't
   understand what extended capabilities are or why we need to change
   the header format.

   What is the verbs extensions approach?  Why does the kernel need to
   know about it?  What is different about the processing?

Yes, you have tried to answer those questions separately but do you
think git writes changelogs magically when a patch is applied?  If you
don't supply a good enough changelog, you're expecting me to write it
for you.

 +struct ib_uverbs_cmd_hdr_ex {
 +   __u32 command;
 +   __u16 in_words;
 +   __u16 out_words;
 +   __u16 provider_in_words;
 +   __u16 provider_out_words;
 +   __u32 cmd_hdr_reserved;
 +};
 +

If I understand the vague explanations and the analogy to the
presentation about userspace, then cmd_hdr_reserved is supposed to be
used as a mask field.  Should a kernel that doesn't understand any
mask components make sure that this reserved field is 0, and return an
error if it isn't?  I don't see any code to do that, and it seems to
risk new userspace silently getting wrong answers on an old kernel.
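
For example, an old kernel that understands no mask bits could reject unknown
bits with something like the following (hypothetical sketch, not code from this
patch):

	struct ib_uverbs_cmd_hdr_ex hdr_ex;

	if (copy_from_user(&hdr_ex, buf, sizeof(hdr_ex)))
		return -EFAULT;

	/* Hypothetical: reject mask/reserved bits this kernel does not
	 * understand, so new userspace gets a clear error instead of
	 * silently wrong answers on an old kernel. */
	if (hdr_ex.cmd_hdr_reserved)
		return -EINVAL;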

Is there any reason not to name it as a mask field from the start?

 - R.


[PATCH] IB/qib: fix sparse warnings in qib_user_sdma.c

2013-08-13 Thread Mike Marciniszyn
Commit fd1b423 (IB/qib: Improve SDMA performance) introduced
some new sparse warnings.

This patch eliminates all sparse warnings in this file.

Reported-by: Fengguang Wu fengguang...@intel.com
Reviewed-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
---
 drivers/infiniband/hw/qib/qib_user_sdma.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 0544166..d0a0ea0 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -222,7 +222,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
   struct page *page, u16 put,
   u16 offset, u16 len, void *kvaddr)
 {
-   u16 *pbc16;
+   __le16 *pbc16;
void *pbcvaddr;
struct qib_message_header *hdr;
u16 newlen, pbclen, lastdesc, dma_mapped;
@@ -610,7 +610,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
(PAGE_SIZE - fofs) : tlen;
 
ret = qib_user_sdma_page_to_frags(dd, pq, pkt,
-   pages[i], 1, fofs, flen, 0);
+   pages[i], 1, fofs, flen, NULL);
if (ret < 0) {
/* current page has beed taken
 * care of inside above call.
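
For context, the two kinds of warnings being silenced here are sparse's
endianness checks (values stored through a __le16 pointer must go through
cpu_to_le16()/le16_to_cpu(), which sparse can only verify when the pointer is
typed __le16) and its "Using plain integer as NULL pointer" complaint about the
0 previously passed as the kvaddr argument. A toy illustration (not driver
code), checked by building with "make C=1":

#include <linux/types.h>
#include <asm/byteorder.h>

static void fill_pbc(__le16 *pbc16, u16 host_len)
{
	/* With a plain u16 pointer, sparse cannot verify that the value
	 * stored here really went through a cpu_to_le16() conversion. */
	pbc16[0] = cpu_to_le16(host_len);
}

static int have_kvaddr(const void *kvaddr)
{
	/* Passing a literal 0 instead of NULL for this pointer parameter
	 * is what triggers "Using plain integer as NULL pointer". */
	return kvaddr != NULL;
}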



RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-13 Thread Hefty, Sean
  I found a workaround for my (our) problem: in the librdmacm
  code, rsocket.c, there is a global constant polling_time, which
  is set to 10 microseconds at the moment.
 
  I raise this to 10000 - and all of a sudden things work nicely.
 
 I am adding the linux-rdma list to CC so Sean might see this.
 
 If I understand what you are describing, the caller to rpoll() spins for up to
 10 ms (10,000 us) before calling the real poll().
 
 What is the purpose of the real poll() call? Is it simply a means to block the
 caller and avoid spinning? Or does it actually expect to detect an event?

When the real poll() is called, an event is expected on an fd associated with 
the CQ's completion channel.
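
For reference, this is the usual libibverbs completion-channel pattern behind
that fd (a simplified sketch with error handling trimmed and the CQ/channel
setup assumed; it is not the rsockets code itself):

#include <poll.h>
#include <infiniband/verbs.h>

static int wait_for_cq_event(struct ibv_comp_channel *channel,
			     struct ibv_cq *cq, int timeout_ms)
{
	struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };
	struct ibv_cq *ev_cq;
	void *ev_ctx;
	int ret;

	/* Arm the CQ; completions that arrived before this call do not
	 * generate an event, which is exactly the kind of gap that can
	 * leave a caller sleeping for the full timeout. */
	if (ibv_req_notify_cq(cq, 0))
		return -1;

	ret = poll(&pfd, 1, timeout_ms);	/* the "real" poll() */
	if (ret <= 0)
		return ret;			/* 0 = timeout, <0 = error */

	if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
		return -1;
	ibv_ack_cq_events(ev_cq, 1);
	return 1;				/* caller drains ibv_poll_cq() next */
}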
 
  I think we are looking at two issues here:
  1. the thread structure of ceph messenger
For a given socket connection, there are 3 threads of interest
here: the main messenger thread, the Pipe::reader and the
Pipe::writer.
 
For a ceph client like the ceph admin command, I see the following
sequence
  - the connection to the ceph monitor is created by the
main messenger  thread, the Pipe::reader and
Pipe::writer are instantiated.
  - the requested command is sent to the ceph monitor, the
answer is read and printed
  - at this point the Pipe::reader already has called
tcp_read_wait(), polling for more data or connection termination
  - after the response had been printed, the main loop calls the
  shutdown routines which in turn shutdown() the socket
 
 There is some time between the last two steps - and this gap is
 long enough to open a race:
 
  2. rpoll, ibv and poll
the rpoll implementation in rsockets is split in 2 phases:
- a busy loop which checks the state of the underlying ibv queue pair
- the call to real poll() system call (i.e. the uverbs(?)
  implementation of poll() inside the kernel)
 
The busy loop has a maximum duration of polling_time (10 microseconds
    by default) - and is able to detect the local shutdown and returns a
POLLHUP.
 
The poll() system call (i.e. the uverbs implementation of poll()
in the kernel) does not detect the local shutdown - and only returns
after the caller supplied timeout expires.

It sounds like there's an issue here either with a message getting lost or a 
race.  Given that spinning longer works for you, it sounds like an event is 
getting lost, not being generated correctly, or not being configured to be 
generated.

  Increasing the rsockets polling_time from 10 to 10000 microseconds
  results in rpoll detecting the local shutdown within the busy loop.
 
  Decreasing the ceph ms tcp read timeout from the default of 900 to 5
  seconds serves a similar purpose, but is much coarser.
 
  From my understanding, the real issue is neither at the ceph nor at the
  rsockets level: it is related to the uverbs kernel module.
 
  An alternative way to address the current problem at the rsockets level
  would be a re-write of the rpoll(): instead of the busy loop at the
  beginning followed by the real poll() call with the full user-specified
  timeout value (ms tcp read timeout in our case), I would embed the real
  poll() into a loop, splitting the user-specified timeout into smaller
  portions and doing the rsockets-specific rs_poll_check() on every timeout
  of the real poll().
 
 I have not looked at the rsocket code, so take the following with a grain of
 salt. If the purpose of the real poll() is to simply block the user for a
 specified time, then you can simply make it a short duration (taking into
 consideration what granularity the OS provides) and then call ibv_poll_cq().
 Keep in mind, polling will prevent your CPU from reducing power.
 
 If the real poll() is actually checking for something (e.g. checking on the
 RDMA channel's fd or the IB channel's fd), then you may not want to spin too
 much.

The real poll() call is intended to block the application until a timeout 
occurs or an event shows up.  Since increasing the spin time works for you, it 
makes me suspect that there is a bug in the CQ event handling in rsockets.
 
  What's particularly weird is that the monitor receives a POLLHUP
  event when the ceph command shuts down its socket but the ceph
  command never does. When using regular sockets both sides of the
  connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
  sockets are shut down. It would seem like there is a bug in rsockets
  that causes the side that calls shutdown first not to receive the
  correct rpoll events.

rsockets does not support POLLRDHUP.

- Sean