Re: [openib-general] Re: IBM eHCA testing..
On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote:
> I'm also attaching part of an opensm log file.
>
> (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log )
>
> The IBM galaxy adapters are at:
> Initial path: [0][1][16]
> Initial path: [0][1][13]

OpenSM is just saying that an SMP transaction it issued (in this case, SM Get P_KeyTable) is timing out: no response made it back to OpenSM.

BTW, what svn rev is OpenSM up to?

-- Hal
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Helen> Not in realtime. My observations were made after the fact.
Helen> I suppose I can launch another test and watch the counter in
Helen> realtime if you believe that is necessary?

That might be interesting.

Assuming the HCA continues to work fine, and IPoIB recovers, the only theory I can come up with is that something is causing interrupts to be held off for a long time, so the IPoIB driver doesn't get to see sends completing. But I don't know what such a workload might be. Perhaps something else you're running (Lustre? iSCSI?) holds a lock for a long time and causes the timeout. But it's not clear to me why the TX watchdog would get to run if the interrupt handler doesn't get to run.

- R.
Re: [openib-general] Re: [PATCH] [SA Query] Change sa_query MAD allocation
Roland Dreier wrote:
> Thanks, I'll read this over. What's the motivation here? To shift
> over to ib_create_send_mad() so that all the MAD-related DMA mapping
> stuff is in one place, to make it easier to fix?

Yes - the motivation is to fix the DMA mapping issue that you pointed out by changing ib_post_send_mad() to take an ib_mad_send_buf as input.

There are three places that I see where ib_post_send_mad() is called without using ib_create_send_mad(): sa_query, mthca_mad, and agent. (Their implementation pre-dates the call.) My intent was to patch each of these separately to use ib_create_send_mad(), then apply a patch to convert the API.

If the API does not change to take an ib_mad_send_buf, then it's your call whether to apply the patch.

- Sean
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Roland,

>From [EMAIL PROTECTED] Thu Oct 13 16:19:30 2005
>
>Helen> BTW, the state of the IPoIB network seemed fine after the
>Helen> failed test, and the mthca counters are moving up nicely.
>
>Even on the server on3-ib?

Yes, even on the server on3-ib.

>
>Helen> Do you still think this is a crash of the HCA firmware?
>Helen> Should I call Mellanox?
>
>Not if IPoIB is working on the systems printing the TX time out
>messages. However, if everything stops working on one of your
>systems, then yes, an HCA crash is likely.
>
>I'm still unclear on what is happening. Do you see TX time
>out messages on a particular server, but IPoIB and mthca counters
>still work fine on that same server? Or is it just the rest of the
>fabric that continues working?
>

Not in realtime. My observations were made after the fact. I suppose I can launch another test and watch the counter in realtime if you believe that is necessary?

>Thanks,
> Roland

Thank you so much for the speedy fix. I will apply the patch and stress test it as soon as possible.

Helen :-)
[openib-general] Re: [PATCH] [SA Query] Change sa_query MAD allocation
Thanks, I'll read this over. What's the motivation here? To shift over to ib_create_send_mad() so that all the MAD-related DMA mapping stuff is in one place, to make it easier to fix?

- R.
[openib-general] [PATCH] [SA Query] Change sa_query MAD allocation
This patch changes sa_query to allocate MADs using the ib_create_send_mad() routine. The intent behind this change was to eventually change ib_post_send_mad() to take an ib_mad_send_buf as input, but see the "DMA mapping abuses in MAD layer" thread. We may want to go with an alternate solution. However, I'm posting the patch since it's usable even without changes to ib_post_send_mad().

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

Index: sa_query.c
===
--- sa_query.c (revision 3692)
+++ sa_query.c (working copy)
@@ -74,9 +74,8 @@ struct ib_sa_query {
 	void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *);
 	void (*release)(struct ib_sa_query *);
 	struct ib_sa_port *port;
-	struct ib_sa_mad *mad;
+	struct ib_mad_send_buf *mad_buf;
 	struct ib_sa_sm_ah *sm_ah;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
 	int id;
 };
@@ -426,6 +425,7 @@ void ib_sa_cancel_query(int id, struct i
 {
 	unsigned long flags;
 	struct ib_mad_agent *agent;
+	u64 wr_id;
 	spin_lock_irqsave(&idr_lock, flags);
 	if (idr_find(&query_idr, id) != query) {
@@ -433,9 +433,10 @@ void ib_sa_cancel_query(int id, struct i
 		return;
 	}
 	agent = query->port->agent;
+	wr_id = (unsigned long) query->mad_buf;
 	spin_unlock_irqrestore(&idr_lock, flags);
-	ib_cancel_mad(agent, id);
+	ib_cancel_mad(agent, wr_id);
 }
 EXPORT_SYMBOL(ib_sa_cancel_query);
@@ -455,73 +456,51 @@ static void init_mad(struct ib_sa_mad *m
 	spin_unlock_irqrestore(&tid_lock, flags);
 }
+static void acquire_ah(struct ib_sa_port *port, struct ib_sa_query *query)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->ah_lock, flags);
+	kref_get(&port->sm_ah->ref);
+	query->sm_ah = port->sm_ah;
+	spin_unlock_irqrestore(&port->ah_lock, flags);
+}
+
 static int send_mad(struct ib_sa_query *query, int timeout_ms)
 {
 	struct ib_sa_port *port = query->port;
+	struct ib_send_wr *bad_wr;
 	unsigned long flags;
-	int ret;
-	struct ib_sge gather_list;
-	struct ib_send_wr *bad_wr, wr = {
-		.opcode = IB_WR_SEND,
-		.sg_list = &gather_list,
-		.num_sge = 1,
-		.send_flags = IB_SEND_SIGNALED,
-		.wr = {
-			.ud = {
-				.mad_hdr = &query->mad->mad_hdr,
-				.remote_qpn = 1,
-				.remote_qkey = IB_QP1_QKEY,
-				.timeout_ms = timeout_ms,
-			}
-		}
-	};
+	int ret, id;
 retry:
 	if (!idr_pre_get(&query_idr, GFP_ATOMIC))
 		return -ENOMEM;
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &query->id);
+	ret = idr_get_new(&query_idr, query, &id);
 	spin_unlock_irqrestore(&idr_lock, flags);
 	if (ret == -EAGAIN)
 		goto retry;
 	if (ret)
 		return ret;
-	wr.wr_id = query->id;
-
-	spin_lock_irqsave(&port->ah_lock, flags);
-	kref_get(&port->sm_ah->ref);
-	query->sm_ah = port->sm_ah;
-	wr.wr.ud.ah = port->sm_ah->ah;
-	spin_unlock_irqrestore(&port->ah_lock, flags);
-
-	gather_list.addr = dma_map_single(port->agent->device->dma_device,
-					  query->mad,
-					  sizeof (struct ib_sa_mad),
-					  DMA_TO_DEVICE);
-	gather_list.length = sizeof (struct ib_sa_mad);
-	gather_list.lkey = port->agent->mr->lkey;
-	pci_unmap_addr_set(query, mapping, gather_list.addr);
+	query->mad_buf->send_wr.wr.ud.timeout_ms = timeout_ms;
+	query->mad_buf->context[0] = query;
+	query->id = id;
-	ret = ib_post_send_mad(port->agent, &wr, &bad_wr);
+	ret = ib_post_send_mad(port->agent, &query->mad_buf->send_wr, &bad_wr);
 	if (ret) {
-		dma_unmap_single(port->agent->device->dma_device,
-				 pci_unmap_addr(query, mapping),
-				 sizeof (struct ib_sa_mad),
-				 DMA_TO_DEVICE);
-		kref_put(&query->sm_ah->ref, free_sm_ah);
 		spin_lock_irqsave(&idr_lock, flags);
-		idr_remove(&query_idr, query->id);
+		idr_remove(&query_idr, id);
 		spin_unlock_irqrestore(&idr_lock, flags);
 	}
 	/*
 	 * It's not safe to dereference query any more, because the
 	 * send may already have completed and freed the query in
-	 * another context. So use wr.wr_id, which has a copy of the
-	 * query's id.
+	 * another context.
 	 */
-	return ret ? ret
Re: [openib-general] DMA mapping abuses in MAD layer
Sean> Any preference to pursuing this change or modifying
Sean> ib_post_send_mad to take an ib_mad_send_buf?

I think it's going to be confusing to cast a virtual address to a long and then ignore the lkey field. So I would go with a new interface not built on ib_sge.

On the other hand, maybe struct sg_list is what we should be using?? (Just thinking out loud here, so to speak)

- R.
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Helen> BTW, the state of the IPoIB network seemed fine after the
Helen> failed test, and the mthca counters are moving up nicely.

Even on the server on3-ib?

Helen> Do you still think this is a crash of the HCA firmware?
Helen> Should I call Mellanox?

Not if IPoIB is working on the systems printing the TX time out messages. However, if everything stops working on one of your systems, then yes, an HCA crash is likely.

I'm still unclear on what is happening. Do you see TX time out messages on a particular server, but IPoIB and mthca counters still work fine on that same server? Or is it just the rest of the fabric that continues working?

Thanks,
  Roland
Re: [openib-general] Re: IBM eHCA testing..
Thanks. It's strange the copy-paste gave an extra 1.

Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] DMA mapping abuses in MAD layer
Sean Hefty wrote:
> Does anyone else have any other ideas on how to fix this issue?

The current MAD interface requires the user to have code similar to this:

	send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device,
					    buf, buf_size, DMA_TO_DEVICE);
	pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr);

This is consistent with how an ib_send_wr would be formatted for other QPs. Another possibility, however, is to let the user do:

	send_buf->sge.addr = (unsigned long) buf;

And then have the MAD layer perform the mapping/unmapping immediately before and after posting to the QP. This keeps the syntax of the current interface, but still requires user changes.

Any preference to pursuing this change or modifying ib_post_send_mad to take an ib_mad_send_buf?

- Sean
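The second option above (the caller passes a plain virtual address and the MAD layer owns the DMA mapping) can be illustrated with a small user-space sketch. Everything here is hypothetical: the demo_* names are stand-ins invented for illustration, not kernel or MAD-layer APIs, and the "bus address" is just the pointer plus an offset. The sketch also assumes the mapping is released at send completion, or right away if the post fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel DMA API; not real kernel calls. */
typedef uint64_t demo_dma_addr_t;

static int demo_live_mappings;	/* buffers currently mapped for the device */

demo_dma_addr_t demo_dma_map(void *buf, size_t len)
{
	(void) len;
	demo_live_mappings++;
	/* Pretend the bus address is the CPU address plus a fixed offset. */
	return (demo_dma_addr_t) (uintptr_t) buf + 0x1000;
}

void demo_dma_unmap(demo_dma_addr_t addr, size_t len)
{
	(void) addr;
	(void) len;
	demo_live_mappings--;
}

/* The caller supplies only a virtual address and a length. */
struct demo_send_buf {
	void *buf;
	size_t len;
	demo_dma_addr_t mapping;	/* owned by the layer, not the caller */
};

/* Post a send: the layer maps the buffer and unmaps it again on failure. */
int demo_post_send_mad(struct demo_send_buf *sb, int simulate_post_failure)
{
	assert(sb != NULL && sb->buf != NULL);
	sb->mapping = demo_dma_map(sb->buf, sb->len);
	if (simulate_post_failure) {
		demo_dma_unmap(sb->mapping, sb->len);
		return -1;
	}
	return 0;
}

/* Send completion: the layer, not the caller, tears the mapping down. */
void demo_send_done(struct demo_send_buf *sb)
{
	demo_dma_unmap(sb->mapping, sb->len);
}

/* Number of mappings not yet released; should be 0 when idle. */
int demo_mappings_outstanding(void)
{
	return demo_live_mappings;
}
```

The point of the design is that every map has exactly one matching unmap inside the layer, so callers such as sa_query no longer need to carry their own unmap bookkeeping.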
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Roland,

So you are right, it is not a moving target. After repeating the IOZONE tests several times, I narrowed the culprit down to server on3-ib. Parallel I/O had made it a bit difficult to chase it down :-(

BTW, the state of the IPoIB network seemed fine after the failed test, and the mthca counters are moving up nicely.

Do you still think this is a crash of the HCA firmware? Should I call Mellanox?

Thanks,
Helen

-- Original Message -
>From [EMAIL PROTECTED] Thu Oct 13 15:13:16 2005
>
>Helen> It doesn't seem like shrinking the TCP window had helped.
>Helen> I captured the Dmesg log from Lustre server and associated
>Helen> client reporting IOZONE error.
>
>What is the state of the system after you start seeing the ib0
>transmit time out messages? Does IPoIB work at all? Is the HCA
>responsive at all -- for example what do you see if you do
>
> cat /sys/class/infiniband/mthca0/ports/1/state
>
>or
>
> cat /sys/class/infiniband/mthca0/ports/1/counters/*
>
>Helen> BTW, this problem is a moving target so it is hard to
>Helen> believe that it is hardware related(?) BTW, I am using the
>Helen> mellanox DDR switch and HCA.
>
>Not sure what you mean by a moving target... the symptoms really look
>like a crash of the HCA firmware to me.
>
>Thanks,
> Roland
>
[PATCH, please test] IPoIB: recycle RX bufs (was: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer)
Roland> My plan is to change the receive handling of IPoIB
Roland> slightly, so that if it can't allocate a new receive
Roland> buffer, it reposts the old buffer and drops the packet it
Roland> just received.

Here's a patch that changes IPoIB to use this scheme. This should be much more robust when the system gets low on GFP_ATOMIC memory.

I'd appreciate it if people could stress test and benchmark this. It works well for me, but I'm wondering if this patch has any effect on performance (either better or worse).

Helen, it would be especially interesting if you could run your test with this patch and without increasing min_free_kbytes, since you are able to reproduce GFP_ATOMIC failures. I'd be curious to know what you see in /sys/class/net/ib0/statistics/rx_dropped after running the test.

Thanks,
  Roland

--- infiniband/ulp/ipoib/ipoib_main.c (revision 3707)
+++ infiniband/ulp/ipoib/ipoib_main.c (working copy)
@@ -729,7 +729,7 @@ int ipoib_dev_init(struct net_device *de
 	/* Allocate RX/TX "rings" to hold queued skbs */
-	priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf),
+	priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf),
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
@@ -737,9 +737,9 @@ int ipoib_dev_init(struct net_device *de
 		goto out;
 	}
 	memset(priv->rx_ring, 0,
-	       IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf));
+	       IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf));
-	priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf),
+	priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf),
 				GFP_KERNEL);
 	if (!priv->tx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
@@ -747,7 +747,7 @@ int ipoib_dev_init(struct net_device *de
 		goto out_rx_ring_cleanup;
 	}
 	memset(priv->tx_ring, 0,
-	       IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf));
+	       IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf));
 	/* priv->tx_head & tx_tail are already 0 */
--- infiniband/ulp/ipoib/ipoib.h (revision 3726)
+++ infiniband/ulp/ipoib/ipoib.h (working copy)
@@ -100,7 +100,12 @@ struct ipoib_pseudoheader {
 struct ipoib_mcast;
-struct ipoib_buf {
+struct ipoib_rx_buf {
+	struct sk_buff *skb;
+	dma_addr_t mapping;
+};
+
+struct ipoib_tx_buf {
 	struct sk_buff *skb;
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 };
@@ -150,14 +155,14 @@ struct ipoib_dev_priv {
 	unsigned int admin_mtu;
 	unsigned int mcast_mtu;
-	struct ipoib_buf *rx_ring;
+	struct ipoib_rx_buf *rx_ring;
-	spinlock_t tx_lock;
-	struct ipoib_buf *tx_ring;
-	unsigned tx_head;
-	unsigned tx_tail;
-	struct ib_sge tx_sge;
-	struct ib_send_wr tx_wr;
+	spinlock_t tx_lock;
+	struct ipoib_tx_buf *tx_ring;
+	unsigned tx_head;
+	unsigned tx_tail;
+	struct ib_sge tx_sge;
+	struct ib_send_wr tx_wr;
 	struct ib_wc ibwc[IPOIB_NUM_WC];
--- infiniband/ulp/ipoib/ipoib_ib.c (revision 3726)
+++ infiniband/ulp/ipoib/ipoib_ib.c (working copy)
@@ -95,57 +95,65 @@ void ipoib_free_ah(struct kref *kref)
 	}
 }
-static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv,
-				   unsigned int wr_id,
-				   dma_addr_t addr)
-{
-	struct ib_sge list = {
-		.addr = addr,
-		.length = IPOIB_BUF_SIZE,
-		.lkey = priv->mr->lkey,
-	};
-	struct ib_recv_wr param = {
-		.wr_id = wr_id | IPOIB_OP_RECV,
-		.sg_list = &list,
-		.num_sge = 1,
-	};
+static int ipoib_ib_post_receive(struct net_device *dev, int id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sge list;
+	struct ib_recv_wr param;
 	struct ib_recv_wr *bad_wr;
+	int ret;
+
+	list.addr = priv->rx_ring[id].mapping;
+	list.length = IPOIB_BUF_SIZE;
+	list.lkey = priv->mr->lkey;
+
+	param.next = NULL;
+	param.wr_id = id | IPOIB_OP_RECV;
+	param.sg_list = &list;
+	param.num_sge = 1;
+
+	ret = ib_post_recv(priv->qp, &param, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
+		dma_unmap_single(priv->ca->dma_device,
+				 priv->rx_ring[id].mapping,
+				 IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+		dev_kfree_skb_any(priv->rx_ring[id].skb);
+		priv->rx_ri
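The recycling scheme described at the top of this message (if no replacement buffer can be allocated, repost the old one and drop the packet) can be sketched in user space as follows. This is only an illustration under stated assumptions: the rx_* names, the ring size, and the counters are hypothetical and are not taken from the IPoIB driver; a NULL new_buf models a failed GFP_ATOMIC allocation.

```c
#include <assert.h>
#include <stddef.h>

#define RX_RING_SIZE 4	/* hypothetical ring size for this sketch */

struct rx_slot {
	void *buf;	/* buffer currently posted in this slot */
};

struct rx_ring {
	struct rx_slot slot[RX_RING_SIZE];
	unsigned long rx_dropped;	/* packets dropped for lack of memory */
	unsigned long rx_delivered;	/* packets handed up the stack */
};

/*
 * Handle a receive completion for slot `id`.  `new_buf` is the result of
 * the caller's allocation attempt (NULL models an allocation failure).
 * Returns the received buffer to pass up the stack, or NULL if the packet
 * was dropped and the old buffer stays posted in the slot.
 */
void *rx_complete(struct rx_ring *ring, int id, void *new_buf)
{
	void *old;

	assert(ring != NULL && id >= 0 && id < RX_RING_SIZE);
	old = ring->slot[id].buf;

	if (new_buf == NULL) {
		/* No memory: recycle the old buffer and drop this packet. */
		ring->rx_dropped++;
		return NULL;
	}

	/* Normal path: the new buffer takes the slot, the old one goes up. */
	ring->slot[id].buf = new_buf;
	ring->rx_delivered++;
	return old;
}
```

With this shape the ring never loses a posted buffer under memory pressure; the cost of an allocation failure is one dropped packet, which is what a counter such as rx_dropped would then reflect.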
Re: [openib-general] Re: IBM eHCA testing..
> http://ozlabs.org/pipermail/linuxppc64-dev/2005-July/004662.html1

delete the '1' from the end of the URL...

- R.
Re: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
Robert> Since the rest of the patch needed to get this working
Robert> isn't applied to either the trunk or the ipath branch yet
Robert> (and since the branch will be going away shortly), can you
Robert> just apply this patch to the trunk when you do the merge?

Sure, no problem.

- R.
Re: [openib-general] Re: IBM eHCA testing..
I am not sure whether this is related to dma_addr_t. Could you please try the patch below?

> http://ozlabs.org/pipermail/linuxppc64-dev/2005-July/004662.html1

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
> And here's a patch to ipath to make it work with the uverbs command mask...

Roland,

Since the rest of the patch needed to get this working isn't applied to either the trunk or the ipath branch yet (and since the branch will be going away shortly), can you just apply this patch to the trunk when you do the merge?

Regards,
Robert.

--
Robert Walsh                              Email: [EMAIL PROTECTED]
PathScale, Inc.                           Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200              Fax: +1 650 428 1969
Mountain View, CA 94043
Re: [openib-general] Re: IBM eHCA testing..
On Wed, Oct 12, 2005 at 01:04:37PM +0200, IBMEHCA DD wrote: > I just released the ehca2_0028 which uses svn 3615 on > https://sourceforge.net/projects/ibmehcad/ > As you might notice the license already has changed to the openib.org > license. > > With 2.6.13 we had the non-issue that our maun focus was on 2.6.5-7.191 > and we're only now moving to the latest kernel. I just built against svn 3774, and 2.6.13.3, with the timeout set to 120 seconds. There's some bad interaction going on with OpenSM. p5l2:~# modprobe hcad_mod ehca_nr_ports=1 [ 6186.855237] eBus Device Driver [ 6186.907578] eHCA Infiniband Device Driver (Rel.: EHCA2_0028) [ 6186.912203] xics_enable_irq: irq=36868: ibm_int_on returned fffd p5l2:~# modprobe ib_ipoib hang for awhile.. entries appear in osm.log *** [ 6309.683651] PU0003 00060103:ehca_parse_ec EHCA port 1 is available. [ 6310.253303] kernel BUG in dma_map_single at arch/ppc64/kernel/dma.c:86! [ 6310.253320] Oops: Exception in kernel mode, sig: 5 [#1] [ 6310.253339] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 6310.253364] Modules linked in: ib_mad hcad_mod ib_core ebus [ 6310.253383] NIP: C000FA10 XER: 0020 LR: C000F9B0 CTR: C000F980 [ 6310.253400] REGS: cf3bb770 TRAP: 0700 Not tainted (2.6.13.3-power5) [ 6310.253421] MSR: 80029032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 24002444 [ 6310.253436] DAR: DSISR: [ 6310.253471] TASK: c209f060[1874] 'modprobe' THREAD: cf3b8000CPU: 7 [ 6310.253492] GPR00: C04B3660 CF3BB9F0 C05EE948 C001DBEC5C18 [ 6310.253513] GPR04: C003CB5B1D0C 0128 0002 0008 [ 6310.253532] GPR08: C003CBD5EEE8 CF67FC00 C000F980 [ 6310.253553] GPR12: D00621D0 C04B7800 10017078 [ 6310.253609] GPR16: 0001 0001 [ 6310.253665] GPR20: C8DE7800 0002 0001 CF67FDC8 [ 6310.253688] GPR24: CF67FD40 0002 C001DBEC5C18 0002 [ 6310.253708] GPR28: 0128 C003CB5B1D0C D006EB00 C003CB5B1C80 [ 6310.253731] NIP [c000fa10] .dma_map_single+0x90/0xc0 [ 6310.253753] LR [c000f9b0] .dma_map_single+0x30/0xc0 [ 6310.253778] Call Trace: [ 6310.253797] [cf3bb9f0] [c8de7800] 
0xc8de7800 (unreliable) [ 6310.253838] [cf3bba90] [d005aee8] .ib_mad_post_receive_mads+0xb8/0x270 [ib_mad] [ 6310.253880] [cf3bbb80] [d005c840] .ib_mad_init_device+0x350/0x660 [ib_mad] [ 6310.253905] [cf3bbc70] [d004d0bc] .ib_register_client+0xdc/0x150 [ib_core] [ 6310.253936] [cf3bbd00] [d0061e6c] .ib_mad_init_module+0x8c/0xf0 [ib_mad] [ 6310.253999] [cf3bbd90] [c0070720] .sys_init_module+0x1e0/0x4d0 [ 6310.254030] [cf3bbe30] [c000d300] syscall_exit+0x0/0x18 [ 6310.254045] Instruction dump: [ 6310.254053] 4e800421 e8410028 382100a0 e8010010 eb41ffd0 eb61ffd8 eb81ffe0 eba1ffe8 [ 6310.254089] 7c0803a6 4e800020 6000 6000 <0fe0> 382100a0 3860e8010010 [ 6310.254206] Segmentation fault I'm also attaching part of an opensm log file. (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) The IBM galaxy adapters are at: Initial path: [0][1][16] Initial path: [0][1][13] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 13 10:42:05 978875 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping. Oct 13 10:42:05 978883 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0 Oct 13 10:42:05 978892 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). Oct 13 10:42:05 978925 [42FFF970] -> SMP dump: base_ver0x1 mgmt_class..0x81 class_ver...0x1 method..0x1 (SubnGet) D bit...0x0 status..0x0 hop_ptr.0x0 hop_count...0x2 trans_id0x1810 attr_id.0x16 (P_KeyTable) resv0x0 attr_mod0x3E m_key...0x dr_slid.0x dr_dlid.
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Helen> It doesn't seem like shrinking the TCP window had helped.
Helen> I captured the Dmesg log from Lustre server and associated
Helen> client reporting IOZONE error.

What is the state of the system after you start seeing the ib0 transmit time out messages? Does IPoIB work at all? Is the HCA responsive at all -- for example what do you see if you do

    cat /sys/class/infiniband/mthca0/ports/1/state

or

    cat /sys/class/infiniband/mthca0/ports/1/counters/*

Helen> BTW, this problem is a moving target so it is hard to
Helen> believe that it is hardware related(?) BTW, I am using the
Helen> mellanox DDR switch and HCA.

Not sure what you mean by a moving target... the symptoms really look like a crash of the HCA firmware to me.

Thanks,
  Roland
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Roland, It doesn't seem like shrinking the TCP window had helped. I captured the Dmesg log from Lustre server and associated client reporting IOZONE error. BTW, this problem is a moving target so it is hard to believe that it is hardware related(?) BTW, I am using the mellanox DDR switch and HCA. Thanks, Helen --- Dmesg from Lustre server -- NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 4638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 5638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 6638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 7638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 8638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 9638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 10638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 11638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 12638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 13638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 14638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 15638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 16638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 17638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 18638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 19638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 20638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 21638 NETDEV WATCHDOG: ib0: 
transmit timed out ib0: transmit timeout: latency 22638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 23638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 24638 LustreError: 12471:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x20249/t0 o4->@:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0 LustreError: 12485:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.73-12345 LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x20359/t0 o4->@:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0 LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) previously skipped 1 similar messages LustreError: 12477:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.78-12345 LustreError: 12477:0:(filter.c:1728:filter_grant_sanity_check()) filter_disconnect: tot_granted 48570368 != fo_tot_granted 49618944 LustreError: 12477:0:(filter.c:1731:filter_grant_sanity_check()) filter_disconnect: tot_pending 7340032 != fo_tot_pending 8388608 Lustre: A connection with 192.168.2.80 timed out; the network or that node may be down. 
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80250 ip 192.168.2.80:1022 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 25638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 26638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 27638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 28638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 29638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 30638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 31638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 32638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 33638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 34638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 35638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 36638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 37638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 38638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 39638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 40638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 41638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 42638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 43638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 44638 NETDEV WATCHDOG: ib0: transmit timed out ib0: tra
Re: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
And here's a patch to ipath to make it work with the uverbs command mask...

Index: infiniband/hw/ipath/ib_ipath/ipath_openib.c
===
--- infiniband/hw/ipath/ib_ipath/ipath_openib.c (revision 3758)
+++ infiniband/hw/ipath/ib_ipath/ipath_openib.c (working copy)
@@ -5733,6 +5733,32 @@ static int ipath_register_ib_device(cons
 	strlcpy(dev->name, "infinipath_ib%d", IB_DEVICE_NAME_MAX);
 	dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION;
+	dev->uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_AH) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_AH) |
+		(1ull << IB_USER_VERBS_CMD_REG_MR) |
+		(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+		(1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+		(1ull << IB_USER_VERBS_CMD_POST_SEND) |
+		(1ull << IB_USER_VERBS_CMD_POST_RECV) |
+		(1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_DETACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV);
 	dev->node_type = IB_NODE_CA;
 	dev->phys_port_cnt = 1;
 	dev->dma_device = ipath_layer_get_pcidev(t);
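For readers unfamiliar with the command mask, the following is a minimal sketch of how a 64-bit mask like the one set in this patch could be tested before a userspace command is dispatched. It is an assumption about the general technique, not the actual uverbs dispatch code: the DEMO_* command numbers and demo_cmd_allowed() are hypothetical, and the real command values are defined in ib_user_verbs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command numbers, for illustration only. */
enum {
	DEMO_CMD_GET_CONTEXT  = 0,
	DEMO_CMD_QUERY_DEVICE = 1,
	DEMO_CMD_CREATE_CQ    = 2,
	DEMO_CMD_RESIZE_CQ    = 3
};

/*
 * Return 1 if the low-level driver advertised support for `cmd` in
 * `mask`, 0 otherwise.  Commands numbered 64 or above can never be
 * represented in a 64-bit mask, and shifting by that much would be
 * undefined behaviour, so they are rejected up front.
 */
int demo_cmd_allowed(uint64_t mask, unsigned int cmd)
{
	if (cmd >= 64)
		return 0;
	return (mask & (1ull << cmd)) != 0;
}
```

A driver that sets only the bits it implements, as the ipath patch does, would then have every other command refused before it reaches driver code.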
Re: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
OK, here's a new patch that adds a mask of allowed userspace commands set by the kernel low-level driver. Thanks, good catch Michael... - R. --- include/rdma/ib_user_verbs.h(revision 3707) +++ include/rdma/ib_user_verbs.h(working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -88,8 +89,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. + * - Make sure that any structure larger than 4 bytes is padded to a + *multiple of 8 bytes. Otherwise the structure size will be + *different between 32-bit and 64-bit architectures. 
*/ struct ib_uverbs_async_event_desc { @@ -261,6 +265,42 @@ struct ib_uverbs_create_cq_resp { __u32 cqe; }; +struct ib_uverbs_poll_cq { + __u64 response; + __u32 cq_handle; + __u32 ne; + __u64 wc; +}; + +struct ib_uverbs_wc { + __u64 wr_id; + __u32 status; + __u32 opcode; + __u32 vendor_err; + __u32 byte_len; + __u32 imm_data; + __u32 qp_num; + __u32 src_qp; + __u32 wc_flags; + __u16 pkey_index; + __u16 slid; + __u8 sl; + __u8 dlid_path_bits; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_poll_cq_resp { + __u32 count; + __u32 reserved; + struct ib_uverbs_wc wc[0]; +}; + +struct ib_uverbs_req_notify_cq { + __u32 cq_handle; + __u32 solicited_only; +}; + struct ib_uverbs_destroy_cq { __u64 response; __u32 cq_handle; @@ -358,6 +398,127 @@ struct ib_uverbs_destroy_qp_resp { __u32 events_reported; }; +/* + * Note: the ib_uverbs_sge structure isn't used anywhere, as the ib_sge + * structure is packed the same way on 32-bit and 64-bit architectures + * in both kernel and user space. It's just here to document the ABI. 
+ */ + +struct ib_uverbs_sge { + __u64 addr; + __u32 length; + __u32 lkey; +}; + +struct ib_uverbs_send_wr { + __u64 wr_id; + __u32 num_sge; + __u32 opcode; + __u32 send_flags; + __u32 imm_data; + union { + struct { + __u64 remote_addr; + __u32 rkey; + __u32 reserved; + } rdma; + struct { + __u64 remote_addr; + __u64 compare_add; + __u64 swap; + __u32 rkey; + __u32 reserved; + } atomic; + struct { + __u32 ah; + __u32 remote_qpn; + __u32 remote_qkey; + __u32 reserved; + } ud; + } wr; +}; + +struct ib_uverbs_post_send { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_send_wr send_wr[0]; +}; + +struct ib_uverbs_post_send_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_recv_wr { + __u64 wr_id; + __u32 num_sge; + __u32 reserved; +}; + +struct ib_uverbs_post_recv { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv_wr[0]; +}; + +struct ib_uverbs_post_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_post_srq_recv { + __u64 response; + __u32 srq_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv[0]; +}; + +struct ib_uverbs_post_srq_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ib_uverbs_ah_attr { + struct ib_uverbs_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_create_ah { + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 reserved; + struct ib_uverbs_ah_attr attr; +}; + +struct ib_uverbs_create_ah_resp { + __u32 ah_handle; +}; + +struct ib_uverbs_destroy_ah { + __u32 ah_handle; +}; + struct i
[openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
Michael> What prevents the user from passing e.g. poll cq command Michael> on mthca device? If that happens, it seems that Michael> ib_poll_cq will then crash. Michael> Is there a mask somewhere that lets the device specify Michael> which uverbs commands are allowed for it? Hmm, excellent point. A mask would be one way to avoid this -- let me think about whether there's a better way to handle this. Thanks, Roland
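As a rough illustration of the mask idea discussed above, here is a minimal user-space sketch in C. It is not the code from any patch in this thread: the names demo_cmd, demo_device and demo_check_cmd are hypothetical, and the choice of error codes is an assumption. The driver sets one bit per command it implements, and the dispatcher tests that bit before calling into the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command numbers, for illustration only. */
enum demo_cmd {
	DEMO_CMD_GET_CONTEXT,
	DEMO_CMD_CREATE_CQ,
	DEMO_CMD_POLL_CQ,
	DEMO_CMD_MAX
};

/* Hypothetical device: one bit per command the low-level driver allows. */
struct demo_device {
	uint64_t cmd_mask;
};

/*
 * Return 0 if the device allows cmd, -EINVAL if cmd is out of range,
 * and -ENOSYS if the driver did not set the bit for cmd.
 */
static int demo_check_cmd(const struct demo_device *dev, unsigned int cmd)
{
	assert(dev != NULL);

	if (cmd >= DEMO_CMD_MAX)
		return -EINVAL;
	if (!(dev->cmd_mask & (1ull << cmd)))
		return -ENOSYS;
	return 0;
}
```

With a check like this in front of the dispatch table, a device that never set the poll-CQ bit would get an error back instead of reaching a handler the driver cannot service.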
Re: [openib-general] [RFC] libibverbs changes for PathScale merge
Robert> Since qp_type is now in ibv_qp, it probably no longer Robert> needs to be in mthca_qp. This is just a minor Robert> optimization. Yep, I'll make that change too. - R.
Re: [openib-general] [RFC] libibverbs changes for PathScale merge
> @@ -488,6 +489,7 @@ struct ibv_qp {
> 	uint32_t		handle;
> 	uint32_t		qp_num;
> 	enum ibv_qp_state	state;
> +	enum ibv_qp_type	qp_type;
>
> 	pthread_mutex_t		mutex;
> 	pthread_cond_t		cond;

Since qp_type is now in ibv_qp, it probably no longer needs to be in mthca_qp. This is just a minor optimization.

Regards,
Robert.

--
Robert Walsh Email: [EMAIL PROTECTED]
PathScale, Inc. Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969
Mountain View, CA 94043
[openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: [RFC] Kernel uverbs changes for PathScale merge
>
> Here are the changes to the kernel part of userspace verbs required to
> support PathScale's driver. I'm now happy with them and ready to
> commit them to the svn trunk and queue them for 2.6.15. This will
> allow the PathScale hardware-specific driver to be moved to the trunk
> as well, although quite a bit of cleanup is necessary before merging
> the driver upstream.
>
> Does anyone have any comments on these changes before I commit?

What prevents the user from passing e.g. poll cq command on mthca device? If that happens, it seems that ib_poll_cq will then crash.

Is there a mask somewhere that lets the device specify which uverbs commands are allowed for it?

> --- infiniband/core/uverbs_cmd.c (revision 3707)
> +++ infiniband/core/uverbs_cmd.c (working copy)
> @@ -665,6 +665,93 @@ err:
> 	return ret;
> }
>
> +ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
> +			  const char __user *buf, int in_len,
> +			  int out_len)
> +{
> +	struct ib_uverbs_poll_cq cmd;
> +	struct ib_uverbs_poll_cq_resp *resp;
> +	struct ib_cq *cq;
> +	struct ib_wc *wc;
> +	int ret = 0;
> +	int i;
> +	int rsize;
> +
> +	if (copy_from_user(&cmd, buf, sizeof cmd))
> +		return -EFAULT;
> +
> +	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
> +	if (!wc)
> +		return -ENOMEM;
> +
> +	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
> +	resp = kmalloc(rsize, GFP_KERNEL);
> +	if (!resp) {
> +		ret = -ENOMEM;
> +		goto out_wc;
> +	}
> +
> +	down(&ib_uverbs_idr_mutex);
> +	cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle);
> +	if (!cq || cq->uobject->context != file->ucontext) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	resp->count = ib_poll_cq(cq, cmd.ne, wc);
> +
> +	for (i = 0; i < resp->count; i++) {
> +		resp->wc[i].wr_id = wc[i].wr_id;
> +		resp->wc[i].status = wc[i].status;
> +		resp->wc[i].opcode = wc[i].opcode;
> +		resp->wc[i].vendor_err = wc[i].vendor_err;
> +		resp->wc[i].byte_len = wc[i].byte_len;
> +		resp->wc[i].imm_data = wc[i].imm_data;
> +		resp->wc[i].qp_num = wc[i].qp_num;
> +		resp->wc[i].src_qp = wc[i].src_qp;
> +		resp->wc[i].wc_flags = wc[i].wc_flags;
> +		resp->wc[i].pkey_index = wc[i].pkey_index;
> +		resp->wc[i].slid = wc[i].slid;
> +		resp->wc[i].sl = wc[i].sl;
> +		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
> +		resp->wc[i].port_num = wc[i].port_num;
> +	}
> +
> +	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
> +		ret = -EFAULT;
> +
> +out:
> +	up(&ib_uverbs_idr_mutex);
> +	kfree(resp);
> +
> +out_wc:
> +	kfree(wc);
> +	return ret ? ret : in_len;
> +}

--
MST
[openib-general] [RFC] libibverbs changes for PathScale merge
Here are the changes to libibverbs required to support PathScale's driver. Again, I'm happy with them and would just like to get comments on them before I commit them to svn. Thanks, Roland --- libibverbs/include/infiniband/driver.h (revision 3774) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -92,6 +93,8 @@ extern int ibv_cmd_create_cq(struct ibv_ int comp_vector, struct ibv_cq *cq, struct ibv_create_cq *cmd, size_t cmd_size, struct ibv_create_cq_resp *resp, size_t resp_size); +extern int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); +extern int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited); extern int ibv_cmd_destroy_cq(struct ibv_cq *cq); extern int ibv_cmd_create_srq(struct ibv_pd *pd, @@ -111,6 +114,15 @@ extern int ibv_cmd_modify_qp(struct ibv_ enum ibv_qp_attr_mask attr_mask, struct ibv_modify_qp *cmd, size_t cmd_size); extern int ibv_cmd_destroy_qp(struct ibv_qp *qp); +extern int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, +struct ibv_send_wr **bad_wr); +extern int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, +struct ibv_recv_wr **bad_wr); +extern int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, +struct ibv_recv_wr **bad_wr); +extern int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, +struct ibv_ah_attr *attr); +extern int ibv_cmd_destroy_ah(struct ibv_ah *ah); extern int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); extern int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); --- libibverbs/include/infiniband/verbs.h (revision 3774) +++ 
libibverbs/include/infiniband/verbs.h (working copy) @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -488,6 +489,7 @@ struct ibv_qp { uint32_thandle; uint32_tqp_num; enum ibv_qp_state state; + enum ibv_qp_typeqp_type; pthread_mutex_t mutex; pthread_cond_t cond; @@ -513,6 +515,7 @@ struct ibv_cq { struct ibv_ah { struct ibv_context *context; struct ibv_pd *pd; + uint32_thandle; }; struct ibv_device; --- libibverbs/include/infiniband/kern-abi.h(revision 3774) +++ libibverbs/include/infiniband/kern-abi.h(working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -93,8 +94,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. + * - Make sure that any structure larger than 4 bytes is padded to a + *multiple of 8 bytes. Otherwise the structure size will be + *different between 32-bit and 64-bit architectures. 
*/ struct ibv_kern_async_event { @@ -298,6 +302,47 @@ struct ibv_create_cq_resp { __u32 cqe; }; +struct ibv_kern_wc { +__u64 wr_id; +__u32 status; +__u32 opcode; +__u32 vendor_err; +__u32 byte_len; +__u32 imm_data; +__u32 qp_num; +__u32 src_qp; +__u32 wc_flags; +__u16 pkey_index; +__u16 slid; +__u8 sl; +__u8 dlid_path_bits; + __u8 port_num; + __u8 reserved; +}; + +struct ibv_poll_cq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 cq_handle; + __u32 ne; +};
[openib-general] [RFC] Kernel uverbs changes for PathScale merge
Here are the changes to the kernel part of userspace verbs required to support PathScale's driver. I'm now happy with them and ready to commit them to the svn trunk and queue them for 2.6.15. This will allow the PathScale hardware-specific driver to be move to the trunk as well, although quite a bit of cleanup is necessary before merging the driver upstream. Does anyone have any comments on these changes before I commit? Thanks, Roland --- infiniband/include/rdma/ib_user_verbs.h (revision 3707) +++ infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -88,8 +89,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. + * - Make sure that any structure larger than 4 bytes is padded to a + *multiple of 8 bytes. Otherwise the structure size will be + *different between 32-bit and 64-bit architectures. 
*/ struct ib_uverbs_async_event_desc { @@ -261,6 +265,42 @@ struct ib_uverbs_create_cq_resp { __u32 cqe; }; +struct ib_uverbs_poll_cq { + __u64 response; + __u32 cq_handle; + __u32 ne; + __u64 wc; +}; + +struct ib_uverbs_wc { + __u64 wr_id; + __u32 status; + __u32 opcode; + __u32 vendor_err; + __u32 byte_len; + __u32 imm_data; + __u32 qp_num; + __u32 src_qp; + __u32 wc_flags; + __u16 pkey_index; + __u16 slid; + __u8 sl; + __u8 dlid_path_bits; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_poll_cq_resp { + __u32 count; + __u32 reserved; + struct ib_uverbs_wc wc[0]; +}; + +struct ib_uverbs_req_notify_cq { + __u32 cq_handle; + __u32 solicited_only; +}; + struct ib_uverbs_destroy_cq { __u64 response; __u32 cq_handle; @@ -358,6 +398,127 @@ struct ib_uverbs_destroy_qp_resp { __u32 events_reported; }; +/* + * Note: the ib_uverbs_sge structure isn't used anywhere, as the ib_sge + * structure is packed the same way on 32-bit and 64-bit architectures + * in both kernel and user space. It's just here to document the ABI. 
+ */ + +struct ib_uverbs_sge { + __u64 addr; + __u32 length; + __u32 lkey; +}; + +struct ib_uverbs_send_wr { + __u64 wr_id; + __u32 num_sge; + __u32 opcode; + __u32 send_flags; + __u32 imm_data; + union { + struct { + __u64 remote_addr; + __u32 rkey; + __u32 reserved; + } rdma; + struct { + __u64 remote_addr; + __u64 compare_add; + __u64 swap; + __u32 rkey; + __u32 reserved; + } atomic; + struct { + __u32 ah; + __u32 remote_qpn; + __u32 remote_qkey; + __u32 reserved; + } ud; + } wr; +}; + +struct ib_uverbs_post_send { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_send_wr send_wr[0]; +}; + +struct ib_uverbs_post_send_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_recv_wr { + __u64 wr_id; + __u32 num_sge; + __u32 reserved; +}; + +struct ib_uverbs_post_recv { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv_wr[0]; +}; + +struct ib_uverbs_post_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_post_srq_recv { + __u64 response; + __u32 srq_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv[0]; +}; + +struct ib_uverbs_post_srq_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ib_uverbs_ah_attr { + struct ib_uverbs_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved;
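The padding rule stated in the header comment of this patch (pad any structure larger than 4 bytes to a multiple of 8 bytes) can be checked at compile time. The following is only a sketch under stated assumptions: demo_unpadded and demo_padded are hypothetical structs, not part of the patch, and uint64_t/uint32_t stand in for __u64/__u32. On i386 a 64-bit member is only 4-byte aligned inside a struct, so a struct with 12 bytes of members is 12 bytes there but 16 bytes on x86-64; adding an explicit reserved field makes both layouts agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example of the problem: 12 bytes of members. The total
 * size depends on the ABI's alignment for uint64_t, so it can differ
 * between a 32-bit userspace and a 64-bit kernel.
 */
struct demo_unpadded {
	uint64_t response;
	uint32_t handle;
};

/*
 * Hypothetical fixed version: the explicit reserved field brings the
 * member total to 16 bytes, a multiple of 8, leaving no tail padding
 * for the compiler to decide on.
 */
struct demo_padded {
	uint64_t response;
	uint32_t handle;
	uint32_t reserved;
};

/* Compile-time checks (C11) that the padded layout is what we expect. */
_Static_assert(sizeof(struct demo_padded) == 16,
	       "demo_padded must be 16 bytes on every ABI");
_Static_assert(sizeof(struct demo_padded) % 8 == 0,
	       "structures larger than 4 bytes must be a multiple of 8");
```

The same kind of assertion could be placed next to each ABI struct so that a layout change is caught when the header is compiled rather than at run time.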
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Roland,

>From [EMAIL PROTECTED] Thu Oct 13 13:53:05 2005
>
>Helen> Roland, Thank you for your response. That fixed my initial
>Helen> buffer allocation failure. After we tuned the Lustre and
>Helen> reran same IOZONE tests again, we got the following
>Helen> problem. Was there an actual network interrupt? If so, the
>Helen> problem is not obvious now; the two nodes are pinging over
>Helen> IPoIB. Please advice.
>
>That's very odd. This message:
>
>Helen> NETDEV WATCHDOG: ib0: transmit timed out
>Helen> ib0: transmit timeout: latency 1846
>
>says that we are not seeing send completions from the HCA. However,
>are you saying that even when you are seeing this message, ping over
>IPoIB is working?
>

No, I didn't know there was any problem until IOZONE reported a read error from the Lustre client. BTW, the backend storage is iSCSI over 10 GbE using jumbo frames. This problem only appeared after our tuning effort: we increased the iSCSI payload to 1 MB, and increased the TCP window to 512 KB from 256 KB. I will shrink my TCP window and see if the problem goes away.

Thanks,
Helen
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Helen> Roland, Thank you for your response. That fixed my initial Helen> buffer allocation failure. After we tuned the Lustre and Helen> reran same IOZONE tests again, we got the following Helen> problem. Was there an actual network interrupt? If so, the Helen> problem is not obvious now; the two nodes are pinging over Helen> IPoIB. Please advice. That's very odd. This message: Helen> NETDEV WATCHDOG: ib0: transmit timed out Helen> ib0: transmit timeout: latency 1846 says that we are not seeing send completions from the HCA. However, are you saying that even when you are seeing this message, ping over IPoIB is working? - R.
Re: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN
Michael S. Tsirkin wrote: Quoting r. Arlin Davis <[EMAIL PROTECTED]>: Subject: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN Michael, The patch adds command line options for RDMA reads and starting PSN. I used these modifications to help isolate the RDMA read performance degradation with 4.6.2 firmware. -arlin Thanks Arlin. I plan to look into integrating this. One question: for which psn values do you see performance drop on 4.6.0 FW? A quick run at 1 and then 0x10 dropped from 682MB/s to 49MB/s for 32KB buffers. What is really strange is that it takes a couple runs to start seeing the drop in performance. PSN=1 no problems... [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x20406, PSN 0x0001 RKey 0x0c0032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0x20406, PSN 0x0001 RKey 0x0c0032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #999): 682.504 MB/sec Bandwidth average: 682.501 MB/sec Service Demand peak (#0 to #999): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x30406, PSN 0x0001 RKey 0x120032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0x30406, PSN 0x0001 RKey 0x120032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #990): 682.496 MB/sec Bandwidth average: 682.496 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x40406, PSN 0x0001 RKey 0x180032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0x40406, PSN 0x0001 RKey 0x180032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #990): 682.5 MB/sec Bandwidth average: 682.499 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB PSN=0x10 (start to see problems after first run) [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x10 -s 32768 -r iclust-19 
local address: LID 0x02, QPN 0xb0406, PSN 0x10 RKey 0x420032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0x90406, PSN 0x10 RKey 0x360032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #996): 682.5 MB/sec Bandwidth average: 682.499 MB/sec Service Demand peak (#0 to #996): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x10 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xc0406, PSN 0x10 RKey 0x480032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0xa0406, PSN 0x10 RKey 0x3c0032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #0): 48.5441 MB/sec Bandwidth average: 47.4502 MB/sec Service Demand peak (#0 to #0): 72244 cycles/KB Service Demand Avg : 73909 cycles/KB [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x10 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xd0406, PSN 0x10 RKey 0x4e0032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0xb0406, PSN 0x10 RKey 0x420032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #0): 48.4803 MB/sec Bandwidth average: 47.4501 MB/sec Service Demand peak (#0 to #0): 72339 cycles/KB Service Demand Avg : 73909 cycles/KB PSN = 1 (first run is bad, and then it is back to normal) [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xe0406, PSN 0x0001 RKey 0x540032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0xc0406, PSN 0x0001 RKey 0x480032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #0): 48.5798 MB/sec Bandwidth average: 47.4502 MB/sec Service Demand peak (#0 to #0): 72190 cycles/KB Service Demand Avg : 73909 cycles/KB [EMAIL PROTECTED] perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xf0406, PSN 0x0001 RKey 0x5a0032 VAddr 0x514000 RDMA_READ remote address: LID 0x05, QPN 0xd0406, PSN 0x0001 RKey 0x4e0032 VAddr 0x513000 RDMA_READ Bandwidth peak (#0 to #990): 682.492 MB/sec Bandwidth average: 682.49 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 
5138 cycles/KB -arlin
Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Roland, Thank you for your response. That fixed my initial buffer allocation failure. After we tuned the Lustre and reran same IOZONE tests again, we got the following problem. Was there an actual network interrupt? If so, the problem is not obvious now; the two nodes are pinging over IPoIB. Please advice. Thanks, Helen Dmesg Report from Lustre server - NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1846 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2846 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3846 Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down. LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021 LustreError: 10793:0:(ldlm_lib.c:506:target_handle_reconnect()) 460e5_lov2_7d3910bb5c reconnecting - Dmesg from Lustre client (192.168.2.79) -- NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 4965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 5965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 6965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 7965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 8965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 9965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 10965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 11965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 12965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 13965 NETDEV WATCHDOG: ib0: transmit timed out ib0: 
transmit timeout: latency 14965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 15965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 16965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 17965 Lustre: 10035:0:(socknal_cb.c:1326:ksocknal_process_receive()) [f6256000] EOF from 0xc0a80253 ip 192.168.2.83:988 LustreError: 10169:0:(client.c:568:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 [EMAIL PROTECTED] x13853/t0 o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107 LustreError: Connection to service on5-ost2 via nid 192.168.2.76 was lost; in progress operations using this service will wait for recovery to complete. Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED] LustreError: This client was evicted by on5-ost2; in progress operations using this service will fail. LustreError: 10413:0:(rw.c:1253:ll_readpage()) page c1538cc0 map f6193328 index 825344 flags 20001023 count 3 priv e91da940: lock match failed: rc -5 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13862/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13868/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 4 similar messages LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13880/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 11 similar messages Lustre: A connection with 192.168.2.75 timed out; the network or that node may be down. 
LustreError: 10041:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024b ip 192.168.2.75:988 Lustre: Connection restored to service on5-ost2 using nid 192.168.2.76. Lustre: 10496:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection restored to [EMAIL PROTECTED] LustreError: 10169:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129234515, 101s ago) [EMAIL PROTECTED] x13850/t0 o400->[EMAIL PROTECTED]:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0 LustreError: Connection to service on12-mds2 via nid 192.168.2.83 was lost; in progress operations using this service will wait for recovery to complete. Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) MDC_on8_on12-mds2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED] Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74. Lustre: 10170:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on3-ost2_MNT_on8-ib_2: connection restored to [EMAIL PROTECTED] _
[openib-general] Re: [PATCH] uDAPL async QP/CQ error handling fixed
On Thu, 13 Oct 2005, Arlin Davis wrote: > James, > > Patch will fix the async error handling and callback mappings. QP/CQ > error mappings were totally screwed up. Updated TODO list. > > -arlin Committed in revision 3774.
[openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Quoting r. Roland Dreier <[EMAIL PROTECTED]>: > Subject: Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to > allocate receive buffer > > Michael> Yes, it seems that if such an allocation fails IPoIB may > Michael> never repost the receive buffer. Is that right? > > I think so. > > My plan is to change the receive handling of IPoIB slightly, so that > if it can't allocate a new receive buffer, it reposts the old buffer > and drops the packet it just received. Sounds like a good idea. -- MST
[openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Michael> Yes, it seems that if such an allocation fails IPoIB may Michael> never repost the receive buffer. Is that right? I think so. My plan is to change the receive handling of IPoIB slightly, so that if it can't allocate a new receive buffer, it reposts the old buffer and drops the packet it just received. - R.
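The plan described above (repost the old buffer and drop the packet when no replacement can be allocated) can be sketched roughly as follows. This is a user-space illustration only, not the IPoIB driver code: demo_rx_slot, demo_rx_complete and the injected allocator are hypothetical names, and allocating the replacement before touching the old buffer is an assumption about how such a change might be ordered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BUF_SIZE 2048

/* Hypothetical receive slot: the buffer currently posted to the queue. */
struct demo_rx_slot {
	void *buf;
};

/* Allocator is passed in so that an allocation failure can be simulated. */
typedef void *(*demo_alloc_fn)(size_t);

/*
 * Handle one receive completion for slot.
 *
 * Returns the buffer holding the received packet (the caller owns it),
 * or NULL if the packet was dropped. In both cases slot->buf is left
 * pointing at a valid buffer, so the slot can always be reposted.
 */
static void *demo_rx_complete(struct demo_rx_slot *slot, demo_alloc_fn alloc)
{
	void *fresh;
	void *pkt;

	assert(slot != NULL && slot->buf != NULL);

	fresh = alloc(DEMO_BUF_SIZE);
	if (!fresh)
		return NULL;	/* drop the packet; the old buffer stays posted */

	pkt = slot->buf;	/* hand the filled buffer up the stack */
	slot->buf = fresh;	/* post the replacement */
	return pkt;
}

/* Allocator that always fails, used to exercise the drop path. */
static void *demo_alloc_fail(size_t size)
{
	(void)size;
	return NULL;
}
```

The point of the ordering is that the receive queue never loses a slot: an allocation failure costs one packet rather than permanently shrinking the ring.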
[openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN
Quoting r. Arlin Davis <[EMAIL PROTECTED]>: > Subject: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN > > Michael, > > The patch adds command line options for RDMA reads and starting PSN. I > used these modifications to > help isolate the RDMA read performance degradation with 4.6.2 firmware. > > -arlin Thanks Arlin. I plan to look into integrating this. One question: for which psn values do you see performance drop on 4.6.0 FW? -- MST
[openib-general] Re: Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Quoting r. Roland Dreier <[EMAIL PROTECTED]>: > IPoIB's handling of these allocation errors can definitely be improved Yes, it seems that if such an allocation fails IPoIB may never repost the receive buffer. Is that right? -- MST
[openib-general] [PATCH] uDAPL async QP/CQ error handling fixed
James, Patch will fix the async error handling and callback mappings. QP/CQ error mappings were totally screwed up. Updated TODO list. -arlin Signed-off by: Arlin Davis <[EMAIL PROTECTED]> Index: dapl/openib/TODO === --- dapl/openib/TODO(revision 3768) +++ dapl/openib/TODO(working copy) @@ -1,12 +1,10 @@ IB Verbs: - CQ resize -- mulitple CQ event support - memory window support DAPL: - reinit EP needs a QP timewait completion notification -- direct cq_wait_object when multi-CQ verbs event support arrives - shared receive queue support Under discussion: Index: dapl/openib/dapl_ib_util.c === --- dapl/openib/dapl_ib_util.c (revision 3768) +++ dapl/openib/dapl_ib_util.c (working copy) @@ -214,8 +214,11 @@ DAT_RETURN dapls_ib_open_hca ( /* Get list of all IB devices, find match, open */ dev_list = ibv_get_devices(); dlist_start(dev_list); - dlist_for_each_data(dev_list,hca_ptr->ib_trans.ib_dev,struct ibv_device) { - if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) + dlist_for_each_data(dev_list, + hca_ptr->ib_trans.ib_dev, + struct ibv_device) { + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_name)) break; } @@ -226,20 +229,22 @@ DAT_RETURN dapls_ib_open_hca ( return DAT_INTERNAL_ERROR; } - dapl_dbg_log (DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev), - (unsigned long long)bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); + dapl_dbg_log ( + DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + (unsigned long long) + bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); if (!hca_ptr->ib_hca_handle) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB dev open failed for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); return DAT_INTERNAL_ERROR; } hca_ptr->ib_trans.ib_ctx = 
hca_ptr->ib_hca_handle; - /* set inline max with enviromment or default, get local lid and gid 0 */ + /* set inline max with env or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); @@ -253,15 +258,17 @@ DAT_RETURN dapls_ib_open_hca ( } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, -" open_hca: GID subnet %016llx id %016llx\n", -(unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), -(unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); + " open_hca: GID subnet %016llx id %016llx\n", + (unsigned long long) + bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), + (unsigned long long) + bswap_64(hca_ptr->ib_trans.gid.global.interface_id)); /* get the IP address of the device using GID */ if (dapli_get_hca_addr(hca_ptr)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: ERR ib_at_ips_by_gid for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); goto bail; } @@ -310,15 +317,23 @@ DAT_RETURN dapls_ib_open_hca ( write(g_ib_pipe[1], "w", sizeof "w"); dapl_os_unlock(&g_hca_lock); - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " open_hca: %s, port %d, %s %d.%d.%d.%d INLINE_MAX=%d\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_family == AF_INET ? "AF_INET":"AF_INET6", - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, - hca_ptr->ib_trans.max_inline_send ); + dapl_dbg_log ( +
RE: [openib-general] [RFC] IB address translation using ARP
I agree with Mike's analysis. But I'd also like to point out that even when source compatibility is not a requirement, source familiarity is. That is, even when recoding is feasible, the API should only introduce new concepts as required to improve efficiency. The shift from the socket model to QP/CQ is challenging enough as is. It's also where the benefit is. Changing how the application requests and accepts connections just piles more things to learn onto developers' already very full plates, with nowhere near the same benefit. The simple, IP/DNS-centric methods that Mike outlined will work on either iWARP or IB, and are very easily understood by those familiar with existing sockets/IP network development. The more complex models provide minor enhancements for extreme corner cases at the very heavy cost of requiring the developer to understand a lot more about network topology.
RE: [openib-general] [RFC] IB address translation using ARP
At 03:14 PM 10/12/2005, Caitlin Bestler wrote: > -Original Message- > From: [EMAIL PROTECTED] > [ mailto:[EMAIL PROTECTED]] On Behalf Of Sean Hefty > Sent: Wednesday, October 12, 2005 2:36 PM > To: Michael Krause > Cc: openib-general@openib.org > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > Michael Krause wrote: > > 1. Applications want to use existing API to identify remote > endnodes / > > services. > > To clarify, the applications want to use IP based addressing > to identify remote endnodes. The connection API is under development. > No, I think Mike's comment was dead on. Applications want to use the existing API. They want to use the existing API even when the API is clearly defective. Note that there are several generations of host-resolution APIs for the IP world, with the earlier ones clearly being heavily inferior (not thread safe, not IPv4/IPv6 neutral, etc). But they have not been eliminated. Why? Because applications want to use the existing API. If application developers were rational and totally open to adopting new ideas instantly, then the active side would ask to make a connection to a *service*, not to a host with a service qualifier. A new API may be under development to meet new needs. But keep in mind that the application developers expect it to be as close to what they are used to as possible, and will grumble that it is not 100% compatible. This all comes down to economics, which is why some ULPs such as SDP are created. Let's examine SDP for a moment. The purpose of SDP is to enable synchronous and asynchronous Sockets applications to transparently run unmodified over an RDMA capable interconnect. Unmodified means no source code changes and no recompile required (this is possible if the Sockets library is a shared library and dynamically linked). The first part of unmodified means that the existing address / service resolution API calls work (further, no change to the address family, etc. 
is required to make this work either). Hence, pick any of the get* API calls that are in use today and they should just work. How does this work? The SDP implementation takes on the burden for the application developer. For iWARP, there really isn't anything special that has to be done, as these calls should all provide the necessary information. The port mapper protocol would be invoked, which would map to the actual RDMA listen QP and target RNIC. For IB, there is some additional work, both in using the SID and in resolving the IP address to the IB address vector, but the work isn't that hard to implement (we know this because it has all been implemented on various OSes within the industry). The same will be true for NFS/RDMA and iSER: again, all use the existing interfaces to identify the address / service and map to an address vector (and again, all of this has been implemented on various OSes within the industry). The above makes ISVs and customers very happy, as they can take advantage of RDMA technologies without having to go through the lengthy and expensive qualification process that comes when any application is modified / recompiled. This keeps costs low and improves TTM. As for the RDMA connection API, that is simply attempting to abstract to a common interface that any ULP implementation can use to access either iWARP or IB. The RDMA connection API should not be viewed as something end application developers will use; it is aimed at middleware developers. This allows everyone to use IP addresses, port spaces, etc. through the existing application API while allowing RDMA to transparently add some intelligence to the process and eventually enable new capabilities like policy management (e.g. how best to map ULP QoS needs to a given path, service rate, etc.) without perturbing everything above. Keeping things transparent is best for all. 
Attempting to require end application developers to modify their code will result in slower adoption and reduced utilization of RDMA technologies within the industry. It really is all about economics and re-using the existing ecosystem / infrastructure. Mike
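For readers less familiar with the "existing address / service resolution API calls" referred to in this thread, here is a minimal sketch in C of the thread-safe, IPv4/IPv6-neutral form of such a call, using the standard POSIX getaddrinfo(). The helper name resolve_endpoint is hypothetical and is not taken from SDP, OpenIB, or any poster's code; it only illustrates the kind of call an unmodified Sockets application already makes and that a transparent RDMA transport would sit underneath.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (illustration only): resolve a host name or numeric
 * address plus a service name or port string into the first address that
 * getaddrinfo() returns. Returns 0 on success, or the non-zero
 * getaddrinfo() error code on failure.
 */
int resolve_endpoint(const char *host, const char *service,
                     struct sockaddr_storage *out, socklen_t *outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(out != NULL && outlen != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* connection-oriented service */

    rc = getaddrinfo(host, service, &hints, &res);
    if (rc != 0)
        return rc;

    /* Copy the first result out so the caller does not own the list. */
    memcpy(out, res->ai_addr, res->ai_addrlen);
    *outlen = res->ai_addrlen;
    freeaddrinfo(res);
    return 0;
}
```

The application passes only a host and a service; nothing in the call names a QP, a GID, or a path, which is the transparency argument made above.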
[openib-general] [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN
Michael, The patch adds command line options for RDMA reads and starting PSN. I used these modifications to help isolate the RDMA read performance degradation with 4.6.2 firmware. -arlin Signed-off by: Arlin Davis <[EMAIL PROTECTED]> Index: rdma_bw.c === --- rdma_bw.c (revision 3768) +++ rdma_bw.c (working copy) @@ -304,7 +304,9 @@ static struct pingpong_context *pp_init_ * The Consumer is not allowed to assign Remote Write or Remote Atomic to * a Memory Region that has not been assigned Local Write. */ ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, -IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE); +IBV_ACCESS_REMOTE_WRITE | +IBV_ACCESS_REMOTE_READ | +IBV_ACCESS_LOCAL_WRITE); if (!ctx->mr) { fprintf(stderr, "Couldn't allocate MR\n"); return NULL; @@ -345,7 +347,9 @@ static struct pingpong_context *pp_init_ attr.qp_state= IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num= port; - attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_LOCAL_WRITE; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | @@ -370,7 +374,7 @@ static int pp_connect_ctx(struct pingpon attr.path_mtu = IBV_MTU_2048; attr.dest_qp_num= dest->qpn; attr.rq_psn = dest->psn; - attr.max_dest_rd_atomic = 1; + attr.max_dest_rd_atomic = 4; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = dest->lid; @@ -394,7 +398,7 @@ static int pp_connect_ctx(struct pingpon attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = my_psn; - attr.max_rd_atomic = 1; + attr.max_rd_atomic = 4; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT| @@ -417,6 +421,7 @@ static void usage(const char *argv0) printf("\n"); printf("Options:\n"); printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -P, --starting_psn starting sequence on QP (default random)\n"); printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); 
printf(" -s, --size= size of message to exchange (default 65536)\n"); @@ -487,6 +492,8 @@ int main(int argc, char *argv[]) int scnt, ccnt; int sockfd; int duplex = 0; + int rdma_read = 0; + int starting_psn = 0; struct ibv_qp *qp; cycles_t*tposted; @@ -498,16 +505,18 @@ int main(int argc, char *argv[]) static struct option long_options[] = { { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "starting_psn", .has_arg = 1, .val = 'P' }, { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port",.has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "tx-depth", .has_arg = 1, .val = 't' }, { .name = "bidirectional", .has_arg = 0, .val = 'b' }, + { .name = "rdma_read", .has_arg = 0, .val = 'r' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:n:t:b", long_options, NULL); + c = getopt_long(argc, argv, "p:P:d:i:s:n:t:br", long_options, NULL); if (c == -1) break; @@ -520,6 +529,14 @@ int main(int argc, char *argv[]) } break; + case 'P': + starting_psn = strtol(optarg, NULL, 0); + if (starting_psn <= 0) { + usage(argv[0]); + return 1; + } + break; + case 'd': ib_devname = strdupa(optarg); break; @@ -567,6 +584,10 @@ int main(int argc, char *argv[]) duplex = 1; break; +
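A side note on the -P option: InfiniBand packet sequence numbers are 24 bits wide, so a starting PSN taken from the command line can be range-checked before it is handed to the QP. The following is a minimal sketch of such a check, not part of the patch; parse_psn and PSN_MASK are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define PSN_MASK 0xffffffUL /* PSNs occupy 24 bits */

/*
 * Hypothetical helper (illustration only): parse a starting PSN given in
 * decimal, hex (0x...) or octal (0...). Returns 0 and stores the value in
 * *out on success; returns -1 if the text is empty, has trailing junk,
 * starts with a minus sign, or does not fit in 24 bits.
 */
int parse_psn(const char *arg, uint32_t *out)
{
    char *end;
    unsigned long val;

    assert(arg != NULL && out != NULL);

    /* strtoul() would silently negate a value with a leading '-'. */
    if (*arg == '-')
        return -1;

    errno = 0;
    val = strtoul(arg, &end, 0);
    if (errno != 0 || end == arg || *end != '\0' || val > PSN_MASK)
        return -1;

    *out = (uint32_t)val;
    return 0;
}
```

In an option loop this would replace the bare strtol() call, with a -1 return leading to the usage message.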
RE: [openib-general] QP with large starting sequence adds latency to RDMA READ???
> From: Arlin Davis [mailto:[EMAIL PROTECTED] > Sent: Thursday, October 13, 2005 9:42 AM > > Sean Hefty wrote: > > > Arlin Davis wrote: > > > >> I just noticed some RDMA read performance issues that seem to be > >> related to the QP starting sequence number. If I set the starting > >> sequence to 1 then all is fine but if I set it to 0x1 then it > >> seems to add ~40us to my 32KB RDMA read operation (polling for > >> completions). Has anyone seen anything like this? > > > > > > Has anyone else noticed this issue? You could try to reproduce this > > by using the rdma_bw test and changing the PSN. > > > > - Sean > > > > I added a starting PSN and RDMA READ option to the rdma_bw test and was > able to reproduce on a PCI-E adapter with 4.6.2 firmware. I retried on a > system with 4.7.0 and it looks like the problem is fixed. However, I > see nothing about this problem in the "bug fix" list in the release > notes. Can someone at Mellanox confirm this problem with RDMA reads and > add to release notes as a fix so it is documented somewhere? > > http://www.mellanox.com/products/fw_images/fw-25208-4_7_0-release_notes.pdf Note that I have seen similar behavior (drop in bandwidth) correlated to starting PSN using Winsock Direct under Windows, so this doesn't seem to be a uDAPL or Linux issue. As for Arlin, the issue disappeared in firmware 4.7.0, and I too would like to see some confirmation that there was an issue and that it was fixed. Thanks, - Fab
Re: [openib-general] QP with large starting sequence adds latency to RDMA READ???
Sean Hefty wrote: Arlin Davis wrote: I just noticed some RDMA read performance issues that seem to be related to the QP starting sequence number. If I set the starting sequence to 1 then all is fine but if I set it to 0x1 then it seems to add ~40us to my 32KB RDMA read operation (polling for completions). Has anyone seen anything like this? Has anyone else noticed this issue? You could try to reproduce this by using the rdma_bw test and changing the PSN. - Sean I added a starting PSN and RDMA READ option to the rdma_bw test and was able to reproduce on a PCI-E adapter with 4.6.2 firmware. I retried on a system with 4.7.0 and it looks like the problem is fixed. However, I see nothing about this problem in the "bug fix" list in the release notes. Can someone at Mellanox confirm this problem with RDMA reads and add to release notes as a fix so it is documented somewhere? http://www.mellanox.com/products/fw_images/fw-25208-4_7_0-release_notes.pdf -arlin
Re: [openib-general] mvapich-gen2 IA64 compile problem
Sayantan, Thanks for the reply. I was just using make in the mvapich-gen2 directory; that may call the script, I don't know. I'll take a look at the doc you suggested and go through the troubleshooting in there. John Sayantan Sur wrote: Hi John, * On Oct,6 John Partridge<[EMAIL PROTECTED]> wrote : Roland, Actually, I just checked (and reinstalled in case there was a problem) and libibverbs is installed OK and I still get the problem. The mvapich.make.[gcc,icc,pgi] script in the top level directory of MVAPICH-Gen2 includes all the library paths and appropriate -l's. Can you please tell us if you are using this script? There is a user guide in the distribution too (called: mvapich.user_guide.pdf), which lists some common troubleshooting issues when installing/using MVAPICH. Thanks, Sayantan. -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: [EMAIL PROTECTED]
Re: [openib-general] Migration Solution
On Thu, 2005-10-13 at 03:10, Mohit Katiyar, Noida wrote: > Hi all, > If anyone can suggest some good possible solution for migrating from > Clients FC Switch -> SAN connection > To > Clients---> IB network---> SAN Connection It depends on your storage. There are two choices here: iSER based IB storage and SRP based IB storage. > The most economical I can think of is > Clients -> IB Switch > IB FC gateway---> FC > Switch> SAN > But performance enhancement is doubtful > The Expensive but high performance will be > Clients > IB Switch-> SAN Yes, this is more direct and is higher performance, but is it more expensive? The tradeoff is the cost of the IB FC gateway versus the cost delta between native IB and FC based storage. The main issue is the availability of native IB storage solutions (I think several are emerging) and of the initiator side (there are iSER and SRP initiators available for OpenIB). > Does anyone having any other ideas or any other middleway? Not that I am aware of. -- Hal > Thanks > Mohit
[openib-general] Re: [PATCH] [ADDR] return gateway GID for non-local IP addresses
On Wed, 2005-10-12 at 19:39, Sean Hefty wrote: > The following patch returns the GID of the IP gateway for non-local > subnet IP addresses. > > Hal, does this change look correct to you? I don't have an easy way > to test this fully. Yes, this looks right. I think the address resolution part can be tested without a real gateway for the connection by just adding a route off the IPoIB subnet to some other endnode and trying to connect to something on that remote destination subnet. You should at least see the ARP complete for that next hop and the connect (perhaps) fail depending on the discrimination in the passive side on the IP address passed in the private data. -- Hal
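The next-hop decision described in this exchange (resolve the destination itself when it is on the local subnet, otherwise resolve the gateway) can be sketched as follows. This is only an illustration of the rule, not Sean's patch: the function name next_hop_ipv4 and its parameters are hypothetical, addresses are plain 32-bit IPv4 values in host byte order, and the real code would consult the kernel routing table rather than a single netmask and gateway.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): pick the IPv4 address whose
 * link-layer address (for IPoIB, the GID) should be resolved via ARP.
 * All arguments are in host byte order.
 */
uint32_t next_hop_ipv4(uint32_t dst, uint32_t local,
                       uint32_t netmask, uint32_t gateway)
{
    /* Same subnet: ARP for the destination directly. */
    if ((dst & netmask) == (local & netmask))
        return dst;

    /* Off-subnet: ARP for the gateway; its GID is what gets returned. */
    return gateway;
}
```

With a route added off the IPoIB subnet as Hal suggests, the second branch is the one exercised: the ARP completes for the next hop even if the connect to the remote destination later fails.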
[openib-general] Migration Solution
Hi all, Can anyone suggest a good solution for migrating from Clients -> FC Switch -> SAN connection to Clients -> IB network -> SAN connection? The most economical I can think of is Clients -> IB Switch -> IB FC gateway -> FC Switch -> SAN, but the performance enhancement is doubtful. The expensive but high-performance option would be Clients -> IB Switch -> SAN. Does anyone have any other ideas, or any other middle way? Thanks, Mohit