Re: [openib-general] IPoIB odd loopback packet from arp
Todd,

This does not look like an error. The first ARP is a broadcast (qpn=0xff), so it is received back at the sending interface and dropped. The second one is a unicast ARP (qpn=0x000404), so it is not received at the local interface.

On Mon, 2006-10-23 at 13:48 -0600, Todd Bowman wrote:
> Using the OFED 1.0 and OFED 1.1 stack I have noticed some rcvswrelay
> errors. I have tracked it down to the arp request. I can reproduce
> the problem with the following steps:
>
> (I have used both 2.6.14.14 and 2.6.18.1 kernels)
>
> ib109> arp -d ib110
> ib109> ping ib110 -c 2
>
> # ib_ipoib module debug
> 13:15:46 ib109 kernel: ib0: sending packet, length=60 address=f6187200 qpn=0xff
> 13:15:46 ib109 kernel: ib0: called: id 34, op 0, status: 0
> 13:15:46 ib109 kernel: ib0: send complete, wrid 34
> 13:15:46 ib109 kernel: ib0: called: id -2147483623, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x0369
> 13:15:46 ib109 kernel: ib0: dropping loopback packet
> 13:15:46 ib109 kernel: ib0: called: id -2147483622, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
> 13:15:46 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404
> 13:15:46 ib109 kernel: ib0: called: id 35, op 0, status: 0
> 13:15:46 ib109 kernel: ib0: send complete, wrid 35
> 13:15:46 ib109 kernel: ib0: called: id -2147483621, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 128 bytes, SLID 0x016d
> 13:15:47 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404
> 13:15:47 ib109 kernel: ib0: called: id 36, op 0, status: 0
> 13:15:47 ib109 kernel: ib0: send complete, wrid 36
> 13:15:47 ib109 kernel: ib0: called: id -2147483620, op 128, status: 0
> 13:15:47 ib109 kernel: ib0: received 128 bytes, SLID 0x016d
> 13:15:51 ib109 kernel: ib0: called: id -2147483619, op 128, status: 0
> 13:15:51 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
> 13:15:51 ib109 kernel: ib0: sending packet, length=60 address=f6e57520 qpn=0x000404
> 13:15:51 ib109 kernel: ib0: called: id 37, op 0, status: 0
> 13:15:51 ib109 kernel: ib0: send complete, wrid 37
>
> # tcpdump -i ib0
> 13:15:46.977578 arp who-has ib110 tell ib109 hardware #32
> 13:15:46.977682 arp reply ib110 is-at 00:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:11:59 hardware #32
> 13:15:46.977710 IP ib109 > ib110: icmp 64: echo request seq 0
> 13:15:46.977790 IP ib110 > ib109: icmp 64: echo reply seq 0
> 13:15:47.92 IP ib109 > ib110: icmp 64: echo request seq 1
> 13:15:47.977892 IP ib110 > ib109: icmp 64: echo reply seq 1
> 13:15:51.977076 arp who-has ib109 tell ib110 hardware #32
> 13:15:51.977094 arp reply ib109 is-at 00:02:00:14:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:3b:31 hardware #32
>
> # error dump
> rcvswrelayerrors:1 MT47396 Infiniscale-III 0x2c9010b022090[1] <> ib109 HCA-1 0x2c9023b30[1]
>
> 1) The ping is successful and the arp table is populated, so is this
> really a problem or a false positive?
> 2) The second arp does not generate an error (the error dump reports
> all new errors in switches). Why?
>
> Any ideas?
>
> Thanks in advance.
>
> Todd

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPOIB NAPI
On Mon, 2006-10-16 at 09:48 -0700, Roland Dreier wrote:
> Eli> Please diff to see my comments. Generally it looks like the
> Eli> condition on netif_rx_reschedule() should be inverted.
>
> Why? A return value of 0 means that the reschedule failed (probably
> because the poll routine is already running somewhere else) and the
> poll routine should just return. I think the code is correct as it stands.
>
> Eli> Also you need to set max to some large value, since you don't
> Eli> know how many completions you missed and you want to make
> Eli> sure you get all the ones that sneaked in between the last
> Eli> poll and the request notify.
>
> Why? max is there to limit us from doing more work than the quota
> passed in from the networking stack. If we fail to drain the CQ
> because we exhaust max, then the poll routine will return 1 and will
> remain scheduled, so the networking stack will call the poll routine
> again to continue grabbing completions.
>
> - R.

OK, I see what you mean. So I guess it's OK then.
Re: [openib-general] IPOIB NAPI
On Sun, 2006-10-15 at 09:39 -0700, Roland Dreier wrote:
> I've been meaning to mention this... I have a preliminary version in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ipoib-napi
>
> There are further changes I would like to add on top of that, but
> comments on the two patches there would be appreciated. And also
> benchmarks would be good.

Please diff to see my comments. Generally it looks like the condition on netif_rx_reschedule() should be inverted. Also you need to set max to some large value, since you don't know how many completions you missed and you want to make sure you get all the ones that sneaked in between the last poll and the request notify.

int ipoib_poll(struct net_device *dev, int *budget)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	int max = min(*budget, dev->quota);
	int done;
	int t;
	int empty;
	int missed_event;
	int n, i;

repoll:
	done = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget -= done;

	if (empty) {
		netif_rx_complete(dev);
		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
		if (missed_event && !netif_rx_reschedule(dev, 0)) {
			max = 1000;
			goto repoll;
		}
		return 0;
	}

	return 1;
}
[openib-general] IPOIB NAPI
Hi Roland, can you tell me when you are going to push your NAPI patch to ipoib? Is there anything I can do to help make this happen?
[openib-general] [PATCH] IB/ipoib - possible deadlock in path query and join
When a path query or a join fails immediately, we want to call complete(&obj->done). This is required to avoid a deadlock that can occur if there is no further attempt at the operation (query or join) and wait_for_completion() is called on obj->done due to a flush operation.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---
Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-10-05 09:09:27.0 +0200
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-10-05 09:12:00.0 +0200
@@ -504,6 +504,7 @@ static int path_rec_start(struct net_dev
 	if (path->query_id < 0) {
 		ipoib_warn(priv, "ib_sa_path_rec_get failed\n");
 		path->query = NULL;
+		complete(&path->done);
 		return path->query_id;
 	}
Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-10-05 09:09:23.0 +0200
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-10-05 12:23:25.0 +0200
@@ -371,6 +371,7 @@ static int ipoib_mcast_sendonly_join(str
 	if (ret < 0) {
 		ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n",
 			   ret);
+		complete(&mcast->done);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT
 				", starting join\n",
@@ -501,6 +502,7 @@ static void ipoib_mcast_join(struct net_
 	if (ret < 0) {
 		ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n",
 			   ret);
+		complete(&mcast->done);
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
Re: [openib-general] [PATCH] IB/ipoib: NAPI
On Tue, 2006-09-26 at 17:35 -0400, Bernard King-Smith wrote:
> Has anyone run the RR test in Netperf to look at latency? What 1 byte
> RR rates did you see before and after applying the patch.

netperf TCP_RR:

              rate [tran/sec]
without NAPI: 39600
with NAPI:    39470

As you can see, there is only a minor difference.
Re: [openib-general] heads-up - ipoib NAPI
On Tue, 2006-09-26 at 21:34 -0700, Shirley Ma wrote:
> NAPI patch moves ipoib poll from hardware interrupt context to softirq
> context. It would reduce the hardware interrupts, reduce hardware
> latency and induce some network latency. It might reduce cpu
> utilization. But I still question the BW improvement. I did see
> varying performance with the same test under the same conditions.

When you open just one connection you can see around 10% variation in the BW measurement. But then you don't utilize all the CPU power you have and you don't reach the threshold where NAPI becomes effective. Using multiple connections utilizes all CPUs in the system, increases the send rate, and increases the chances that the receiver polls CQEs up to its quota and is scheduled again without re-enabling interrupts.

> Have you tested this patch with different message sizes, different
> socket sizes? Are these results consistently better?

I used large socket sizes, but I think that with multiple connections this parameter does not have a significant effect.
Re: [openib-general] heads-up - ipoib NAPI
On Thu, 2006-09-21 at 19:31 -0700, harish wrote:
> How did the CPU utilizations compare for the NAPI vs. no NAPI case?
> What are your thoughts on what bottleneck you are hitting?

The CPU utilization reported by netperf is not accurate since it is not reported on a per-CPU basis, so I don't have one number that reliably describes utilization. I think the bottleneck is that the CPU that handles the softirq processing the rx sk_buffs is 100% utilized, and it dictates the limit.
Re: [openib-general] [PATCH] IB/ipoib: NAPI
I experimented with your patch and could not find any noticeable change in BW or interrupt rate, so I guess we can use your ideas. Still, note that NAPI_HOWTO.txt does not read *budget to limit the number of polls, and the code below from kernel 2.6.17.7 takes into account that budget can become negative.

static void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	unsigned long start_time = jiffies;
	int budget = netdev_budget;
	void *have;

	local_irq_disable();

	while (!list_empty(&queue->poll_list)) {
		struct net_device *dev;

		if (budget <= 0 || jiffies - start_time > 1)
			goto softnet_break;
	...
}

> I took a stab at implementing this myself, and it

> You might want to respin your patch against my for-2.6.19 branch

Do you think I should work on this, or do you plan to push your code?

>> The biggest problem I have with this is that I don't know what to call
>> the feature bit. Any suggestions?
> Maybe set a bit for the lack of the feature? REQUIRES_POLL_AFTER_ARM?

We can take Michael's suggestion or use NOT_REQUIRES_POLL_AFTER_ARM so we can implement this for mthca without touching ipath or ehca in the first step.
Re: [openib-general] [PATCH] IB/ipoib: NAPI
> However I don't think we want to use peek CQ always -- I think that
> extra CQ lock/unlock may kill a lot of the performance gain you see
> with NAPI (and I don't think even mthca can do a lockless CQ peek,
> since we need to protect against races with resize CQ, etc). So
> probably what we need is a feature bit in the struct ib_device to say
> whether the peek CQ is needed or whether req notify will generate
> events for existing CQEs.

Sounds good to me.

> You might want to respin your patch against my for-2.6.19 branch -- I
> already split up the handle WC routine into separate send and receive
> functions, so the patch will become much smaller.

Sure, I can send an updated patch.

> Also, the handling of how many completions to poll and the logic of
> when to call netif_rx_complete() seems very strange to me. First, you
> totally ignore the budget parameter,

Not totally. I update it whenever I decide it is a "not_done", which happens in two cases:
a. I finish my quota
b. I handled any completions - rx or tx.

I count tx as well since I want to coalesce as many completions as possible. I do not update quota and budget when I exit polling mode because I think there is no point in doing that (this is unlike the example in NAPI_HOWTO.txt, but I will check again whether this is the right thing to do).
> My poll routine ended up looking like the following, which I think is
> more correct:
>
> +int ipoib_poll(struct net_device *dev, int *budget)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	int max = min(*budget, dev->quota);
> +	int done = 0;
> +	int t;
> +	int empty = 0;
> +	int n, i;
> +
> +	while (max) {
> +		t = min(IPOIB_NUM_WC, max);
> +		n = ib_poll_cq(priv->cq, t, priv->ibwc);
> +
> +		for (i = 0; i < n; ++i) {
> +			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
> +				++done;
> +				--max;
> +				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
> +			} else
> +				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
> +		}
> +
> +		if (n != t) {
> +			empty = 1;
> +			break;

I don't think this is the right thing to do. Polling fewer completions than you could is no reason to quit polling mode. You may receive more completions by the next time you get called.

> +		}
> +	}
> +
> +	dev->quota -= done;
> +	*budget -= done;
> +
> +	if (empty) {
> +		netif_rx_complete(dev);
> +		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
> +		/* XXX rotting packet! */
> +		return 0;
> +	}
> +
> +	return 1;
> +}
Re: [openib-general] heads-up - ipoib NAPI
> Hi Eli,
>
> Thanks for sharing the results with us. It is great to see the reduction in
> interrupts. Could you please specify the netperf test specifications
> [message size; socket size]. Wondering what the numbers would be if we use
> large socket and message sizes [128K & 64K respectively]. The reason for the
> request is to make sure we are not hitting any TCP related bottleneck while
> comparing NAPI vs. no NAPI cases. Please let me know what you think.

I used large socket buffer sizes. Here is the command line I used; the bandwidth result is the sum over all the connections.

netperf -H 11.4.3.144 -l 600 -f M -p $port -- -s 20,20 -S 20,20
[openib-general] [PATCH] IB/ipoib: NAPI
This patch implements NAPI for ipoib. It is a draft implementation. I would like your opinion on whether we need a module parameter to control if NAPI should be activated or not. Also there is a need to implement peek_cq and call it after ib_req_notify_cq() so as to know if there is a need to call netif_rx_schedule_prep() again.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---
Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-09-21 16:30:35.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-09-21 16:30:42.0 +0300
@@ -69,6 +69,8 @@
 MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 #endif
 
+static const int poll_def_weight = 64;
+
 struct ipoib_path_iter {
 	struct net_device *dev;
 	struct ipoib_path path;
@@ -91,6 +93,9 @@
 	.remove = ipoib_remove_one
 };
 
+
+int ipoib_poll(struct net_device *dev, int *budget);
+
 int ipoib_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -689,6 +694,7 @@
 		goto out;
 	}
 
+
 	if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		spin_lock(&priv->lock);
 		__skb_queue_tail(&neigh->queue, skb);
@@ -892,6 +898,7 @@
 	/* Delete any child interfaces first */
 	list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
+		netif_poll_disable(priv->dev);
 		unregister_netdev(cpriv->dev);
 		ipoib_dev_cleanup(cpriv->dev);
 		free_netdev(cpriv->dev);
@@ -919,6 +926,8 @@
 	dev->hard_header = ipoib_hard_header;
 	dev->set_multicast_list = ipoib_set_mcast_list;
 	dev->neigh_setup = ipoib_neigh_setup_dev;
+	dev->poll = ipoib_poll;
+	dev->weight = poll_def_weight;
 
 	dev->watchdog_timeo = HZ;
@@ -1097,6 +1106,8 @@
 		goto register_failed;
 	}
 
+	netif_poll_enable(priv->dev);
+
 	ipoib_create_debug_files(priv->dev);
 
 	if (ipoib_add_pkey_attr(priv->dev))
@@ -,6 +1122,7 @@
 	return priv->dev;
 
 sysfs_failed:
+	netif_poll_disable(priv->dev);
 	ipoib_delete_debug_files(priv->dev);
 	unregister_netdev(priv->dev);
@@ -1168,6 +1180,7 @@
 	dev_list = ib_get_client_data(device, &ipoib_client);
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
+		netif_poll_disable(priv->dev);
 		ib_unregister_event_handler(&priv->event_handler);
 		flush_scheduled_work();
Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-21 16:30:38.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-21 17:24:59.0 +0300
@@ -169,7 +169,7 @@
 	return 0;
 }
 
-static void ipoib_ib_handle_wc(struct net_device *dev,
+static void ipoib_ib_handle_rwc(struct net_device *dev,
 			       struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -178,122 +178,186 @@
 	ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n",
 		       wr_id, wc->opcode, wc->status);
 
-	if (wr_id & IPOIB_OP_RECV) {
-		wr_id &= ~IPOIB_OP_RECV;
-
-		if (wr_id < ipoib_recvq_size) {
-			struct sk_buff *skb = priv->rx_ring[wr_id].skb;
-			dma_addr_t addr = priv->rx_ring[wr_id].mapping;
-
-			if (unlikely(wc->status != IB_WC_SUCCESS)) {
-				if (wc->status != IB_WC_WR_FLUSH_ERR)
-					ipoib_warn(priv, "failed recv event "
-						   "(status=%d, wrid=%d vend_err %x)\n",
-						   wc->status, wr_id, wc->vendor_err);
-				dma_unmap_single(priv->ca->dma_device, addr,
-						 IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
-				dev_kfree_skb_any(skb);
-				priv->rx_ring[wr_id].skb = NULL;
-				return;
-			}
-
-			/*
-			 * If we can't allocate a new RX buffer, dump
-			 * this packet and reuse the old buffer.
-			 */
-			if (unlikely(ipoib_alloc_rx_skb(dev, wr
[openib-general] heads-up - ipoib NAPI
Hi, I have a draft implementation of NAPI in ipoib and got the following results:

System description
===
Quad CPU EM64T 2.4 GHz
4 GB RAM
MT25204 Sinai HCA

I used netperf for benchmarking; the BW test ran for 600 seconds with 8 clients and 8 servers. The results I received are below:

netperf TCP_STREAM:

              BW [MByte/sec]  client side [irqs/sec]  server side [irqs/sec]
without NAPI: 506             86441                   66311
with NAPI:    550             6830                    13600

netperf TCP_RR:

              rate [tran/sec]
without NAPI: 39600
with NAPI:    39470

Please note this is still a work in progress, and we plan to do more tests and measurements on other devices.
[openib-general] [PATCH] IB/ipoib: unlikely in send
Use unlikely in the send flow.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---
Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-21 16:19:33.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-21 16:20:39.0 +0300
@@ -385,7 +385,7 @@
 	struct ipoib_tx_buf *tx_req;
 	dma_addr_t addr;
 
-	if (skb->len > dev->mtu + INFINIBAND_ALEN) {
+	if (unlikely(skb->len > dev->mtu + INFINIBAND_ALEN)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 			   skb->len, dev->mtu + INFINIBAND_ALEN);
 		++priv->stats.tx_dropped;
[openib-general] [PATCH] IB/ipoib: likely/unlikely annotations
Use likely/unlikely in the data tx flow.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Acked-by: Michael S. Tsirkin <[EMAIL PROTECTED]>
---
Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-09-21 15:43:49.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2006-09-21 15:46:26.0 +0300
@@ -643,7 +643,7 @@
 	struct ipoib_neigh *neigh;
 	unsigned long flags;
 
-	if (!spin_trylock_irqsave(&priv->tx_lock, flags))
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
 		return NETDEV_TX_LOCKED;
 
 	/*
@@ -656,7 +656,7 @@
 		return NETDEV_TX_BUSY;
 	}
 
-	if (skb->dst && skb->dst->neighbour) {
+	if (likely(skb->dst && skb->dst->neighbour)) {
 		if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
 			ipoib_path_lookup(skb, dev);
 			goto out;
Re: [openib-general] [PATCH] ipoib mcast restart
> Why is the ipoib_mcast_start_thread() at the end of ipoib_ib_dev_up()
> not sufficient to rejoin all the mcgs?

Because after a port event all the mcast groups on the device are flushed; all that remains is the dev->mclist, and we must renew the joins from there.
Re: [openib-general] ipoib multicast problem
> I don't understand. How could ipoib rejoin the broadcast group and
> then not rejoin the rest of the full member groups it has?

That is because the broadcast group is not part of the multicast groups maintained by the kernel but rather is part of ipoib and is joined from a different function. The other full-member groups are maintained by the kernel for the net device and come from dev->mclist.
[openib-general] ipoib multicast problems on RHEL4.0 u4
Hi, while testing ipoib multicast on RHEL4.0 u4, I noticed that setsockopt() succeeds in adding a multicast group to an interface, but the multicast group is not actually added to the net_device. This means that an application cannot join a multicast group as a full member. When I examined the differences between the kernel sources for u3 and u4, I noticed that essential code was removed:

diff -ru net/ipv4/arp.c ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c
--- net/ipv4/arp.c	2006-09-18 15:35:03.0 +0300
+++ ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c	2006-09-19 10:08:06.0 +0300
@@ -213,9 +213,6 @@
 	case ARPHRD_IEEE802_TR:
 		ip_tr_mc_map(addr, haddr);
 		return 0;
-	case ARPHRD_INFINIBAND:
-		ip_ib_mc_map(addr, haddr);
-		return 0;
 	default:
 		if (dir) {
 			memcpy(haddr, dev->broadcast, dev->addr_len);

Can anyone suggest a workaround for this issue?

Thanks,
Eli
[openib-general] [PATCH] ipoib mcast restart
Make sure that after ipoib_ib_dev_flush is executed, ipoib_mcast_restart_task is executed as well, to rejoin all the mcast groups maintained by the kernel for the device.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-14 17:20:06.0 +0300
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2006-09-17 15:51:52.0 +0300
@@ -619,8 +619,10 @@
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
 	 */
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) {
 		ipoib_ib_dev_up(dev);
+		ipoib_mcast_restart_task(dev);
+	}
 
 	mutex_lock(&priv->vlan_mutex);
[openib-general] ipoib multicast problem
Hi, I have seen the following problem with ipoib:

1. An application registers to a multicast group as a full member. As a result, all the groups are listed in dev->mclist.
2. The InfiniBand link goes down momentarily, opensm is restarted, etc.
3. All multicast memberships are flushed.
4. The net device will not rejoin until something later causes ipoib_set_mcast_list() to be called.
[openib-general] [PATCH] ipoib sendonly join
When a sendonly join fails, mcast->query must be set to NULL so that successive joins will be attempted for the group.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-09-12 14:28:33.0 +0300
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-09-14 17:17:12.0 +0300
@@ -326,6 +326,7 @@
 		/* Clear the busy flag so we try again */
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		mcast->query = NULL;
 	}
 
 	complete(&mcast->done);
[openib-general] ipoib send only failure
Hi, when running a test I encountered the following scenario:

- The test sends to a multicast address.
- ipoib issues a send-only join, which fails.
- Successive joins to this group will not be attempted, since the query field of the mcast object still holds the old pointer.
Re: [openib-general] PXE + infiniband?
On Thu, 2006-09-07 at 08:19 +0100, Paul Baxter wrote:
> > There is an implementation of PXE for Mellanox's HCAs that can be found
> > here: http://sourceforge.net/forum/forum.php?forum_id=494529
>
> Thanks for the tip
>
> I, too, am interested in this.
>
> Do you have a more direct link as I wandered around etherboot's project site
> and couldn't find anything IB-specific.
>
> Paul Baxter

Hi,

Please use the following link to download the package:
http://kent.dl.sourceforge.net/sourceforge/etherboot/etherboot-5.4.2.tar.bz2

Unpack the package and cd to the src dir. Use an x86 arch machine to build the binaries. The InfiniBand drivers are located at src/drivers/net/mlx_ipoib/, where you can find a readme file in the doc directory.

To build:

cd src
make bin/MT23108.zrom	// for MT23108
make bin/MT25208.zrom
make bin/MT25218.zrom

This covers all Mellanox HCAs. Please let me know if you need more assistance.
Re: [openib-general] PXE + infiniband?
On Fri, 2006-09-01 at 14:51 -0400, Cain, Brian (GE Healthcare) wrote: > A while back > (http://openib.org/pipermail/openib-general/2005-September/010801.html) > there was mention of putting PXE stuff on an HCA. Has anyone done this > with PXELINUX? It doesn't seem like it's as straightforward as just > putting the stock PXELINUX image on your HCA. I'm assuming this image > would have to recognize the HCA and bring up IPoIB in order to use the > conventional TFTP transport? There is an implementation of PXE for Mellanox's HCAs that can be found here: http://sourceforge.net/forum/forum.php?forum_id=494529 ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] huge pages support
On Fri, 2006-08-18 at 15:23 +0200, Robert Rex wrote:
> Hello,
>
> I've also worked on the same topic. Here is what I've done so far; I
> successfully tested it on mthca and ehca. I'd appreciate comments and
> suggestions.
>
> +	down_read(&current->mm->mmap_sem);
> +	if (is_vm_hugetlb_page(find_vma(current->mm, (unsigned long) addr))) {
> +		use_hugepages = 1;
> +		region_page_mask = HPAGE_MASK;
> +		region_page_size = HPAGE_SIZE;

This might cause a kernel oops if the address passed by the user does not belong to the process's address space. In that case find_vma() might return NULL and is_vm_hugetlb_page() will crash. And even if find_vma() returns a non-NULL value, that still does not guarantee that the returned vma is the one that contains the address. You need to check that the address is greater than or equal to vma->vm_start.
Re: [openib-general] [PATCH] huge pages support
Oh that one I have not seen... but it looks like this is the approach I took in VAPI and somehow it looked cumbersome to me. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier Sent: Sunday, August 20, 2006 7:02 PM To: Tziporet Koren Cc: Openib-general@openib.org Subject: Re: [PATCH] huge pages support Roland> Sure, you are building the OFED release so you can put Roland> whatever you want into it. ...although maybe it would be a good idea to follow the approach of the second patch posted, and make multiple get_user_pages() calls, skipping along by HPAGE_SIZE. This avoids having all the extra work of follow_hugetlb_page() creating extra fake pages and then calling put_page() many times. - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] huge pages support
On Thu, 2006-08-17 at 09:06 -0700, Roland Dreier wrote:
> My first reaction on reading this is that it looks like we need some
> more help from mm/ to make this cleaner.
>
> > +	for (j = 0; j < umem->page_size / PAGE_SIZE; ++j)
> > +		put_page(chunk->page_list[i].page);
>
> This just looks really bad... you're assuming that calling put_page()
> on a different page than the one get_user_pages() gave you is OK.
> Does that actually work? (I don't remember the details of how hugetlb
> pages work)

A huge page is a compound page comprised of a number of pages, and the first one represents all the others, so it is safe to do this.

> > +	if (hpage == -1)
> > +		hpage = is_vm_hugetlb_page(vma);
> > +	else
> > +		if (hpage != is_vm_hugetlb_page(vma))
> > +			return -ENOMEM;
>
> Users aren't allowed to register regions that span both hugetlb and
> non-hugetlb pages?? Why not?

It is not that easy anyway to obtain huge pages, and it is even more unlikely that a user will want to register a region with different page sizes, so I just don't allow that. But I think these cases could be allowed by falling back to PAGE_SIZE registration.

> > +	if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) {
>
> shift_diff is used here...
>
> > +		ret = -ENOMEM;
> > +		goto exit_up;
> > +	}
> > +
> > +	page_size = 1 << page_shift;
> > +	page_mask = ~(page_size - 1);
> > +	shift_diff = page_shift - PAGE_SHIFT;
>
> and set down here... don't you get a warning about "is used uninitialized"?

Oops. I didn't see any warning. This if should be moved a few lines down.

> Also, if someone tries to register a hugetlb region and a regular page
> isn't big enough to hold all the pointers, then they just lose?? This
> would be a real problem if/when GB pages are implemented on amd64 --
> there would be no way to register such a region.
>
> Basically this patch seems to be trying to undo what follow_hugetlb_pages()
> does to create normal pages for the pieces of a hugetlb page. It
> seems like it would be better to have a new function like get_user_pages()
> that returned its result in a struct scatterlist so that it could give
> back huge pages directly.

I agree about that.
[openib-general] [PATCH] huge pages support
The following patch adds support for registering huge page memory regions while passing the correct page size to the low level driver, thus conserving MTT entries and the kernel memory used to store page descriptors. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: last_stable/drivers/infiniband/core/uverbs_mem.c === --- last_stable.orig/drivers/infiniband/core/uverbs_mem.c +++ last_stable/drivers/infiniband/core/uverbs_mem.c @@ -35,6 +35,7 @@ */ #include +#include #include #include "uverbs.h" @@ -49,7 +50,7 @@ struct ib_umem_account_work { static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { struct ib_umem_chunk *chunk, *tmp; - int i; + int i, j; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { dma_unmap_sg(dev->dma_device, chunk->page_list, @@ -57,13 +58,43 @@ static void __ib_umem_release(struct ib_ for (i = 0; i < chunk->nents; ++i) { if (umem->writable && dirty) set_page_dirty_lock(chunk->page_list[i].page); - put_page(chunk->page_list[i].page); + for (j = 0; j < umem->page_size / PAGE_SIZE; ++j) + put_page(chunk->page_list[i].page); } kfree(chunk); } } +static int get_page_shift(void *addr, size_t size) +{ + struct vm_area_struct *vma; + unsigned long va = (unsigned long)addr; + int hpage = -1; + +next: + vma = find_vma(current->mm, va); + if (!vma) + return -ENOMEM; + + if (va < vma->vm_start) + return -ENOMEM; + + if (hpage == -1) + hpage = is_vm_hugetlb_page(vma); + else + if (hpage != is_vm_hugetlb_page(vma)) + return -ENOMEM; + + if ((va + size) > vma->vm_end) { + size -= (vma->vm_end - va); + va = vma->vm_end; + goto next; + } + + return hpage ? 
HPAGE_SHIFT : PAGE_SHIFT; +} + int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, void *addr, size_t size, int write) { @@ -73,9 +104,12 @@ int ib_umem_get(struct ib_device *dev, s unsigned long lock_limit; unsigned long cur_base; unsigned long npages; + int page_shift, shift_diff, nreg_pages, tmp; + int max_page_chunk; + unsigned long page_mask, page_size; int ret = 0; int off; - int i; + int i, j, nents; if (!can_do_mlock()) return -EPERM; @@ -84,17 +118,36 @@ int ib_umem_get(struct ib_device *dev, s if (!page_list) return -ENOMEM; + down_write(&current->mm->mmap_sem); + + page_shift = get_page_shift(addr, size); + if (IS_ERR_VALUE(page_shift)) { + ret = page_shift; + goto exit_up; + } + + /* make sure enough pointers get into PAGE_SIZE to + contain at least one huge page */ + if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) { + ret = -ENOMEM; + goto exit_up; + } + + page_size = 1 << page_shift; + page_mask = ~(page_size - 1); + shift_diff = page_shift - PAGE_SHIFT; + mem->user_base = (unsigned long) addr; mem->length= size; - mem->offset= (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; + mem->offset= (unsigned long) addr & ~page_mask; + mem->page_size = page_size; mem->writable = write; INIT_LIST_HEAD(&mem->chunk_list); - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; - down_write(&current->mm->mmap_sem); + nreg_pages = ALIGN(size + mem->offset, page_size) >> page_shift; + npages = nreg_pages << shift_diff; locked = npages + current->mm->locked_vm; lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -104,14 +157,15 @@ int ib_umem_get(struct ib_device *dev, s goto out; } - cur_base = (unsigned long) addr & PAGE_MASK; + cur_base = (unsigned long) addr & page_mask; while (npages) { + tmp = min_t(int, npages, + (PAGE_SIZE / (sizeof (struct page *) << shift_diff)) + << shift_diff); ret = get_user_pages(current, current->mm, cur_base, -min_t(int, npages, - PAGE_SIZE / sizeof (struct page *)), +tmp,
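The size/offset arithmetic the patch introduces (page_size, page_mask, nreg_pages, shift_diff) can be sketched standalone in plain C. The constants below are illustrative assumptions (4 KB base pages, 2 MB huge pages as on x86-64), not values taken from the patch:

```c
#include <assert.h>

#define EX_PAGE_SHIFT  12	/* assumed 4 KB base pages */
#define EX_HPAGE_SHIFT 21	/* assumed 2 MB huge pages (x86-64 style) */

/* How many base pages make up one registered page; this is the trip
 * count of the put_page() loop (umem->page_size / PAGE_SIZE). */
static unsigned long subpages_per_page(int page_shift)
{
	return 1UL << (page_shift - EX_PAGE_SHIFT);
}

/* Registered pages covering [addr, addr + size), mirroring
 * nreg_pages = ALIGN(size + mem->offset, page_size) >> page_shift. */
static unsigned long nreg_pages(unsigned long addr, unsigned long size,
				int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long offset = addr & (page_size - 1);

	return (size + offset + page_size - 1) >> page_shift;
}
```

With these assumptions a 2 MB huge page accounts for 512 base pages, and npages = nreg_pages << shift_diff recovers the base-page count used for the RLIMIT_MEMLOCK check.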
Re: [openib-general] potential bug: ipoib freeing ah before completion
The issue was originally raised by Eitan Rabin. -Original Message- From: Michael S. Tsirkin Sent: Monday, June 12, 2006 3:49 PM To: openib-general@openib.org; Roland Dreier Cc: Eli Cohen Subject: potential bug: ipoib freeing ah before completion Hello, Roland! The following was noted by Eli Cohen: It seems that ipoib_flush_paths can be called while completions are still outstanding on IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens, an address handle might get freed while a work request is still outstanding for it. This can trigger a local QP error, and IPoIB will stop working, until QP is reset. Please comment. -- MST
[openib-general] [Bug 122] New: mad layer problem
Hi, Here is some info: Attached are the SysRq messages. The relation of MADs to ARP is that after ARP resolves a hardware address it is required to use an SM query to resolve the path to the host bearing the hardware address. How to invoke the tests: Attached one readme file and one configuration file. Eli messages.bz2 Description: messages.bz2 114-115-800conn.conf Description: 114-115-800conn.conf Running the test: On both machines: make sure the sdp module is up on both machines and ping over ib0 is working export LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so export LIBSDP_CONFIG_FILE=/usr/local/ofed/etc/libsdp.conf run: sw114 SERVER side: /usr/local/polygraph/bin/polysrv --config /usr/local/polygraph/114-115-800conn.conf --verb_lvl 10 sw115 CLIENT side: /usr/local/polygraph/bin/polyclt --config /usr/local/polygraph/114-115-800conn.conf --verb_lvl 10 In less than 5 minutes IPoIB will stop working (try ping) and the poly test will also fail In case you need to compile polygraph: 1. cd /mswg/work/alberto/polygraph/polygraph-2.8.1/ 2. ./configure --prefix=/usr/local/polygraph 3. make clean && make && make install 4. cp 114-115-800conn.conf /usr/local/polygraph/
Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic
> More clarification: we saw two races here: > 1. path_free() was called by both unicast_arp_send() and > ipoib_flush_paths() at the same time. It is not possible to call path_free() on the same object from both unicast_arp_send() and ipoib_flush_paths(). This is because unicast_arp_send() calls it only for newly created objects for which path_rec_create() failed, in which case the object was never inserted into the list or the rb_tree. > 2. during unicast arp skb retransmission, unicast_arp_send() appended > the skb on the list, while ipoib_flush_paths() calling path_free() to > free the same skb from the list. I don't see any issue here either. Can you reproduce the crash? If you do, can you send how?
[openib-general] [PATCH] ipoib
Sorry about the badly formatted previous email - here it is again... When ipoib_stop is called it first calls netif_stop_queue() to stop the kernel from passing more packets to the network driver. After that, ipoib_ib_dev_stop() moves the QP to error, causing all pending work requests to complete with error, but then the completion handler may call netif_wake_queue(), re-enabling packet transfer. This patch makes sure netif_wake_queue() is called only if IPOIB_FLAG_ADMIN_UP is set. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Signed-off-by: Michael Tsirkin <[EMAIL PROTECTED]> Index: linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c === --- linux-kernel.orig/infiniband/ulp/ipoib/ipoib_ib.c 2006-05-28 12:53:19.661381000 +0300 +++ linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c2006-05-28 13:00:02.512096000 +0300 @@ -269,6 +269,7 @@ spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags);
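The wake condition in the patch is easy to model: tx_head and tx_tail are free-running counters, so unsigned subtraction gives the ring occupancy correctly even across wraparound. A minimal sketch of the patched condition (function name and sendq size are hypothetical):

```c
#include <assert.h>

/* Mirror of the patched test: wake the queue only when the interface is
 * administratively up AND at most half the send queue is outstanding. */
static int should_wake_queue(unsigned int tx_head, unsigned int tx_tail,
			     unsigned int sendq_size, int admin_up)
{
	return admin_up && (tx_head - tx_tail <= (sendq_size >> 1));
}
```

The admin_up check is exactly what the added test_bit(IPOIB_FLAG_ADMIN_UP, ...) contributes: without it, a completion arriving after ipoib_stop() could re-enable the queue.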
Re: [openib-general] Re: [IPoIB] executing iperf over IPoIB causes to multicast (IP) packets to be recieved out-of-order
> This is on back to back. I don't think hardware does this - this is > ipoib software thing. > I saw this problem with unicast UD datagrams as well. This was on Fedora C4 kernel 2.6.11-1.1369_FC4smp. I verified that the packets arrived in order just before calling netif_rx_ni() by peeking into the ip and udp layers. After that I tried this on kernel 2.6.16.17 and there were no out-of-order reports. So I guess this was a Linux networking stack problem that was resolved in newer kernels.
[openib-general] RE: [PATCH] ipoib_cleanup_module
> Not sure I follow this. What error could occur? It's more a theoretical problem than a real problem. The ipoib workqueue is still running and could be using debugfs, which does not exist anymore. I know in the current implementation there is no real problem, but it is cleaner to destroy in the opposite order of creation.
[openib-general] [PATCH] ipoib_cleanup_module
Ensure destruction in the reverse order of creation; otherwise errors might occur if debugfs is used. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1216,8 +1216,8 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); - ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); + ipoib_unregister_debugfs(); } module_init(ipoib_init_module);
Re: [openib-general] Re: [PATCH] ipoib_mcast_restart_task
> > The reason for that is probably because I am using a custom kernel compiled > with 'Debug memory allocations' which poisons freed memory. Oops, ignore this. I did not notice the succeeding discussion.
Re: [openib-general] Re: [PATCH] ipoib_mcast_restart_task
On Wednesday 05 April 2006 18:43, Roland Dreier wrote: > Michael> Not sure I read you. It'd still be use after free, won't it? > > It's definitely a bug. But it doesn't explain the specific oops we > saw. In other words, doing: > > kfree(mcast); > dev = mcast->dev; > > shouldn't cause an oops, because mcast is still a valid kernel > pointer, even if the memory it points to might be reused and > corrupted. Following the dev pointer after that snippet might cause > an oops, because it might be overwritten. > The reason for that is probably because I am using a custom kernel compiled with 'Debug memory allocations' which poisons freed memory.
[openib-general] [PATCH] ipoib_flush_paths
ib_sa_cancel_query must be called with priv->lock held since a completion might arrive and set path->query to NULL Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c === --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -342,14 +342,16 @@ void ipoib_flush_paths(struct net_device list_for_each_entry(path, &remove_list, list) rb_erase(&path->rb_node, &priv->path_tree); - spin_unlock_irqrestore(&priv->lock, flags); list_for_each_entry_safe(path, tp, &remove_list, list) { if (path->query) ib_sa_cancel_query(path->query_id, path->query); + spin_unlock_irqrestore(&priv->lock, flags); wait_for_completion(&path->done); path_free(dev, path); + spin_lock_irqsave(&priv->lock, flags); } + spin_unlock_irqrestore(&priv->lock, flags); } static void path_rec_completion(int status,
[openib-general] [PATCH] ipoib_mcast_restart_task
ipoib_mcast_restart_task might free an mcast object while a join request is still outstanding, leading to an oops when the query completes. Fix this by waiting for query to complete, similar to what ipoib_stop_thread is doing. The wait for mcast completion code is consolidated in wait_join_complete(). Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]> Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c === --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -617,6 +617,22 @@ int ipoib_mcast_start_thread(struct net_ return 0; } +static void wait_join_complete(struct ipoib_dev_priv *priv, + struct ipoib_mcast *mcast) +{ + spin_lock_irq(&priv->lock); + if (mcast && mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + spin_unlock_irq(&priv->lock); + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + else + spin_unlock_irq(&priv->lock); +} + int ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -636,28 +652,10 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); - spin_lock_irq(&priv->lock); - if (priv->broadcast && priv->broadcast->query) { - ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); - priv->broadcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for bcast\n"); - wait_for_completion(&priv->broadcast->done); - } else - spin_unlock_irq(&priv->lock); + wait_join_complete(priv, priv->broadcast); - list_for_each_entry(mcast, &priv->multicast_list, list) { - spin_lock_irq(&priv->lock); - if (mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " 
IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } else - spin_unlock_irq(&priv->lock); - } + list_for_each_entry(mcast, &priv->multicast_list, list) + wait_join_complete(priv, mcast); return 0; } @@ -910,6 +908,7 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + wait_join_complete(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); }
[openib-general] Re: [PATCH] ipoib_mcast_restart_task
> Yes, looks like there might be problem here. However, is there any > way to consolidate the "cancel and wait for done" code in one place, > rather than just cut-and-pasting it from ipoib_stop_thread()? An appropriate patch will follow. > This could explain the oops in ipoib_mcast_sendonly_join_complete(), > but only if a send-only group is being replaced by a full-member > join. Is Eli's test doing that? No, not deliberately, but it did not happen again after a full night of runs. I will keep running the tests for a while and track this.
[openib-general] Re: [PATCH] mthca - query qp
On Thu, 2006-03-16 at 14:38 -0800, Roland Dreier wrote: > I need to think about how to handle the other attributes. Probably > anything that just copies values from the struct ib_qp passed in to > the function should at least be in the generic ib_query_qp function, > since there's no good reason for every low-level driver to duplicate it. > Probably this is how it should be done > However I'm wondering whether we even have the right interface for > query QP at all. The only thing that the consumer couldn't have saved > off for themselves is the QP state, so maybe the right interface is > just an ib_query_qp_state() function that gives back the QP's current state. > I think we should pass to the consumer all the info in a centralized way through a standard query_qp verb.
Re: [openib-general] [PATCH] mthca - query qp
On Tue, 2006-03-14 at 08:08 -0800, Roland Dreier wrote: > Sorry, I didn't like all of it (it seems silly to copy values from the > ibqp struct, which the user passed in, into the output struct), and > then I didn't get around to applying the parts I did like. But the alternative is that the verb consumer will have to gather all information from three different objects while in some places the fields are defined but do not contain valid values. Would it not be simpler to pack all the info in qp_attr and qp_init_attr? After all it is not data path.
Re: [openib-general] Re: ipoib error handling question
On Tue, 2006-03-14 at 08:03 -0800, Roland Dreier wrote: > Also I don't see much chance of having the skb queue be full but not > having a path record query pending. > if (path->ah) { kref_get(&path->ah->ref); neigh->ah = path->ah; ipoib_send(dev, skb, path->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); } else { neigh->ah = NULL; if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) __skb_queue_tail(&neigh->queue, skb); else goto err_list; if (!path->query && path_rec_start(dev, path)) goto err_list; } The if on the skb queue len is not necessary since this queue is on a newly created neighbor and is obviously empty. Then the path query would be called if path->query is null.
RE: [openib-general] [PATCH] mthca - query qp
Hi Roland, What about this patch? Are you going to apply it? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Eli Cohen Sent: Wednesday, March 08, 2006 4:32 PM To: Roland Dreier Cc: openib-general@openib.org Subject: [openib-general] [PATCH] mthca - query qp 1. Fix error flow - if status is not zero exit with error 2. Few more values extracted Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c === --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c @@ -427,6 +427,12 @@ int mthca_query_qp(struct ib_qp *ibqp, s if (err) goto out; + if (status) { + mthca_warn(dev, "QUERY_QP returned status %02x\n", status); + err = -EINVAL; + goto out; + } + qp_param= mailbox->buf; context = &qp_param->context; mthca_state = be32_to_cpu(context->flags) >> 28; @@ -454,6 +460,9 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->timeout = context->pri_path.ackto >> 3; + qp_attr->alt_timeout = context->alt_path.ackto >> 3; + /* qp_attr->en_sqd_async_notify is only applicable in modify qp */ qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; @@ -469,7 +478,17 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; qp_attr->alt_timeout= context->alt_path.ackto >> 3; + + qp_init_attr->event_handler = ibqp->event_handler; + qp_init_attr->qp_context= ibqp->qp_context; + qp_init_attr->send_cq = ibqp->send_cq; + qp_init_attr->recv_cq = ibqp->recv_cq; + qp_init_attr->srq = ibqp->srq; qp_init_attr->cap = qp_attr->cap; + qp_init_attr->sq_sig_type = ((be32_to_cpu(context->params1) >> 3) & 0x1) ? 
+ IB_SIGNAL_REQ_WR : IB_SIGNAL_ALL_WR; + qp_init_attr->qp_type = ibqp->qp_type; + qp_init_attr->port_num = qp_attr->ah_attr.port_num; out: mthca_free_mailbox(dev, mailbox); ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] Re: ipoib error handling question
> > Yes, that looks like an obvious mistake. I checked this in: > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 1633aad..00342d5 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -519,12 +519,10 @@ static void neigh_add_path(struct sk_buf > be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); > } else { > neigh->ah = NULL; > - if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { > + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) > __skb_queue_tail(&neigh->queue, skb); > - } else { > - ++priv->stats.tx_dropped; > - dev_kfree_skb_any(skb); > - } > + else > + goto err_list; > > if (!path->query && path_rec_start(dev, path)) > goto err Doing it this way may result in having an ipoib_path object for which the query is not complete. I think if the queue is full you should do cleanup but still execute if (!path->query && path_rec_start(dev, path)) goto err;
[openib-general] [PATCH] mthca - query srq fixes
1. Fix endianness handling of srq_limit 2. Add support for srq_limit for Tavor Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: last_stable/drivers/infiniband/hw/mthca/mthca_srq.c === --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_srq.c2006-03-12 17:15:36.0 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_srq.c 2006-03-12 18:24:45.0 +0200 @@ -49,7 +49,8 @@ struct mthca_tavor_srq_context { __be32 state_pd; __be32 lkey; __be32 uar; - __be32 wqe_cnt; + __be16 limit_watermark; + __be16 wqe_cnt; u32reserved[2]; }; @@ -369,6 +370,7 @@ int mthca_query_srq(struct ib_srq *ibsrq struct mthca_srq *srq = to_msrq(ibsrq); struct mthca_mailbox *mailbox; struct mthca_arbel_srq_context *arbel_ctx; + struct mthca_tavor_srq_context *tavor_ctx; u8 status; int err; @@ -382,9 +384,11 @@ int mthca_query_srq(struct ib_srq *ibsrq if (mthca_is_memfree(dev)) { arbel_ctx = mailbox->buf; - srq_attr->srq_limit = arbel_ctx->limit_watermark; - } else - srq_attr->srq_limit = 0; + srq_attr->srq_limit = be16_to_cpu(arbel_ctx->limit_watermark); + } else { + tavor_ctx = mailbox->buf; + srq_attr->srq_limit = be16_to_cpu(tavor_ctx->limit_watermark); + } srq_attr->max_wr = (mthca_is_memfree(dev)) ? srq->max - 1 : srq->max; srq_attr->max_sge = srq->max_gs;
[openib-general] [PATCH] mthca - query qp
1. Fix error flow - if status is not zero exit with error 2. Few more values extracted Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c === --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c @@ -427,6 +427,12 @@ int mthca_query_qp(struct ib_qp *ibqp, s if (err) goto out; + if (status) { + mthca_warn(dev, "QUERY_QP returned status %02x\n", status); + err = -EINVAL; + goto out; + } + qp_param= mailbox->buf; context = &qp_param->context; mthca_state = be32_to_cpu(context->flags) >> 28; @@ -454,6 +460,9 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + qp_attr->timeout = context->pri_path.ackto >> 3; + qp_attr->alt_timeout = context->alt_path.ackto >> 3; + /* qp_attr->en_sqd_async_notify is only applicable in modify qp */ qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; @@ -469,7 +478,17 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; qp_attr->alt_timeout= context->alt_path.ackto >> 3; + + qp_init_attr->event_handler = ibqp->event_handler; + qp_init_attr->qp_context= ibqp->qp_context; + qp_init_attr->send_cq = ibqp->send_cq; + qp_init_attr->recv_cq = ibqp->recv_cq; + qp_init_attr->srq = ibqp->srq; qp_init_attr->cap = qp_attr->cap; + qp_init_attr->sq_sig_type = ((be32_to_cpu(context->params1) >> 3) & 0x1) ? + IB_SIGNAL_REQ_WR : IB_SIGNAL_ALL_WR; + qp_init_attr->qp_type = ibqp->qp_type; + qp_init_attr->port_num = qp_attr->ah_attr.port_num; out: mthca_free_mailbox(dev, mailbox);
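The extractions added in the patch are plain shift-and-mask operations on the (already byte-swapped) context words. A standalone illustration, run on hypothetical values rather than real QP context contents:

```c
#include <assert.h>
#include <stdint.h>

/* sq_sig_type: bit 3 of params1 set means per-WR signaling
 * (IB_SIGNAL_REQ_WR in the patch); clear means all WRs are signaled. */
static int signals_all_wrs(uint32_t params1)
{
	return !((params1 >> 3) & 0x1);
}

/* ack timeout: the patch takes ackto >> 3, i.e. the top 5 bits of the
 * ackto byte (what the low 3 bits hold is not shown in the patch). */
static int ack_timeout(uint8_t ackto)
{
	return ackto >> 3;
}

/* QP state: top 4 bits of the context flags word, as in
 * mthca_state = be32_to_cpu(context->flags) >> 28. */
static int qp_state(uint32_t flags)
{
	return flags >> 28;
}
```

The remaining qp_init_attr fields (send_cq, recv_cq, srq, qp_type, ...) are straight copies out of struct ib_qp, which is exactly the duplication Roland questions in the follow-up discussion.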
Re: [openib-general] [PATCH] mthca - read byte count & read request size
On Mon, 2006-03-06 at 19:25 -0800, Grant Grundler wrote: > Why is this an enum? > > > +static int pcix_max_rbc = PCIX_MAX_RBC_INVALID; > > It's declared an int and is "user visible". > I think the user interface would be better served > if the user could just specify "pcix_max_rbc=2048" > instead of some magic value. Using an enum confines the set of valid values the user can specify to the range 0-3 and makes it easier to filter out invalid values. If I used raw values for the parameters, then questions like how to interpret a value of 1000 would arise: as an invalid value, so the parameter is ignored? Or maybe I would configure the rbc to 512? Or 1024?
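The enum-based scheme makes validation a simple range check, with the byte count derived as 512 << value under the PCI-X encoding the patch uses. A sketch (the helper name is made up for illustration):

```c
#include <assert.h>

enum {
	PCIX_MAX_RBC_512,
	PCIX_MAX_RBC_1024,
	PCIX_MAX_RBC_2048,
	PCIX_MAX_RBC_4096,
	PCIX_MAX_RBC_INVALID
};

/* Map the module parameter to a byte count; out-of-range values are
 * rejected outright rather than rounded, which is the filtering Eli
 * describes above. */
static int pcix_rbc_bytes(int val)
{
	if (val < PCIX_MAX_RBC_512 || val >= PCIX_MAX_RBC_INVALID)
		return -1;	/* invalid: ignore the parameter */
	return 512 << val;
}
```

With a raw byte-count parameter, the driver would instead have to decide how to round or reject values like 1000, which is the ambiguity the enum avoids.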
[openib-general] [PATCH] mthca - read byte count & read request size
Add control of PCIX max read byte count and PCI Express max read request size parameters. This is due to the fact that some chipsets may not work or may not work optimally with the default parameters. Logic as follows: If the user sets a valid value to a parameter this value wins. Otherwise the value read from the device is compared to the default value defined by the spec. If the value read does not equal the default it is assumed the BIOS configured it and the BIOS value is retained. Otherwise the former mthca configured defaults of 4K are set. To summarize, it assumes that if the BIOS configures a value it knows best and we keep this value. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c @@ -69,6 +69,41 @@ MODULE_PARM_DESC(msi, "attempt to use MS #endif /* CONFIG_PCI_MSI */ +/* the PCIX max read byte count and PCI Express max read request size + parameters have default values as defined by the PRM which + should give best performance in most cases. 
However there are some + chipsets that work better with other values or that do not work at + all with the default values */ +enum { + PCIX_MAX_RBC_512, + PCIX_MAX_RBC_1024, + PCIX_MAX_RBC_2048, + PCIX_MAX_RBC_4096, + PCIX_MAX_RBC_INVALID +}; + +#define CAP_PRMS_DESC " - some chipsets do not work optimally or do not work at " \ + "all with the max value and this parameter can be used to tune " \ + "the optimal value" + +static int pcix_max_rbc = PCIX_MAX_RBC_INVALID; +module_param(pcix_max_rbc, int, 0444); +MODULE_PARM_DESC(pcix_max_rbc, "PCIX max read byte count" CAP_PRMS_DESC); + +enum { + PCIE_MAX_RRS_128, + PCIE_MAX_RRS_256, + PCIE_MAX_RRS_512, + PCIE_MAX_RRS_1024, + PCIE_MAX_RRS_2048, + PCIE_MAX_RRS_4096, + PCIE_MAX_RRS_INVALID +}; + +static int pcie_max_rrs = PCIE_MAX_RRS_INVALID; +module_param(pcie_max_rrs, int, 0444); +MODULE_PARM_DESC(pcie_max_rrs, "PCI Express max read request size" CAP_PRMS_DESC); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -89,8 +124,9 @@ static int __devinit mthca_tune_pci(stru { int cap; u16 val; + int orig_val, set_val; - /* First try to max out Read Byte Count */ + /* First determine Max Read Byte Count */ cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); if (cap) { if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { @@ -98,12 +134,25 @@ static int __devinit mthca_tune_pci(stru "aborting.\n"); return -ENODEV; } - val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + + if (pcix_max_rbc < PCIX_MAX_RBC_INVALID) +set_val = pcix_max_rbc; + else { + orig_val = (val & PCI_X_CMD_MAX_READ) >> 2; + if (orig_val != PCIX_MAX_RBC_512) + /* retain value set by the bios */ + set_val = orig_val; + else + set_val = PCIX_MAX_RBC_4096; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (set_val << 2); if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { mthca_err(mdev, "Couldn't write PCI-X command register, " "aborting.\n"); return -ENODEV; } 
+ mthca_dbg(mdev, "setting read byte count to %d bytes\n", + 512 << set_val); } else if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE)) mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); @@ -114,12 +163,26 @@ static int __devinit mthca_tune_pci(stru "register, aborting.\n"); return -ENODEV; } - val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + + if (pcie_max_rrs < PCIE_MAX_RRS_INVALID) + set_val = pcie_max_rrs; + else { + orig_val = (val & PCI_EXP_DEVCTL_READRQ) >> 12; + if (orig_val != PCIE_MAX_RRS_512) + /* retain value set by the bios */ + set_val =
[openib-general] [PATCH] mthca - optimize sinai large message
Memory key generation is modified to optimize large-message transfers on Sinai. The performance gain takes effect for messages > 80 KByte on Sinai DDR. The enhancement works when the HCA memory key table is configured with up to and including 2^23 entries; for larger tables the enhancement is off. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Signed-off-by: Michael Tsirkin <[EMAIL PROTECTED]> Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_profile.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c @@ -153,8 +153,8 @@ u64 mthca_make_profile(struct mthca_dev "won't fit in 0x%llx bytes of context memory.\n", (unsigned long long) total_size, (unsigned long long) mem_avail); - kfree(profile); - return -ENOMEM; + total_size = -ENOMEM; + goto exit; } if (profile[i].size) @@ -260,6 +260,12 @@ u64 mthca_make_profile(struct mthca_dev */ dev->limits.num_pds = MTHCA_NUM_PDS; + if ((dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) && init_hca->log_mpt_sz > 23) { + mthca_warn(dev, "MPT table too large - disabling mkey optimization " + "for Sinai\n"); + dev->mthca_flags &= ~MTHCA_FLAG_SINAI_OPT; + } + /* * For Tavor, FMRs use ioremapped PCI memory. 
For 32 bit * systems it may use too much vmalloc space to map all MTT @@ -272,6 +278,7 @@ u64 mthca_make_profile(struct mthca_dev else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; +exit: kfree(profile); return total_size; } Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c @@ -935,13 +935,16 @@ enum { static struct { u64 latest_fw; - int is_memfree; - int is_pcie; + u32 flags; } mthca_hca_table[] = { - [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, .is_pcie = 0 }, - [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, .is_pcie = 1 }, - [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, .is_pcie = 1 }, - [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1 } + [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), + .flags = 0 }, + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), + .flags = MTHCA_FLAG_PCIE }, + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), + .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE }, + [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), + .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE | MTHCA_FLAG_SINAI_OPT } }; static int __devinit mthca_init_one(struct pci_dev *pdev, @@ -1031,12 +1034,9 @@ static int __devinit mthca_init_one(stru mdev->pdev = pdev; + mdev->mthca_flags = mthca_hca_table[id->driver_data].flags; if (ddr_hidden) mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; - if (mthca_hca_table[id->driver_data].is_memfree) - mdev->mthca_flags |= MTHCA_FLAG_MEMFREE; - if (mthca_hca_table[id->driver_data].is_pcie) - mdev->mthca_flags |= MTHCA_FLAG_PCIE; /* * Now reset the HCA before we touch the PCI capabilities or Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -71,7 
+71,8 @@ enum { MTHCA_FLAG_NO_LAM = 1 << 5, MTHCA_FLAG_FMR= 1 << 6, MTHCA_FLAG_MEMFREE= 1 << 7, - MTHCA_FLAG_PCIE = 1 << 8 + MTHCA_FLAG_PCIE = 1 << 8, + MTHCA_FLAG_SINAI_OPT = 1 << 9 }; enum { Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c @@ -76,6 +76,8 @@ struct mthca_mpt_entry { #define MTHCA_MPT_STATUS_SW 0xF0 #define MTHCA_MPT_STATUS_HW 0x00 +#define SINAI_FMR_KEY_INC 0x100 + /* * Buddy allocator for MTT segments (currently not very efficient * since it doesn't keep a free list
Re: [openib-general] [PATCH] mthca - optimize sinai large message
On Tue, 2006-02-28 at 13:15 +0200, Or Gerlitz wrote: > Can you elaborate a little what are the implications of not using this > optimization and what IB transport/size/type of messages (or better > which ULPs) are affected? The improvement will be noticeable for protocol messages larger than ~80 KByte: RDMA Write/Read and Send. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH] mthca - optimize sinai large message
Memory key generation is modified to optimize large-message transfers on Sinai. This implementation restricts the MPT table size for Sinai to a maximum of 2^23 entries. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_profile.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c @@ -153,8 +153,8 @@ u64 mthca_make_profile(struct mthca_dev "won't fit in 0x%llx bytes of context memory.\n", (unsigned long long) total_size, (unsigned long long) mem_avail); - kfree(profile); - return -ENOMEM; + total_size = -ENOMEM; + goto exit; } if (profile[i].size) @@ -260,6 +260,13 @@ u64 mthca_make_profile(struct mthca_dev */ dev->limits.num_pds = MTHCA_NUM_PDS; + /* for Sinai the MPT table must be smaller than 2^24 for optimized operation */ + if ((dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) && init_hca->log_mpt_sz > 23) { + total_size = -ENOSYS; + mthca_err(dev, "MPT table too large\n"); + goto exit; + } + /* * For Tavor, FMRs use ioremapped PCI memory. 
For 32 bit * systems it may use too much vmalloc space to map all MTT @@ -272,6 +279,7 @@ u64 mthca_make_profile(struct mthca_dev else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; +exit: kfree(profile); return total_size; } Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c @@ -937,11 +937,12 @@ static struct { u64 latest_fw; int is_memfree; int is_pcie; + int mkey_opt; } mthca_hca_table[] = { - [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, .is_pcie = 0 }, - [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, .is_pcie = 1 }, - [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, .is_pcie = 1 }, - [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1 } + [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, .is_pcie = 0, .mkey_opt = 0 }, + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, .is_pcie = 1, .mkey_opt = 0 }, + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, .is_pcie = 1, .mkey_opt = 0 }, + [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1, .mkey_opt = 1 } }; static int __devinit mthca_init_one(struct pci_dev *pdev, @@ -1037,6 +1038,9 @@ static int __devinit mthca_init_one(stru mdev->mthca_flags |= MTHCA_FLAG_MEMFREE; if (mthca_hca_table[id->driver_data].is_pcie) mdev->mthca_flags |= MTHCA_FLAG_PCIE; + if (mthca_hca_table[id->driver_data].mkey_opt) + mdev->mthca_flags |= MTHCA_FLAG_SINAI_OPT; + /* * Now reset the HCA before we touch the PCI capabilities or Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -71,7 +71,8 @@ enum { MTHCA_FLAG_NO_LAM = 1 << 5, MTHCA_FLAG_FMR= 1 << 6, MTHCA_FLAG_MEMFREE= 1 << 
7, - MTHCA_FLAG_PCIE = 1 << 8 + MTHCA_FLAG_PCIE = 1 << 8, + MTHCA_FLAG_SINAI_OPT = 1 << 9 }; enum { Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c @@ -76,6 +76,8 @@ struct mthca_mpt_entry { #define MTHCA_MPT_STATUS_SW 0xF0 #define MTHCA_MPT_STATUS_HW 0x00 +#define SINAI_FMR_KEY_INC 0x100 + /* * Buddy allocator for MTT segments (currently not very efficient * since it doesn't keep a free list and just searches linearly @@ -330,6 +332,14 @@ static inline u32 key_to_hw_index(struct return tavor_key_to_hw_index(key); } +static inline u32 adjust_key(struct mthca_dev *dev, u32 key) +{ + if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + return ((key << 20) & 0x80) | (k
[openib-general] [PATCH] mthca - command interface
This patch checks whether the HCA supports posting commands through doorbells and, if it does, configures the driver accordingly. This functionality can be controlled by the value of a read/write parameter, post_cmd2dbell. When it is 0, commands are posted through the HCR; otherwise, if the HCA is capable, commands go through the doorbell. The value of the parameter applies collectively to all available HCAs. Using UAR0 to post commands eliminates the need to poll the go bit before posting a new command. Since reading from a PCI device is much more expensive than issuing a posted write, this is expected to provide better CPU utilization. We are currently developing such tests and once we have results I will post them to this list. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: source/drivers/infiniband/hw/mthca/mthca_dev.h === --- source.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ source/drivers/infiniband/hw/mthca/mthca_dev.h @@ -117,9 +117,18 @@ enum { MTHCA_OPCODE_INVALID= 0xff }; +enum { + MTHCA_CMD_USE_EVENTS = (1 << 0), + MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1), + MTHCA_CMD_POST_DOORBELLS = (1 << 2) +}; + +enum { + MTHCA_CMD_NUM_DBELL_DWORDS = 8 +}; + struct mthca_cmd { struct pci_pool *pool; - int use_events; struct mutex hcr_mutex; struct semaphore poll_sem; struct semaphore event_sem; @@ -128,6 +137,10 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + u32 flags; + void __iomem *dbell_map; + u64 dbell_base; + u16 dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS]; }; struct mthca_limits { Index: source/drivers/infiniband/hw/mthca/mthca_cmd.c === --- source.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ source/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -182,25 +182,68 @@ struct mthca_cmd_context { u8status; }; + +static int post_cmd2dbell = 1; +module_param(post_cmd2dbell, int, 0666); +MODULE_PARM_DESC(post_cmd2dbell, "post commands through doorbell"); + static inline int go_bit(struct mthca_dev *dev) 
{ return readl(dev->hcr + HCR_STATUS_OFFSET) & swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) + +static void mthca_cmd_post_dbell(struct mthca_dev *dev, +u64 in_param, +u64 out_param, +u32 in_modifier, +u8 op_modifier, +u16 op, +u16 token) { - int err = 0; + void __iomem *ptr = dev->cmd.dbell_map; + u16 *offs = dev->cmd.dbell_offsets; - mutex_lock(&dev->cmd.hcr_mutex); + __raw_writel((__force u32) cpu_to_be32(in_param >> 32), +ptr + offs[0]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_param & 0xul), +ptr + offs[1]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_modifier), +ptr + offs[2]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param >> 32), +ptr + offs[3]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param & 0xul), +ptr + offs[4]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(token << 16), +ptr + offs[5]); + wmb(); + __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), +ptr + offs[6]); + wmb(); + __raw_writel((__force u32) 0, +ptr + offs[7]); + wmb(); +} + +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int e
Re: [openib-general] Re: PATCH] mthca - command interface - revised
It looks like we all agree that it would be better to post commands through doorbells and save the polling of the go bit. Then I can change the patch so that there will be an r/w module parameter to mthca, ON by default, to enable this feature, and the user can turn this functionality on/off on the fly by changing the value of this parameter. If we agree on this I'll post a new patch early next week. Eli
[openib-general] PATCH] mthca - command interface - revised
Roland, this patch is modified according to your comments. It also adds a kernel configuration option which selects whether to use posting commands through doorbells. The option is off by default. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -117,9 +117,18 @@ enum { MTHCA_OPCODE_INVALID= 0xff }; +enum { + MTHCA_CMD_USE_EVENTS = (1 << 0), + MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1), + MTHCA_CMD_POST_DOORBELLS = (1 << 2) +}; + +enum { + MTHCA_CMD_NUM_DBELL_DWORDS = 8 +}; + struct mthca_cmd { struct pci_pool *pool; - int use_events; struct mutex hcr_mutex; struct semaphore poll_sem; struct semaphore event_sem; @@ -128,6 +137,10 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + u32 flags; + void __iomem *dbell_map; + u64 dbell_base; + u16 dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS]; }; struct mthca_limits { Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c === --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -188,19 +188,57 @@ static inline int go_bit(struct mthca_de swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) + +static void mthca_cmd_post_dbell(struct mthca_dev *dev, +u64 in_param, +u64 out_param, +u32 in_modifier, +u8 op_modifier, +u16 op, +u16 token) { - int err = 0; + void __iomem *ptr = dev->cmd.dbell_map; + u16 *offs = dev->cmd.dbell_offsets; + + __raw_writel((__force u32) cpu_to_be32(in_param >> 32), +ptr + offs[0]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_param & 0xul), +ptr + offs[1]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_modifier), +ptr + offs[2]); + wmb(); + __raw_writel((__force 
u32) cpu_to_be32(out_param >> 32), +ptr + offs[3]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param & 0xul), +ptr + offs[4]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(token << 16), +ptr + offs[5]); + wmb(); + __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), +ptr + offs[6]); + wmb(); + __raw_writel((__force u32) 0, +ptr + offs[7]); + wmb(); +} - mutex_lock(&dev->cmd.hcr_mutex); +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -210,10 +248,8 @@ static int mthca_cmd_post(struct mthca_d } } - if (go_bit(dev)) { - err = -EAGAIN; - goto out; - } + if (go_bit(dev)) + return -EAGAIN; /* * We use writel (instead of something like memcpy_toio) @@ -236,7 +272,29 @@ static int mthca_cmd_post(struct mthca_d (op_modifier << HCR_OPMOD_SHIFT) | op), dev->hcr + 6 * 4); -out: + return 0; +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, +
Re: [openib-general] [PATCH] mthca - command interface
On Mon, 2006-02-13 at 22:30 -0800, Roland Dreier wrote: > Or is the command doorbell stuff inside the UAR, which we already > request in mthca_request_regions()? Yes. These are offsets into UAR0 so there is no need to request a region.
[openib-general] [PATCH] mthca - command interface
This patch implements posting commands by issuing posted writes in a manner similar to doorbells. The benefit of this method is that it does not require polling the go bit before posting the command and thus can lower CPU utilization. Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]> Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h === --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -128,6 +128,13 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + /* post commands like doorbells instead of hcr */ + int can_post_doorbells; + int dbell_post; + void __iomem *dbell_map; + void __iomem *dbell_ptrs[8]; + u64 dbell_base; + u16 dbell_offsets[8]; }; struct mthca_limits { Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c === --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -188,19 +188,40 @@ static inline int go_bit(struct mthca_de swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) -{ - int err = 0; - mutex_lock(&dev->cmd.hcr_mutex); +static void mthca_cmd_post_dbell(struct mthca_dev *dev, +u64 in_param, +u64 out_param, +u32 in_modifier, +u8 op_modifier, +u16 op, +u16 token) +{ + void __iomem **ptr = dev->cmd.dbell_ptrs; + + writel((__force u32) cpu_to_be32(in_param >> 32), ptr[0]); + writel((__force u32) cpu_to_be32(in_param & 0xul), ptr[1]); + writel((__force u32) cpu_to_be32(in_modifier), ptr[2]); + writel((__force u32) cpu_to_be32(out_param >> 32), ptr[3]); + writel((__force u32) cpu_to_be32(out_param & 0xul), ptr[4]); + writel((__force u32) cpu_to_be32(token << 16), ptr[5]); + writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), 
ptr[6]); + writel((__force u32) 0, ptr[7]); +} + +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -210,10 +231,8 @@ static int mthca_cmd_post(struct mthca_d } } - if (go_bit(dev)) { - err = -EAGAIN; - goto out; - } + if (go_bit(dev)) + return -EAGAIN; /* * We use writel (instead of something like memcpy_toio) @@ -236,7 +255,29 @@ static int mthca_cmd_post(struct mthca_d (op_modifier << HCR_OPMOD_SHIFT) | op), dev->hcr + 6 * 4); -out: + return 0; +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + mutex_lock(&dev->cmd.hcr_mutex); + + if (dev->cmd.dbell_post && event) + mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier, + op_modifier, op, token); + else + err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier, +op_modifie
[openib-general] [PATCH] mthca: query_qp and query_srq
Signed-off-by: Eli Cohen <[EMAIL PROTECTED]> Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]> implement ib_query_qp() and ib_query_srq() for mthca. Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h === --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -303,6 +303,8 @@ int mthca_RESIZE_CQ(struct mthca_dev *de u8 *status); int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); +int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num, + struct mthca_mailbox *mailbox, u8 *status); int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); Index: openib_gen2/drivers/infiniband/core/verbs.c === --- openib_gen2.orig/drivers/infiniband/core/verbs.c +++ openib_gen2/drivers/infiniband/core/verbs.c @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) { - return qp->device->query_qp ? + int err; + + err = qp->device->query_qp ? 
qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : -ENOSYS; + if (err) + return err; + qp_init_attr->recv_cq = qp->recv_cq; + qp_init_attr->send_cq = qp->send_cq; + qp_init_attr->srq = qp->srq; + + return err; } EXPORT_SYMBOL(ib_query_qp); Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c === --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_srq.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c @@ -357,6 +357,40 @@ int mthca_modify_srq(struct ib_srq *ibsr return 0; } + +int mthca_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + struct mthca_mailbox *mailbox; + struct mthca_arbel_srq_context *arbel_ctx; + u8 status; + int err; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mthca_QUERY_SRQ(dev, srq->srqn, mailbox, &status); + if (err) + goto exit; + if (mthca_is_memfree(dev)) { + arbel_ctx = mailbox->buf; + srq_attr->srq_limit = arbel_ctx->limit_watermark; + } + else + srq_attr->srq_limit = 0; + + srq_attr->pd = ibsrq->pd; + srq_attr->max_wr = srq->max; + srq_attr->max_sge = srq->max_gs; + +exit: + mthca_free_mailbox(dev, mailbox); + + return err; +} + void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type) { Index: openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h === --- openib_gen2.orig/drivers/infiniband/include/rdma/ib_verbs.h +++ openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h @@ -423,9 +423,10 @@ enum ib_srq_attr_mask { }; struct ib_srq_attr { - u32 max_wr; - u32 max_sge; - u32 srq_limit; + u32max_wr; + u32max_sge; + u32srq_limit; + struct ib_pd *pd; }; struct ib_srq_init_attr { Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c === --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1270,7 +1270,10 @@ int 
mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_DETACH_MCAST)| (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP)| + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ); + dev->ib_dev.node_type= RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt= dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -1291,7 +1294,8 @@ int mthca_register_device(struct mthca_d if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; -
RE: [openib-general] DHCP over Infiniband
The client is Etherboot's client for configuring a client at boot time. The server is ISC. Can you explain what you mean by "the QPN issue"? -Original Message- From: Hal Rosenstock [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 03, 2005 6:07 PM To: Eli Cohen Cc: 'openib-general@openib.org' Subject: Re: [openib-general] DHCP over Infiniband On Thu, 2005-11-03 at 11:01, Eli Cohen wrote: > Hi, > has anyone had the chance to run a DHCP server on an Infiniband > interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do > not get responses from the server to DHCP requests. When running > tcpdump on ib0 interface I can see the requests but the server does > not respond. The server's version is isc-dhcpd-V3.0.3. I also tried > version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with > Mellanox's IBGD1.8 and the server responds to requests. I still had a > problem that the server does not set the client identifier option in > its responses although the client does set this option. If you have > any experience with this please let me know. What DHCP server and what client are you using ? This has been done with the ISC ones. It requires modifications due to the difference in hardware addresses (and there is the QPN issue). -- Hal
[openib-general] DHCP over Infiniband
Hi, has anyone had the chance to run a DHCP server on an Infiniband interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do not get responses from the server to DHCP requests. When running tcpdump on ib0 interface I can see the requests but the server does not respond. The server's version is isc-dhcpd-V3.0.3. I also tried version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with Mellanox's IBGD1.8 and the server responds to requests. I still had a problem that the server does not set the client identifier option in its responses although the client does set this option. If you have any experience with this please let me know. Thanks Eli
[openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster
David, IBGD 1.7 does not support kernel 2.6.11 so I assume you have made changes to IBGD to make it compile. In the files you sent I can't see a call to ptrace with PTRACE_PEEKTEXT but I can see a call to PTRACE_PEEKDATA. Note that in the IBGD stack, registered buffers are not inherited by a child process when the parent forks. This is accomplished by setting the VM_DONTCOPY flag on the vma. This is done to retain the virtual to physical translation of a page at the parent by disabling COW on the pages. So the child may not even have these buffers in its address space, and this could be the reason why ptrace fails. Note also that IBGD 1.8 is the latest release and it does support kernel 2.6.11 so you may consider using it, though the description above holds also for IBGD 1.8 Eli -Original Message- From: David Lecomber [mailto:[EMAIL PROTECTED]] Sent: Monday, October 10, 2005 11:23 AM To: openib-general@openib.org Subject: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Dear all, I'm having a kernel problem which I believe to be caused by the infiniband drivers on the system I am using. Kernel 2.6.11, Mellanox software stack IBGD 1.7.0. Essentially, once an MPI code is started, the kernel refuses to allow ptrace() access to the text segment (ie. where the program instructions lie), although it is possible to access the data segment. This means debugging is impossible (gdb, idb, ddt, etc.). The attached code demonstrates the problem. Untar, and then make. Run the 'mpi' program, and pick a line of its output, paste into another shell. On the standard, non-MPI test code, the ptrace reads are all successful. On the MPI test, it gives an error for the text segment reads. Is this a known issue - are there any upgrades/fixes which should have been applied? I would appreciate it if someone could run the test suggested on a really new setup, and see if the error happens. 
Regards,
David
--
David Lecomber, CTO, Allinea Software
tel: +44 1926 623231  fax: +44 1926 623232
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] RE: registering read-only memory
Title: RE: registering read-only memory

OK, thanks. I had another problem in MVAPI.

-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 27, 2005 7:49 PM
To: Eli Cohen
Cc: Michael S. Tsirkin; openib-general@openib.org
Subject: Re: registering read-only memory

Eli> I saw this on SuSE 9.3 and the memory was like:
Eli> static const char *my_string = "Hello world";
Eli> defined locally in a function. This produced a VMA with VM_IO set.

What kernel version/architecture? I'm having a hard time seeing how it's possible to get a VMA with VM_IO set except by using remap_pfn_range(), and I can't see how ordinary const data from userspace could cause that.

- R.
[openib-general] RE: registering read-only memory
Title: RE: registering read-only memory

R> Are you sure? It seems to work for me if I just pass write=0 to
R> get_user_pages().

I saw this on SuSE 9.3 and the memory was like:

static const char *my_string = "Hello world";

defined locally in a function. This produced a VMA with VM_IO set.
[openib-general] RE: registering read-only memory
Title: RE: registering read-only memory

I'm afraid that won't solve the problem. For some reason the vma of such a region is flagged with VM_IO, and that is why get_user_pages() will fail. We need to find another way to get the physical address of such memory. One thing I can think of is implementing something similar to follow_page(). Or better, the kernel could export follow_page().

-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 26, 2005 10:09 PM
To: Eli Cohen
Cc: Michael S. Tsirkin; openib-general@openib.org
Subject: Re: registering read-only memory

OK, I think I understand the problem. The compiler puts a string defined with "const char *foo" on a read-only page, but in uverbs_mem.c we always pass write=1 to get_user_pages(). I need to fix the code so that it only asks for writeable pages if we're registering with writable permissions.

- R.
[openib-general] RE: registering read-only memory
Title: RE: registering read-only memory

Please try:

const char *foo = "Michael Tsirkin";

as opposed to:

const char foo[] = "Michael Tsirkin";

-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 24, 2005 10:07 PM
To: Michael S. Tsirkin
Cc: openib-general@openib.org; Eli Cohen
Subject: Re: registering read-only memory

> const char foo[]="Michael Tsirkin";
> ibv_reg_mr(ctx->pd, foo, strlen(foo), 0);

This seemed to work fine on my system (I got a non-NULL ibv_mr back with valid-looking L_Key/R_Key).

- R.
[openib-general] [PATCH] cache.c fixes
Title: [openib-general] [PATCH] cache.c fixes

The allocation size looks buggy, and so does the cleanup path. The port range is inclusive, so the per-port arrays need end_port() - start_port() + 1 entries.

Index: cache.c
===
--- cache.c	(revision 1292)
+++ cache.c	(working copy)
@@ -249,10 +249,10 @@

 	device->cache.pkey_cache = kmalloc(sizeof *device->cache.pkey_cache *
-					   end_port(device) - start_port(device), GFP_KERNEL);
+					   (end_port(device) - start_port(device) + 1), GFP_KERNEL);
 	device->cache.gid_cache = kmalloc(sizeof *device->cache.pkey_cache *
-					  end_port(device) - start_port(device), GFP_KERNEL);
+					  (end_port(device) - start_port(device) + 1), GFP_KERNEL);

 	if (!device->cache.pkey_cache || !device->cache.gid_cache) {
 		printk(KERN_WARNING "Couldn't allocate cache "
@@ -280,8 +280,14 @@
 	}

 err:
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
+	if (device->cache.pkey_cache) {
+		kfree(device->cache.pkey_cache);
+		device->cache.pkey_cache = NULL;
+	}
+	if (device->cache.gid_cache) {
+		kfree(device->cache.gid_cache);
+		device->cache.gid_cache = NULL;
+	}
 }

 void ib_cache_cleanup_one(struct ib_device *device)
@@ -296,8 +302,12 @@
 		kfree(device->cache.gid_cache[p]);
 	}

-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
+	if (device->cache.pkey_cache) {
+		kfree(device->cache.pkey_cache);
+	}
+	if (device->cache.gid_cache) {
+		kfree(device->cache.gid_cache);
+	}
 }

 struct ib_client cache_client = {