Re: [openib-general] IPoIB odd loopback packet from arp

2006-10-24 Thread Eli Cohen
Todd,
This does not look like an error. The first arp is a broadcast
(qpn=ff) so it is also received at the sending interface, where it is
dropped as a loopback packet. The second one is a unicast arp (qpn=0x000404),
so it is not received back at the local interface.
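
For reference, here is a rough sketch of the check the IPoIB receive path
applies (paraphrased from the ipoib_ib.c of that era, not a verbatim copy):
a completion whose source LID and source QPN match the local port and the
UD QP is treated as our own transmit echoed back and is freed instead of
being handed to the stack.

	if (wc->slid != priv->local_lid ||
	    wc->src_qp != priv->qp->qp_num) {
		/* not our own transmit echoed back: pass it up the stack */
		netif_rx_ni(skb);
	} else {
		/* our own broadcast came back to us */
		ipoib_dbg_data(priv, "dropping loopback packet\n");
		dev_kfree_skb_any(skb);
	}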


On Mon, 2006-10-23 at 13:48 -0600, Todd Bowman wrote:
> Using the OFED 1.0 and OFED 1.1 stacks I have noticed some rcvswrelay
> errors.  I have tracked them down to the arp request.  I can reproduce
> the problem with the following steps:
> 
> ( I have used both 2.6.14.14 and 2.6.18.1 kernels) 
> 
> ib109> arp -d ib110
> ib109> ping ib110 -c 2
> 
> # ib_ipoib module debug
> 13:15:46 ib109 kernel: ib0: sending packet, length=60 address=f6187200
> qpn=0xff 
> 13:15:46 ib109 kernel: ib0: called: id 34, op 0, status: 0
> 13:15:46 ib109 kernel: ib0: send complete, wrid 34
> 13:15:46 ib109 kernel: ib0: called: id -2147483623, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x0369 
> 13:15:46 ib109 kernel: ib0: dropping loopback packet
> 13:15:46 ib109 kernel: ib0: called: id -2147483622, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
> 13:15:46 ib109 kernel: ib0: sending packet, length=88 address=f6e57520
> qpn=0x000404 
> 13:15:46 ib109 kernel: ib0: called: id 35, op 0, status: 0
> 13:15:46 ib109 kernel: ib0: send complete, wrid 35
> 13:15:46 ib109 kernel: ib0: called: id -2147483621, op 128, status: 0
> 13:15:46 ib109 kernel: ib0: received 128 bytes, SLID 0x016d 
> 13:15:47 ib109 kernel: ib0: sending packet, length=88 address=f6e57520
> qpn=0x000404
> 13:15:47 ib109 kernel: ib0: called: id 36, op 0, status: 0
> 13:15:47 ib109 kernel: ib0: send complete, wrid 36
> 13:15:47 ib109 kernel: ib0: called: id -2147483620, op 128, status: 0 
> 13:15:47 ib109 kernel: ib0: received 128 bytes, SLID 0x016d
> 13:15:51 ib109 kernel: ib0: called: id -2147483619, op 128, status: 0
> 13:15:51 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
> 13:15:51 ib109 kernel: ib0: sending packet, length=60 address=f6e57520
> qpn=0x000404 
> 13:15:51 ib109 kernel: ib0: called: id 37, op 0, status: 0
> 13:15:51 ib109 kernel: ib0: send complete, wrid 37
> 
> # tcpdump -i ib0
> 13:15:46.977578 arp who-has ib110 tell ib109 hardware #32
> 13:15:46.977682 arp reply ib110 is-at
> 00:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:11:59 hardware
> #32 
> 13:15:46.977710 IP ib109 > ib110: icmp 64: echo request seq 0
> 13:15:46.977790 IP ib110 > ib109: icmp 64: echo reply seq 0
> 13:15:47.92 IP ib109 > ib110: icmp 64: echo request seq 1
> 13:15:47.977892 IP ib110 > ib109: icmp 64: echo reply seq 1
> 13:15:51.977076 arp who-has ib109 tell ib110 hardware #32
> 13:15:51.977094 arp reply ib109 is-at
> 00:02:00:14:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:3b:31 hardware
> #32 
> 
> # error dump
> rcvswrelayerrors:1 MT47396 Infiniscale-III 0x2c9010b022090[1]
> <> ib109 HCA-1 0x2c9023b30[1]   
> 
> 1) The ping is successful and the arp table is populated, so is this
> really a problem or a false positive?  
> 2) The second arp does not generate an error (the error dump reports
> all new errors in switches). Why?
> 
> Any ideas?
> 
> Thanks in advance.
> 
> Todd
> 
> ___
> openib-general mailing list
> openib-general@openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] IPOIB NAPI

2006-10-16 Thread Eli Cohen
On Mon, 2006-10-16 at 09:48 -0700, Roland Dreier wrote:
>     Eli> Please diff to see my comments. Generally it looks like the
> Eli> condition on netif_rx_reschedule() should be inverted.
> 
> Why?  A return value of 0 means that the reschedule failed (probably
> because the poll routine is already running somewhere else) and the
> poll routine should just return.  I think the code is correct as it stands.
> 
> Eli> Also you need to set max to some large value since you don't
> Eli> know how many completions you missed and you want to make
> Eli> sure you get all the ones that sneaked in between the last poll
> Eli> and the request notify.
> 
> Why?  max is there to limit us from doing more work than the quota
> passed in from the networking stack.  If we fail to drain the CQ
> because we exhaust max, then the poll routine will return 1 and will
> remain scheduled, so the networking stack will call the poll routine
> again to continue grabbing completions.
> 
>  - R.

OK I see what you mean. So I guess it's OK then.
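
For the record, here is a sketch of the tail of the poll routine with the
condition written the way Roland describes (a paraphrase of the logic under
discussion, not his actual branch; the three-argument
ib_req_notify_cq()/missed_event form is the extension proposed in this
thread, not the stock verb):

	if (empty) {
		netif_rx_complete(dev);
		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
		/*
		 * Non-zero from netif_rx_reschedule() means we got back onto
		 * the poll list ourselves, so keep draining the CQ; zero
		 * means the reschedule failed because the device is already
		 * scheduled elsewhere, so just return and let that poll run.
		 */
		if (missed_event && netif_rx_reschedule(dev, 0))
			goto repoll;

		return 0;
	}

	return 1;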


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] IPOIB NAPI

2006-10-16 Thread Eli Cohen
On Sun, 2006-10-15 at 09:39 -0700, Roland Dreier wrote:
> I've been meaning to mention this... I have a preliminary version in
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git 
> ipoib-napi
> 
> There are further changes I would like to add on top of that, but
> comments on the two patches there would be appreciated.  And also
> benchmarks would be good.

Please diff to see my comments. Generally it looks like the condition on
netif_rx_reschedule() should be inverted. Also you need to set max to
some large value since you don't know how many completions you missed
and you want to make sure you get all the ones that sneaked in between
the last poll and the request notify.

int ipoib_poll(struct net_device *dev, int *budget)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	int max = min(*budget, dev->quota);
	int done;
	int t;
	int empty;
	int missed_event;
	int n, i;

repoll:
	done  = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget    -= done;

	if (empty) {
		netif_rx_complete(dev);
		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
		if (missed_event && !netif_rx_reschedule(dev, 0)) {
			max = 1000;
			goto repoll;
		}

		return 0;
	}

	return 1;
}



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] IPOIB NAPI

2006-10-15 Thread Eli Cohen
Hi Roland,

Can you tell us when you are going to push your NAPI patch for ipoib? Is
there anything I can do to help make this happen?


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] IB/ipoib - possible deadlock in path query and join

2006-10-05 Thread Eli Cohen
When a path query or a join fails immediately, we want to
call complete(&obj->done). This is required to avoid a deadlock
that can occur if there is no further attempt at the operation
(query or join) and a wait_for_completion() is called on obj->done
during a flush operation.
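
To make the deadlock concrete, here is a sketch pieced together from the
hunk below and from the ipoib_flush_paths() code quoted elsewhere in this
archive (a paraphrase for illustration, not an additional patch):

	/* query side (path_rec_start): ib_sa_path_rec_get() failed right
	 * away, so path_rec_completion() will never run and nothing else
	 * signals path->done */
	if (path->query_id < 0) {
		ipoib_warn(priv, "ib_sa_path_rec_get failed\n");
		path->query = NULL;
		complete(&path->done);	/* added by this patch */
		return path->query_id;
	}

	/* flush side (ipoib_flush_paths): without the complete() above,
	 * this wait would block forever */
	if (path->query)
		ib_sa_cancel_query(path->query_id, path->query);
	wait_for_completion(&path->done);
	path_free(dev, path);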

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---

Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c   2006-10-05 
09:09:27.0 +0200
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_main.c2006-10-05 
09:12:00.0 +0200
@@ -504,6 +504,7 @@ static int path_rec_start(struct net_dev
if (path->query_id < 0) {
ipoib_warn(priv, "ib_sa_path_rec_get failed\n");
path->query = NULL;
+   complete(&path->done);
return path->query_id;
}
 
Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c  
2006-10-05 09:09:23.0 +0200
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   2006-10-05 
12:23:25.0 +0200
@@ -371,6 +371,7 @@ static int ipoib_mcast_sendonly_join(str
if (ret < 0) {
ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n",
   ret);
+   complete(&mcast->done);
} else {
ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT
", starting join\n",
@@ -501,6 +502,7 @@ static void ipoib_mcast_join(struct net_
 
if (ret < 0) {
ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", 
ret);
+   complete(&mcast->done);
 
mcast->backoff *= 2;
if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] IB/ipoib: NAPI

2006-09-26 Thread Eli cohen
On Tue, 2006-09-26 at 17:35 -0400, Bernard King-Smith wrote:

> Has anyone run the RR test in Netperf to look at latency? What 1 byte
> RR rates did you see before and after applying the patch. 
> 
netperf TCP_RR:
rate [tran/sec]
---
without NAPI:  39600
with NAPI: 39470

As you can see, there is only a minor difference.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] heads-up - ipoib NAPI

2006-09-26 Thread Eli cohen
On Tue, 2006-09-26 at 21:34 -0700, Shirley Ma wrote:

> NAPI patch moves ipoib poll from hardware interrupt context to softirq
> context. It would reduce the hardware interrupts, reduce hardware
> latency and induce some network latency. It might reduce cpu
> utilization. But I still question about the BW improvement. I did see
> various performance with the same test under the same condition.
> 
When you open just one connection you can see around 10% variation in
the BW measurement. But then you don't utilize all the CPU power you have
and you don't reach the threshold where NAPI becomes effective.
Using multiple connections utilizes all CPUs in the system, increases the
send rate, and increases the chances that the receiver will poll CQEs up
to its quota and be scheduled again without re-enabling interrupts.


> Have you tested this patch with different message sizes, different
> socket sizes? Are these results consistent better?
> 
I used large socket sizes, but I think that with multiple
connections this parameter does not have a significant effect.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] heads-up - ipoib NAPI

2006-09-25 Thread Eli cohen
On Thu, 2006-09-21 at 19:31 -0700, harish wrote:

> How did the CPU utilizations compare for the NAPI vs. no NAPI case?
> What are your thoughts on what bottleneck you are hitting?
> 


The CPU utilization reported by netperf is not accurate since it is not
reported on a per-CPU basis, so I don't have one number that reliably
describes utilization. I think the bottleneck is that the CPU handling
the softirq that processes the rx sk_buffs is 100% utilized, and that
dictates the limit.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] IB/ipoib: NAPI

2006-09-25 Thread Eli cohen
I experimented with your patch and could not find any noticeable 
change in BW or interrupt rate so I guess we can use your ideas.

Still, note that NAPI_HOWTO.txt does not use *budget to limit the
number of polls, and the code below, from kernel 2.6.17.7, takes into
account that budget can become negative.

static void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	unsigned long start_time = jiffies;
	int budget = netdev_budget;
	void *have;

	local_irq_disable();

	while (!list_empty(&queue->poll_list)) {
		struct net_device *dev;

		if (budget <= 0 || jiffies - start_time > 1)
			goto softnet_break;
		...
	}



>I took a stab at implementing this myself, and it
>
> You might want to respin your patch against my for-2.6.19 branch 

Do you think I should work on this, or do you plan to push your code?


>> The biggest problem I have with this is that I don't know what to
call
>> the feature bit.  Any suggestions?

>Maybe set bit for the lack of the feature? REQUIRES_POLL_AFTER_ARM?

We can take Michael's suggestion or use NOT_REQUIRES_POLL_AFTER_ARM so
that we can implement this for mthca without touching ipath or ehca in
the first step.




___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] IB/ipoib: NAPI

2006-09-21 Thread eli
>
> However I don't think we want to use peek CQ always -- I think that
> extra CQ lock/unlock may kill a lot of the performance gain you see
> with NAPI (and I don't think even mthca can do a lockless CQ peek,
> since we need to protect against races with resize CQ, etc).  So
> probably what we need is a feature bit in the struct ib_device to say
> whether the peek CQ is needed or whether req notify will generate
> events for existing CQEs.
>
Sounds good to me

> You might want to respin your patch against my for-2.6.19 branch -- I
> already split up the handle WC routine into separate send and receive
> functions, so the patch will become much smaller.
>

Sure. I can send an updated patch

> Also, the handling of how many completions to poll and the logic of
> when to call netif_rx_complete() seems very strange to me.  First, you
> totally ignore the budget parameter,
Not totally. I update it whenever I decide it is "not done", which
happens in two cases:
a. I finished my quota.
b. I handled any completions - rx or tx. I count tx as well since I
want to coalesce as many completions as possible.

I do not update quota and budget when I exit polling mode because I think
there is no point in doing that (this is unlike the example in
NAPI_HOWTO.txt, but I will check again whether this is the right thing to do).


>
> My poll routine ended up looking like the following, which I think is
> more correct:
>
> +int ipoib_poll(struct net_device *dev, int *budget)
> +{
> + struct ipoib_dev_priv *priv = netdev_priv(dev);
> + int max = min(*budget, dev->quota);
> + int done = 0;
> + int t;
> + int empty = 0;
> + int n, i;
> +
> + while (max) {
> + t = min(IPOIB_NUM_WC, max);
> + n = ib_poll_cq(priv->cq, t, priv->ibwc);
> +
> + for (i = 0; i < n; ++i) {
> + if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
> + ++done;
> + --max;
> + ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
> + } else
> + ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
> + }
> +
> + if (n != t) {
> + empty = 1;
> + break;
I don't think this is the right thing to do. Polling fewer completions than
you could is no reason to quit polling mode. You may receive more
completions the next time you are called.

> + }
> + }
> +
> + dev->quota -= done;
> + *budget-= done;
> +
> + if (empty) {
> + netif_rx_complete(dev);
> + ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
> + /* XXX rotting packet! */
> + return 0;
> + }
> +
> + return 1;
> +}
>



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] heads-up - ipoib NAPI

2006-09-21 Thread eli
> Hi Eli,
>
> Thanks for sharing the results with us. It is great to see the reduction
> in
> Interrupts. Could you please specify the netperf test specifications
> [message size; socket size]. Wondering what the numbers would be if we use
> large socket and message sizes [128K & 64K respectively]. The reason for
> the
> request is to make sure we are not hitting any TCP related bottleneck
> while
> comparing NAPI vs. no NAPI cases. Please let me know what you think.

I used large socket buffer sizes. Here is the command line I used. The
result for the bandwidth is the sum over all the connections.

netperf -H 11.4.3.144 -l 600 -f M -p $port -- -s 20,20 -S
20,20


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] IB/ipoib: NAPI

2006-09-21 Thread Eli cohen
This patch implements NAPI for ipoib. It is a draft implementation.
I would like your opinion on whether we need a module parameter
to control whether NAPI should be activated or not.
Also, there is a need to implement peek_cq and call it after
ib_req_notify_cq() so as to know whether there is a need to call
netif_rx_schedule_prep() again.
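
For context, here is a sketch of the window that peek_cq is meant to close
(peek_cq does not exist yet, so the ib_peek_cq() call below is purely
hypothetical; ib_req_notify_cq() is the stock two-argument verb):

	netif_rx_complete(dev);
	ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
	/*
	 * A CQE that arrived after the last ib_poll_cq() but before the
	 * req_notify above never generates a completion event, so it would
	 * sit in the CQ until some unrelated event fires.  A peek verb
	 * would let us detect that and get back on the poll list:
	 */
	if (ib_peek_cq(priv->cq) > 0 &&		/* hypothetical verb */
	    netif_rx_schedule_prep(dev))
		__netif_rx_schedule(dev);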

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---

Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c   
2006-09-21 16:30:35.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c2006-09-21 
16:30:42.0 +0300
@@ -69,6 +69,8 @@
 MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 #endif
 
+static const int poll_def_weight = 64;
+
 struct ipoib_path_iter {
struct net_device *dev;
struct ipoib_path  path;
@@ -91,6 +93,9 @@
.remove = ipoib_remove_one
 };
 
+
+int ipoib_poll(struct net_device *dev, int *budget);
+
 int ipoib_open(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -689,6 +694,7 @@
goto out;
}
 
+
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
spin_lock(&priv->lock);
__skb_queue_tail(&neigh->queue, skb);
@@ -892,6 +898,7 @@
 
/* Delete any child interfaces first */
list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
+   netif_poll_disable(priv->dev);
unregister_netdev(cpriv->dev);
ipoib_dev_cleanup(cpriv->dev);
free_netdev(cpriv->dev);
@@ -919,6 +926,8 @@
dev->hard_header = ipoib_hard_header;
dev->set_multicast_list  = ipoib_set_mcast_list;
dev->neigh_setup = ipoib_neigh_setup_dev;
+   dev->poll= ipoib_poll;
+   dev->weight  = poll_def_weight;
 
dev->watchdog_timeo  = HZ;
 
@@ -1097,6 +1106,8 @@
goto register_failed;
}
 
+   netif_poll_enable(priv->dev);
+
ipoib_create_debug_files(priv->dev);
 
if (ipoib_add_pkey_attr(priv->dev))
@@ -,6 +1122,7 @@
return priv->dev;
 
 sysfs_failed:
+   netif_poll_disable(priv->dev);
ipoib_delete_debug_files(priv->dev);
unregister_netdev(priv->dev);
 
@@ -1168,6 +1180,7 @@
dev_list = ib_get_client_data(device, &ipoib_client);
 
list_for_each_entry_safe(priv, tmp, dev_list, list) {
+   netif_poll_disable(priv->dev);
ib_unregister_event_handler(&priv->event_handler);
flush_scheduled_work();
 
Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-09-21 
16:30:38.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2006-09-21 
17:24:59.0 +0300
@@ -169,7 +169,7 @@
return 0;
 }
 
-static void ipoib_ib_handle_wc(struct net_device *dev,
+static void ipoib_ib_handle_rwc(struct net_device *dev,
   struct ib_wc *wc)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -178,122 +178,186 @@
ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n",
   wr_id, wc->opcode, wc->status);
 
-   if (wr_id & IPOIB_OP_RECV) {
-   wr_id &= ~IPOIB_OP_RECV;
-
-   if (wr_id < ipoib_recvq_size) {
-   struct sk_buff *skb  = priv->rx_ring[wr_id].skb;
-   dma_addr_t  addr = priv->rx_ring[wr_id].mapping;
-
-   if (unlikely(wc->status != IB_WC_SUCCESS)) {
-   if (wc->status != IB_WC_WR_FLUSH_ERR)
-   ipoib_warn(priv, "failed recv event "
-  "(status=%d, wrid=%d 
vend_err %x)\n",
-  wc->status, wr_id, 
wc->vendor_err);
-   dma_unmap_single(priv->ca->dma_device, addr,
-IPOIB_BUF_SIZE, 
DMA_FROM_DEVICE);
-   dev_kfree_skb_any(skb);
-   priv->rx_ring[wr_id].skb = NULL;
-   return;
-   }
-
-   /*
-* If we can't allocate a new RX buffer, dump
-* this packet and reuse the old buffer.
-*/
-   if (unlikely(ipoib_alloc_rx_skb(dev, wr

[openib-general] heads-up - ipoib NAPI

2006-09-21 Thread Eli cohen
Hi,

I have a draft implementation of NAPI in ipoib and got the following
results:

System descriptions
===
Quad CPU E64T 2.4 Ghz
4 GB RAM
MT25204 Sinai HCA

I used netperf for benchmarking; the BW test ran for 600 seconds with 8
clients and 8 servers.

The results I received are below:

netperf TCP_STREAM:
              BW [MByte/sec]   client side [irqs/sec]   server side [irqs/sec]
              --------------   ----------------------   ----------------------
without NAPI: 506              86441                    66311
with NAPI:    550              6830                     13600


netperf TCP_RR:
rate [tran/sec]
---
without NAPI:  39600
with NAPI: 39470



Please note this is still a work in progress and we plan to do more tests
and measurements on other devices.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] IB/ipoib: unlikely in send

2006-09-21 Thread Eli cohen
Use unlikely in send flow

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
---

Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-09-21 
16:19:33.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2006-09-21 
16:20:39.0 +0300
@@ -385,7 +385,7 @@
struct ipoib_tx_buf *tx_req;
dma_addr_t addr;
 
-   if (skb->len > dev->mtu + INFINIBAND_ALEN) {
+   if (unlikely(skb->len > dev->mtu + INFINIBAND_ALEN)) {
ipoib_warn(priv, "packet len %d (> %d) too long to send, 
dropping\n",
   skb->len, dev->mtu + INFINIBAND_ALEN);
++priv->stats.tx_dropped;



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] IB/ipoib: likely/unlikely annotations

2006-09-21 Thread Eli cohen
Use likely/unlikely in data tx flow

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Acked-by: Michael S. Tsirkin <[EMAIL PROTECTED]>
---

Index: openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- openib-1.1-rc6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c   
2006-09-21 15:43:49.0 +0300
+++ openib-1.1-rc6/drivers/infiniband/ulp/ipoib/ipoib_main.c2006-09-21 
15:46:26.0 +0300
@@ -643,7 +643,7 @@
struct ipoib_neigh *neigh;
unsigned long flags;
 
-   if (!spin_trylock_irqsave(&priv->tx_lock, flags))
+   if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
return NETDEV_TX_LOCKED;
 
/*
@@ -656,7 +656,7 @@
return NETDEV_TX_BUSY;
}
 
-   if (skb->dst && skb->dst->neighbour) {
+   if (likely(skb->dst && skb->dst->neighbour)) {
if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
ipoib_path_lookup(skb, dev);
goto out;



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] ipoib mcast restart

2006-09-19 Thread eli
> Why is the ipoib_mcast_start_thread() at the end of ipoib_ib_dev_up()
> not sufficient to rejoin all the mcgs?
>
Because after a port event all the mcast groups on the device are flushed;
all that remains is dev->mclist, and we must renew the joins from there.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] ipoib multicast problem

2006-09-19 Thread eli
> >
> I don't understand.  How could ipoib rejoin the broadcast group and
> then not rejoin the rest of the full member groups it has?
>
>
That is because the broadcast group is not part of the multicast groups
maintained by the kernel but rather is part of ipoib and is joined from a
different function. The other full members are maintained by the kernel
for the net device and come from dev->mclist.


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] ipoib multicast problems on RHEL4.0 u4

2006-09-19 Thread Eli cohen
Hi,

While testing ipoib multicast on RHEL4.0 u4, I noticed that setsockopt()
succeeds in adding a multicast group to an interface, but the
multicast group is not actually added to the net_device. This means that an
application cannot join a multicast group as a full member. When I
examined the differences between the kernel sources for u3 and u4, I
noticed that essential code was removed:

diff -ru net/ipv4/arp.c ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c
--- net/ipv4/arp.c  2006-09-18 15:35:03.0 +0300
+++ ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c  2006-09-19
10:08:06.0 +0300
@@ -213,9 +213,6 @@
case ARPHRD_IEEE802_TR:
ip_tr_mc_map(addr, haddr);
return 0;
-   case ARPHRD_INFINIBAND:
-   ip_ib_mc_map(addr, haddr);
-   return 0;
default:
if (dir) {
memcpy(haddr, dev->broadcast, dev->addr_len);


Can anyone suggest a workaround to this issue?

Thanks
Eli


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] ipoib mcast restart

2006-09-18 Thread Eli cohen
Make sure that after ipoib_ib_dev_flush is executed,
ipoib_mcast_restart_task is also executed to rejoin all the
mcast groups maintained by the kernel for the device.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-09-14
17:20:06.0 +0300
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2006-09-17
15:51:52.0 +0300
@@ -619,8 +619,10 @@
 * The device could have been brought down between the start and when
 * we get here, don't bring it back up if it's not configured up
 */
-   if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+   if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) {
ipoib_ib_dev_up(dev);
+   ipoib_mcast_restart_task(dev);
+   }
 
mutex_lock(&priv->vlan_mutex);
 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] ipoib multicast problem

2006-09-18 Thread Eli cohen
Hi,
I have seen the following problem with ipoib:

1. An application registers to a multicast group as a full member. As a
result all such groups are listed in dev->mclist.
2. The infiniband link goes down momentarily, opensm is restarted, etc.
3. All multicast memberships are flushed.
4. The net device will not rejoin until, at some later time, something
causes ipoib_set_mcast_list() to be called.
 



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] ipoib sendonly join

2006-09-14 Thread Eli cohen
When a send-only join fails, mcast->query must be set to NULL so
that successive joins will be attempted for the group.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib-1.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c  
2006-09-12 14:28:33.0 +0300
+++ openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   2006-09-14 
17:17:12.0 +0300
@@ -326,6 +326,7 @@
 
/* Clear the busy flag so we try again */
clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+   mcast->query = NULL;
}
 
complete(&mcast->done);



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] ipoib send only failure

2006-09-14 Thread Eli cohen
Hi,
When running a test I encountered the following scenario:
the test sends to a multicast address;
ipoib issues a send-only join, which fails;
successive joins to this group are then never attempted, since the query
field of the mcast object still holds the old pointer.




___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] PXE + infiniband?

2006-09-10 Thread Eli cohen
On Thu, 2006-09-07 at 08:19 +0100, Paul Baxter wrote:
> > There is an implementation of PXE for Mellanox's HCAs that can be found
> > here: http://sourceforge.net/forum/forum.php?forum_id=494529
> 
> Thanks for the tip
> 
> I, too, am interested in this.
> 
> Do you have a more direct link as I wandered around etherboot's project site 
> and couldn't find anything IB-specific.
> 
> Paul Baxter 
Hi,

Please use the following link
http://kent.dl.sourceforge.net/sourceforge/etherboot/etherboot-5.4.2.tar.bz2 to
download the package. Unpack the package and cd to the src dir. Use an x86
machine to build the binaries. The infiniband drivers are located at
src/drivers/net/mlx_ipoib/, where you can find a readme file in the doc
directory. To build:

cd src
make bin/MT23108.zrom  // for MT23108
make bin/MT25208.zrom
make bin/MT25218.zrom

This covers all Mellanox HCAs. Please let me know if you need more
assistance.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] PXE + infiniband?

2006-09-06 Thread Eli cohen
On Fri, 2006-09-01 at 14:51 -0400, Cain, Brian (GE Healthcare) wrote:
> A while back
> (http://openib.org/pipermail/openib-general/2005-September/010801.html)
> there was mention of putting PXE stuff on an HCA.  Has anyone done this
> with PXELINUX?  It doesn't seem like it's as straightforward as just
> putting the stock PXELINUX image on your HCA.  I'm assuming this image
> would have to recognize the HCA and bring up IPoIB in order to use the
> conventional TFTP transport?

There is an implementation of PXE for Mellanox's HCAs that can be found
here: http://sourceforge.net/forum/forum.php?forum_id=494529


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] huge pages support

2006-08-23 Thread Eli Cohen
On Fri, 2006-08-18 at 15:23 +0200, Robert Rex wrote:
> Hello,
> 
> I've also worked on the same topic. Here is what I've done so far as I 
> successfully tested it on mthca and ehca. I'd appreciate for comments and 
> suggestions.
>  
> + down_read(&current->mm->mmap_sem);
> + if (is_vm_hugetlb_page(find_vma(current->mm, (unsigned long) addr))) {
> + use_hugepages   = 1;
> + region_page_mask= HPAGE_MASK;
> + region_page_size= HPAGE_SIZE;

This might cause a kernel oops if the address passed by the user does
not belong to the process's address space. In that case find_vma() might
return NULL and is_vm_hugetlb_page() will crash.
And even if find_vma() returns a non-NULL value, that still does not
guarantee that the vma returned is the one that contains the address.
You need to check that the address is greater than or equal to
vma->vm_start.
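
As a sketch of the guarded form being suggested (illustrative only; the
variable names follow the snippet quoted above):

	down_read(&current->mm->mmap_sem);
	vma = find_vma(current->mm, (unsigned long) addr);
	if (!vma || (unsigned long) addr < vma->vm_start) {
		/* addr is not mapped in this process */
		up_read(&current->mm->mmap_sem);
		return -EFAULT;
	}
	use_hugepages    = is_vm_hugetlb_page(vma);
	region_page_mask = use_hugepages ? HPAGE_MASK : PAGE_MASK;
	region_page_size = use_hugepages ? HPAGE_SIZE : PAGE_SIZE;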


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] huge pages support

2006-08-20 Thread Eli Cohen
Oh that one I have not seen... but it looks like this is the approach I
took in VAPI and somehow it looked cumbersome to me.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
Sent: Sunday, August 20, 2006 7:02 PM
To: Tziporet Koren
Cc: Openib-general@openib.org
Subject: Re: [PATCH] huge pages support

Roland> Sure, you are building the OFED release so you can put
Roland> whatever you want into it.

...although maybe it would be a good idea to follow the approach of
the second patch posted, and make multiple get_user_pages() calls,
skipping along by HPAGE_SIZE.

This avoids having all the extra work of follow_hugetlb_page()
creating extra fake pages and then calling put_page() many times.

 - R.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [PATCH] huge pages support

2006-08-17 Thread Eli Cohen
On Thu, 2006-08-17 at 09:06 -0700, Roland Dreier wrote:
> My first reaction on reading this is that it looks like we need some
> more help from mm/ to make this cleaner.
> 
>  > +  for (j = 0; j < umem->page_size / PAGE_SIZE; ++j)
>  > +  put_page(chunk->page_list[i].page);
> 
> This just looks really bad... you're assuming that calling put_page()
> on a different page than the one get_user_pages() gave you is OK.
> Does that actually work?  (I don't remember the details of how hugetlb
> pages work)
A huge page is a compound page composed of a number of pages; the first
(head) page represents all the others, so it is safe to do this.

> 
>  > +  if (hpage == -1)
>  > +  hpage = is_vm_hugetlb_page(vma);
>  > +  else
>  > +  if (hpage != is_vm_hugetlb_page(vma))
>  > +  return -ENOMEM;
> 
> Users aren't allowed to register regions that span both hugetlb and
> non-hugetlb pages??  Why not?

It is not that easy anyway to obtain huge pages, and it is even more
unlikely that a user will want to register a region with mixed page sizes,
so I just don't allow that. But I think these cases could be allowed by
falling back to PAGE_SIZE registration.

> 
>  > +  if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) {
> 
> shift_diff is used here...
> 
>  > +  ret = -ENOMEM;
>  > +  goto exit_up;
>  > +  }
>  > +
>  > +  page_size = 1 << page_shift;
>  > +  page_mask = ~(page_size - 1);
>  > +  shift_diff = page_shift - PAGE_SHIFT;
> 
> and set down here... don't you get a warning about "is used uninitialized"?


Oops. I didn't see any warning. This if should be moved a few lines down.
> 
> Also, if someone tries to register a hugetlb region and a regular page
> isn't big enough to hold all the pointers, then they just lose??  This
> would be a real problem if/when GB pages are implemented on amd64 --
> there would be no way to register such a region.
> 
> 
> Basically this patch seems to be trying to undo what follow_hugetlb_pages()
> does to create normal pages for the pieces of a hugetlb page.  It
> seems like it would be better to have a new function like get_user_pages()
> that returned its result in a struct scatterlist so that it could give
> back huge pages directly.
> 

I agree about that.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] huge pages support

2006-08-17 Thread Eli Cohen
The following patch adds support for registering huge-page memory
regions while passing the correct page size to the low-level driver, thus
saving MTT entries and the kernel memory used to store page descriptors.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>


Index: last_stable/drivers/infiniband/core/uverbs_mem.c
===
--- last_stable.orig/drivers/infiniband/core/uverbs_mem.c
+++ last_stable/drivers/infiniband/core/uverbs_mem.c
@@ -35,6 +35,7 @@
  */
 
 #include 
+#include 
 #include 
 
 #include "uverbs.h"
@@ -49,7 +50,7 @@ struct ib_umem_account_work {
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int 
dirty)
 {
struct ib_umem_chunk *chunk, *tmp;
-   int i;
+   int i, j;
 
list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
dma_unmap_sg(dev->dma_device, chunk->page_list,
@@ -57,13 +58,43 @@ static void __ib_umem_release(struct ib_
for (i = 0; i < chunk->nents; ++i) {
if (umem->writable && dirty)
set_page_dirty_lock(chunk->page_list[i].page);
-   put_page(chunk->page_list[i].page);
+   for (j = 0; j < umem->page_size / PAGE_SIZE; ++j)
+   put_page(chunk->page_list[i].page);
}
 
kfree(chunk);
}
 }
 
+static int get_page_shift(void *addr, size_t size)
+{
+   struct vm_area_struct *vma;
+   unsigned long va = (unsigned long)addr;
+   int hpage = -1;
+
+next:
+   vma = find_vma(current->mm, va);
+   if (!vma)
+   return -ENOMEM;
+
+   if (va < vma->vm_start)
+   return -ENOMEM;
+
+   if (hpage == -1)
+   hpage = is_vm_hugetlb_page(vma);
+   else
+   if (hpage != is_vm_hugetlb_page(vma))
+   return -ENOMEM;
+
+   if ((va + size)  > vma->vm_end) {
+   size -= (vma->vm_end - va);
+   va = vma->vm_end;
+   goto next;
+   }
+
+   return hpage ? HPAGE_SHIFT : PAGE_SHIFT;
+}
+
 int ib_umem_get(struct ib_device *dev, struct ib_umem *mem,
void *addr, size_t size, int write)
 {
@@ -73,9 +104,12 @@ int ib_umem_get(struct ib_device *dev, s
unsigned long lock_limit;
unsigned long cur_base;
unsigned long npages;
+   int page_shift, shift_diff, nreg_pages, tmp;
+   int max_page_chunk;
+   unsigned long page_mask, page_size;
int ret = 0;
int off;
-   int i;
+   int i, j, nents;
 
if (!can_do_mlock())
return -EPERM;
@@ -84,17 +118,36 @@ int ib_umem_get(struct ib_device *dev, s
if (!page_list)
return -ENOMEM;
 
+   down_write(&current->mm->mmap_sem);
+
+   page_shift = get_page_shift(addr, size);
+   if (IS_ERR_VALUE(page_shift)) {
+   ret = page_shift;
+   goto exit_up;
+   }
+
+   /* make sure enough pointers get into PAGE_SIZE to
+  contain at least one huge page */
+   if (!(PAGE_SIZE / (sizeof (struct page *) << shift_diff))) {
+   ret = -ENOMEM;
+   goto exit_up;
+   }
+
+   page_size = 1 << page_shift;
+   page_mask = ~(page_size - 1);
+   shift_diff = page_shift - PAGE_SHIFT;
+
mem->user_base = (unsigned long) addr;
mem->length= size;
-   mem->offset= (unsigned long) addr & ~PAGE_MASK;
-   mem->page_size = PAGE_SIZE;
+   mem->offset= (unsigned long) addr & ~page_mask;
+   mem->page_size = page_size;
mem->writable  = write;
 
INIT_LIST_HEAD(&mem->chunk_list);
 
-   npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT;
 
-   down_write(&current->mm->mmap_sem);
+   nreg_pages = ALIGN(size + mem->offset, page_size) >>  page_shift;
+   npages = nreg_pages << shift_diff;
 
locked = npages + current->mm->locked_vm;
lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 
PAGE_SHIFT;
@@ -104,14 +157,15 @@ int ib_umem_get(struct ib_device *dev, s
goto out;
}
 
-   cur_base = (unsigned long) addr & PAGE_MASK;
+   cur_base = (unsigned long) addr & page_mask;
 
while (npages) {
+   tmp = min_t(int, npages,
+   (PAGE_SIZE / (sizeof (struct page *) << shift_diff))
+   << shift_diff);
ret = get_user_pages(current, current->mm, cur_base,
-min_t(int, npages,
-  PAGE_SIZE / sizeof (struct page *)),
+tmp,

Re: [openib-general] potential bug: ipoib freeing ah before completion

2006-06-12 Thread Eli Cohen
The issue was originally raised by Eitan Rabin.

-Original Message-
From: Michael S. Tsirkin 
Sent: Monday, June 12, 2006 3:49 PM
To: openib-general@openib.org; Roland Dreier
Cc: Eli Cohen
Subject: potential bug: ipoib freeing ah before completion

Hello, Roland!
The following was noted by Eli Cohen:

It seems that ipoib_flush_paths can be called while completions are still
outstanding on the IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens,
an address handle might get freed while a work request is still outstanding
for it.
This can trigger a local QP error, and IPoIB will stop working until the QP
is reset.
Please comment.

-- 
MST

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [Bug 122] New: mad layer problem

2006-06-09 Thread Eli Cohen
Hi,

Here is some info:


Attached are the SysRq messages.
The relation of MADs to ARP is that after ARP resolves a hardware address it
is required to use an SM query to resolve the path to the host bearing the
hardware address.
How to invoke the tests: attached are one readme file and one configuration
file.

Eli

messages.bz2
Description: messages.bz2


114-115-800conn.conf
Description: 114-115-800conn.conf

Running the test:

On both machines:
make sure the sdp module is up on both machines and that ping over ib0 works
export LD_PRELOAD=/usr/local/ofed/lib64/libsdp.so 
export LIBSDP_CONFIG_FILE=/usr/local/ofed/etc/libsdp.conf

run:
sw114 SERVER side:
/usr/local/polygraph/bin/polysrv --config 
/usr/local/polygraph/114-115-800conn.conf --verb_lvl 10

sw115 CLIENT side:
/usr/local/polygraph/bin/polyclt --config 
/usr/local/polygraph/114-115-800conn.conf --verb_lvl 10


In less than 5 minutes IPoIB will stop working (try ping) and the poly
test will fail as well.


In case you need to compile polygraph:
1. cd /mswg/work/alberto/polygraph/polygraph-2.8.1/
2. ./configure --prefix=/usr/local/polygraph
3. make clean && make && make install
4. cp 114-115-800conn.conf /usr/local/polygraph/
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-04 Thread Eli Cohen
> More clarification: we saw two races here:
> 1. path_free() was called by both unicast_arp_send() and
> ipoib_flush_paths() at the same time.

It is not possible to call path_free() on the same object from both
unicast_arp_send() and ipoib_flush_paths(). This is because
unicast_arp_send() calls it only for newly created objects for which
path_rec_create() failed, in which case the object was never inserted
into the list or the rb_tree.

> 2. during unicast arp skb retransmission, unicast_arp_send() appended
> the skb on the list, while ipoib_flush_paths() calling path_free() to
> free the same skb from the list.

I don't see any issue here either.

Can you reproduce the crash? If you can, can you describe how?

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [PATCH] ipoib

2006-05-28 Thread Eli Cohen
Sorry about the badly formatted previous email - here it is again...

When ipoib_stop is called it first calls netif_stop_queue() to
stop the kernel from passing more packets to the network driver.
After that, ipoib_ib_dev_stop() moves the QP to the error state, causing all
pending work requests to complete with an error, but then the completion
handler may call netif_wake_queue(), re-enabling packet transfer.
This patch makes sure netif_wake_queue() is called only
if IPOIB_FLAG_ADMIN_UP is set.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael Tsirkin <[EMAIL PROTECTED]>

Index: linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-kernel.orig/infiniband/ulp/ipoib/ipoib_ib.c   2006-05-28 
12:53:19.661381000 +0300
+++ linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c2006-05-28 
13:00:02.512096000 
+0300
@@ -269,6 +269,7 @@
spin_lock_irqsave(&priv->tx_lock, flags);
++priv->tx_tail;
if (netif_queue_stopped(dev) &&
+   test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) &&
priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
netif_wake_queue(dev);
spin_unlock_irqrestore(&priv->tx_lock, flags);
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] ipoib

2006-05-28 Thread Eli Cohen
When ipoib_stop is called it first calls netif_stop_queue() to
stop the kernel from passing more packets to the network driver.
After that, ipoib_ib_dev_stop() moves the QP to the error state, causing all
pending work requests to complete with an error, but then the completion
handler may call netif_wake_queue(), re-enabling packet transfer.
This patch makes sure netif_wake_queue() is called only
if IPOIB_FLAG_ADMIN_UP is set.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael Tsirkin <[EMAIL PROTECTED]>

Index: linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-kernel.orig/infiniband/ulp/ipoib/ipoib_ib.c 2006-05-28
12:53:19.661381000 +0300 
+++ linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c2006-05-28 
13:00:02.512096000 +0300
@@ -269,6 +269,7 @@
spin_lock_irqsave(&priv->tx_lock, flags);
++priv->tx_tail;
if (netif_queue_stopped(dev) &&
+ test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) &&
priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
netif_wake_queue(dev);
spin_unlock_irqrestore(&priv->tx_lock, flags);

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: [IPoIB] executing iperf over IPoIB causes to multicast (IP) packets to be recieved out-of-order

2006-05-25 Thread Eli Cohen

> This is on back to back. I don't think hardware does this - this is
> ipoib software thing.
> 

I saw this problem with unicast UD datagrams as well. This was on Fedora
Core 4, kernel 2.6.11-1.1369_FC4smp. I verified that the packets arrived in
order just before calling netif_rx_ni() by peeking into the ip and udp
layers. After that I tried this on kernel 2.6.16.17 and there were
no out-of-order reports. So I guess this was a Linux networking stack
problem that was resolved in newer kernels.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] RE: [PATCH] ipoib_cleanup_module

2006-04-08 Thread Eli Cohen

> Not sure I follow this.  What error could occur?

It's more a theoretical problem than a real one. The ipoib workqueue is
still running and could be using debugfs, which no longer exists. I
know that in the current implementation there is no real problem, but it is
cleaner to destroy in the opposite order of creation.
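
A sketch of the ordering argument (the creation order here is paraphrased
from ipoib_init_module() in the same file; error handling omitted):

	/* creation order in ipoib_init_module(): */
	ipoib_register_debugfs();                                   /* (1) */
	ipoib_workqueue = create_singlethread_workqueue("ipoib");   /* (2) */
	ib_register_client(&ipoib_client);                          /* (3) */

	/* teardown in ipoib_cleanup_module(), reversed by the patch below: */
	ib_unregister_client(&ipoib_client);    /* (3) undone first */
	destroy_workqueue(ipoib_workqueue);     /* (2) flushed next  */
	ipoib_unregister_debugfs();             /* (1) removed last  */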

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] ipoib_cleanup_module

2006-04-06 Thread Eli Cohen
Ensure teardown happens in the reverse order of creation; otherwise errors
might occur if debugfs is used.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1216,8 +1216,8 @@ err_fs:
 static void __exit ipoib_cleanup_module(void)
 {
ib_unregister_client(&ipoib_client);
-   ipoib_unregister_debugfs();
destroy_workqueue(ipoib_workqueue);
+   ipoib_unregister_debugfs();
 }
 
 module_init(ipoib_init_module);
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: [PATCH] ipoib_mcast_restart_task

2006-04-06 Thread Eli Cohen
>
> The reason for that is probably because I am using a custom kernel compiled
> with 'Debug memory allocations' which poisons freed memory.

Oops, ignore this. I did not notice the succeeding discussion.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: [PATCH] ipoib_mcast_restart_task

2006-04-05 Thread Eli Cohen
On Wednesday 05 April 2006 18:43, Roland Dreier wrote:
> Michael> Not sure I read you. It'd still be use after free, won't it?
>
> It's definitely a bug.  But it doesn't explain the specific oops we
> saw.  In other words, doing:
>
>   kfree(mcast);
>   dev = mcast->dev;
>
> shouldn't cause an oops, because mcast is still a valid kernel
> pointer, even if the memory it points to might be reused and
> corrupted.  Following the dev pointer after that snippet might cause
> an oops, because it might be overwritten.
>

The reason for that is probably because I am using a custom kernel compiled
with 'Debug memory allocations' which poisons freed memory.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] ipoib_flush_paths

2006-04-05 Thread Eli Cohen
ib_sa_cancel_query must be called with priv->lock held since
a completion might arrive and set path->query to NULL

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -342,14 +342,16 @@ void ipoib_flush_paths(struct net_device
list_for_each_entry(path, &remove_list, list)
rb_erase(&path->rb_node, &priv->path_tree);
 
-   spin_unlock_irqrestore(&priv->lock, flags);
 
list_for_each_entry_safe(path, tp, &remove_list, list) {
if (path->query)
ib_sa_cancel_query(path->query_id, path->query);
+   spin_unlock_irqrestore(&priv->lock, flags);
wait_for_completion(&path->done);
path_free(dev, path);
+   spin_lock_irqsave(&priv->lock, flags);
}
+   spin_unlock_irqrestore(&priv->lock, flags);
 }
 
 static void path_rec_completion(int status,
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] ipoib_mcast_restart_task

2006-04-05 Thread Eli Cohen
ipoib_mcast_restart_task might free an mcast object while a join request
is still outstanding, leading to an oops when the query completes. Fix
this by waiting for the query to complete, similar to what ipoib_stop_thread
is doing. The wait-for-mcast-completion code is consolidated in
wait_join_complete().
 

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -617,6 +617,22 @@ int ipoib_mcast_start_thread(struct net_
return 0;
 }
 
+static void wait_join_complete(struct ipoib_dev_priv *priv,
+  struct ipoib_mcast *mcast)
+{
+   spin_lock_irq(&priv->lock);
+   if (mcast && mcast->query) {
+   ib_sa_cancel_query(mcast->query_id, mcast->query);
+   mcast->query = NULL;
+   spin_unlock_irq(&priv->lock);
+   ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
+   IPOIB_GID_ARG(mcast->mcmember.mgid));
+   wait_for_completion(&mcast->done);
+   }
+   else
+   spin_unlock_irq(&priv->lock);
+}
+
 int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -636,28 +652,10 @@ int ipoib_mcast_stop_thread(struct net_d
if (flush)
flush_workqueue(ipoib_workqueue);
 
-   spin_lock_irq(&priv->lock);
-   if (priv->broadcast && priv->broadcast->query) {
-   ib_sa_cancel_query(priv->broadcast->query_id, 
priv->broadcast->query);
-   priv->broadcast->query = NULL;
-   spin_unlock_irq(&priv->lock);
-   ipoib_dbg_mcast(priv, "waiting for bcast\n");
-   wait_for_completion(&priv->broadcast->done);
-   } else
-   spin_unlock_irq(&priv->lock);
+   wait_join_complete(priv, priv->broadcast);
 
-   list_for_each_entry(mcast, &priv->multicast_list, list) {
-   spin_lock_irq(&priv->lock);
-   if (mcast->query) {
-   ib_sa_cancel_query(mcast->query_id, mcast->query);
-   mcast->query = NULL;
-   spin_unlock_irq(&priv->lock);
-   ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT 
"\n",
-   IPOIB_GID_ARG(mcast->mcmember.mgid));
-   wait_for_completion(&mcast->done);
-   } else
-   spin_unlock_irq(&priv->lock);
-   }
+   list_for_each_entry(mcast, &priv->multicast_list, list)
+   wait_join_complete(priv, mcast);
 
return 0;
 }
@@ -910,6 +908,7 @@ void ipoib_mcast_restart_task(void *dev_
 
/* We have to cancel outside of the spinlock */
list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
+   wait_join_complete(priv, mcast);
ipoib_mcast_leave(mcast->dev, mcast);
ipoib_mcast_free(mcast);
}
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] Re: [PATCH] ipoib_mcast_restart_task

2006-04-05 Thread Eli Cohen
> Yes, looks like there might be problem here.  However, is there any
> way to consolidate the "cancel and wait for done" code in one place,
> rather than just cut-and-pasting it from ipoib_stop_thread()?
An appropriate patch will follow.

> This could explain the oops in ipoib_mcast_sendonly_join_complete(),
> but only if a send-only group is being replaced by a full-member
> join.  Is Eli's test doing that?
No, not deliberately, but it did not happen again after a full night of
runs. I will keep running the tests for a while and track this.






___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] Re: [PATCH] mthca - query qp

2006-03-19 Thread Eli Cohen
On Thu, 2006-03-16 at 14:38 -0800, Roland Dreier wrote:
> I need to think about how to handle the other attributes.  Probably
> anything that just copies values from the struct ib_qp passed in to
> the function should at least be in the generic ib_query_qp function,
> since there's no good reason for every low-level driver to duplicate it.
> 
Probably this is how it should be done

> However I'm wondering whether we even have the right interface for
> query QP at all.  The only thing that the consumer couldn't have saved
> off for themselves is the QP state, so maybe the right interface is
> just an ib_query_qp_state() function that gives back the QP's current state.
> 
I think we should pass all the info to the consumer in a centralized way
through a standard query_qp verb.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] [PATCH] mthca - query qp

2006-03-16 Thread Eli Cohen
On Tue, 2006-03-14 at 08:08 -0800, Roland Dreier wrote:

> Sorry, I didn't like all of it (it seems silly to copy values from the
> ibqp struct, which the user passed in, into the output struct), and
> then I didn't get around to applying the parts I did like.

But the alternative is that the verb consumer will have to gather all the
information from three different objects, while in some places the fields
are defined but do not contain valid values. Would it not be simpler to
pack all the info into qp_attr and qp_init_attr? After all, it is not the
data path.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: ipoib error handling question

2006-03-15 Thread Eli Cohen
On Tue, 2006-03-14 at 08:03 -0800, Roland Dreier wrote:

> Also I don't see much chance of having the skb queue be full but not
> having a path record query pending.
> 

if (path->ah) {
kref_get(&path->ah->ref);
neigh->ah = path->ah;

ipoib_send(dev, skb, path->ah,
   be32_to_cpup((__be32 *) skb->dst->neighbour->ha));
} else {
neigh->ah  = NULL;
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
__skb_queue_tail(&neigh->queue, skb);
else
goto err_list;

if (!path->query && path_rec_start(dev, path))
goto err_list;
}

The if on the skb queue length is not necessary since this queue is on a
newly created neighbor and is obviously empty. Then the path query would
be issued if path->query is NULL.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] [PATCH] mthca - query qp

2006-03-13 Thread Eli Cohen
Hi Roland,
What about this patch? Are you going to apply it?

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Eli Cohen
Sent: Wednesday, March 08, 2006 4:32 PM
To: Roland Dreier
Cc: openib-general@openib.org
Subject: [openib-general] [PATCH] mthca - query qp

1. Fix error flow - if status is not zero, exit with an error
2. A few more values are extracted

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c
===
--- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c
+++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -427,6 +427,12 @@ int mthca_query_qp(struct ib_qp *ibqp, s
if (err)
goto out;
 
+   if (status) {
+   mthca_warn(dev, "QUERY_QP returned status %02x\n",
status);
+   err = -EINVAL;
+   goto out;
+   }
+
qp_param= mailbox->buf;
context = &qp_param->context;
mthca_state = be32_to_cpu(context->flags) >> 28;
@@ -454,6 +460,9 @@ int mthca_query_qp(struct ib_qp *ibqp, s
qp_attr->pkey_index =
be32_to_cpu(context->pri_path.port_pkey) & 0x7f;
qp_attr->alt_pkey_index =
be32_to_cpu(context->alt_path.port_pkey) & 0x7f;
 
+   qp_attr->timeout = context->pri_path.ackto >> 3;
+   qp_attr->alt_timeout = context->alt_path.ackto >> 3;
+
/* qp_attr->en_sqd_async_notify is only applicable in modify qp
*/
qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING;
 
@@ -469,7 +478,17 @@ int mthca_query_qp(struct ib_qp *ibqp, s
qp_attr->rnr_retry  = context->pri_path.rnr_retry >> 5;
qp_attr->alt_port_num   = qp_attr->alt_ah_attr.port_num;
qp_attr->alt_timeout= context->alt_path.ackto >> 3;
+
+   qp_init_attr->event_handler = ibqp->event_handler;
+   qp_init_attr->qp_context= ibqp->qp_context;
+   qp_init_attr->send_cq   = ibqp->send_cq;
+   qp_init_attr->recv_cq   = ibqp->recv_cq;
+   qp_init_attr->srq   = ibqp->srq;
qp_init_attr->cap   = qp_attr->cap;
+   qp_init_attr->sq_sig_type   = ((be32_to_cpu(context->params1) >>
3) & 0x1) ?
+   IB_SIGNAL_REQ_WR : IB_SIGNAL_ALL_WR;
+   qp_init_attr->qp_type   = ibqp->qp_type;
+   qp_init_attr->port_num  = qp_attr->ah_attr.port_num;
 
 out:
mthca_free_mailbox(dev, mailbox);

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] Re: ipoib error handling question

2006-03-13 Thread Eli Cohen
> 
> Yes, that looks like an obvious mistake.  I checked this in:
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 1633aad..00342d5 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -519,12 +519,10 @@ static void neigh_add_path(struct sk_buf
>  be32_to_cpup((__be32 *)
skb->dst->neighbour->ha));
>   } else {
>   neigh->ah  = NULL;
> - if (skb_queue_len(&neigh->queue) <
IPOIB_MAX_PATH_REC_QUEUE) {
> + if (skb_queue_len(&neigh->queue) <
IPOIB_MAX_PATH_REC_QUEUE)
>   __skb_queue_tail(&neigh->queue, skb);
> - } else {
> - ++priv->stats.tx_dropped;
> - dev_kfree_skb_any(skb);
> - }
> + else
> + goto err_list;
>  
>   if (!path->query && path_rec_start(dev, path))
>   goto err;

Doing it this way may result in having an ipoib_path object for which the
query is not complete. I think that if the queue is full you should do the
cleanup but still execute:
if (!path->query && path_rec_start(dev, path))
goto err;

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] mthca - query srq fixes

2006-03-13 Thread Eli Cohen
1. Fix endianness handling of srq_limit
2. Add support for srq_limit for Tavor

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: last_stable/drivers/infiniband/hw/mthca/mthca_srq.c
===
--- last_stable.orig/drivers/infiniband/hw/mthca/mthca_srq.c2006-03-12 
17:15:36.0 +0200
+++ last_stable/drivers/infiniband/hw/mthca/mthca_srq.c 2006-03-12 
18:24:45.0 +0200
@@ -49,7 +49,8 @@ struct mthca_tavor_srq_context {
__be32 state_pd;
__be32 lkey;
__be32 uar;
-   __be32 wqe_cnt;
+   __be16 limit_watermark;
+   __be16 wqe_cnt;
u32reserved[2];
 };
 
@@ -369,6 +370,7 @@ int mthca_query_srq(struct ib_srq *ibsrq
struct mthca_srq *srq = to_msrq(ibsrq);
struct mthca_mailbox *mailbox;
struct mthca_arbel_srq_context *arbel_ctx;
+   struct mthca_tavor_srq_context *tavor_ctx;
u8 status;
int err;
 
@@ -382,9 +384,11 @@ int mthca_query_srq(struct ib_srq *ibsrq
 
if (mthca_is_memfree(dev)) {
arbel_ctx = mailbox->buf;
-   srq_attr->srq_limit = arbel_ctx->limit_watermark;
-   } else
-   srq_attr->srq_limit = 0;
+   srq_attr->srq_limit = be16_to_cpu(arbel_ctx->limit_watermark);
+   } else {
+   tavor_ctx = mailbox->buf;
+   srq_attr->srq_limit = be16_to_cpu(tavor_ctx->limit_watermark);
+   }
 
srq_attr->max_wr  = (mthca_is_memfree(dev)) ? srq->max - 1 : srq->max;
srq_attr->max_sge = srq->max_gs;

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] mthca - query qp

2006-03-08 Thread Eli Cohen
1. Fix error flow - if status is not zero, exit with an error
2. A few more values are extracted

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c
===
--- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c
+++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -427,6 +427,12 @@ int mthca_query_qp(struct ib_qp *ibqp, s
if (err)
goto out;
 
+   if (status) {
+   mthca_warn(dev, "QUERY_QP returned status %02x\n", status);
+   err = -EINVAL;
+   goto out;
+   }
+
qp_param= mailbox->buf;
context = &qp_param->context;
mthca_state = be32_to_cpu(context->flags) >> 28;
@@ -454,6 +460,9 @@ int mthca_query_qp(struct ib_qp *ibqp, s
qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 
0x7f;
qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 
0x7f;
 
+   qp_attr->timeout = context->pri_path.ackto >> 3;
+   qp_attr->alt_timeout = context->alt_path.ackto >> 3;
+
/* qp_attr->en_sqd_async_notify is only applicable in modify qp */
qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING;
 
@@ -469,7 +478,17 @@ int mthca_query_qp(struct ib_qp *ibqp, s
qp_attr->rnr_retry  = context->pri_path.rnr_retry >> 5;
qp_attr->alt_port_num   = qp_attr->alt_ah_attr.port_num;
qp_attr->alt_timeout= context->alt_path.ackto >> 3;
+
+   qp_init_attr->event_handler = ibqp->event_handler;
+   qp_init_attr->qp_context= ibqp->qp_context;
+   qp_init_attr->send_cq   = ibqp->send_cq;
+   qp_init_attr->recv_cq   = ibqp->recv_cq;
+   qp_init_attr->srq   = ibqp->srq;
qp_init_attr->cap   = qp_attr->cap;
+   qp_init_attr->sq_sig_type   = ((be32_to_cpu(context->params1) >> 3) & 
0x1) ?
+   IB_SIGNAL_REQ_WR : IB_SIGNAL_ALL_WR;
+   qp_init_attr->qp_type   = ibqp->qp_type;
+   qp_init_attr->port_num  = qp_attr->ah_attr.port_num;
 
 out:
mthca_free_mailbox(dev, mailbox);

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] [PATCH] mthca - read byte count & read request size

2006-03-06 Thread Eli Cohen
On Mon, 2006-03-06 at 19:25 -0800, Grant Grundler wrote:

> Why is this an enum?
> 
> > +static int pcix_max_rbc = PCIX_MAX_RBC_INVALID;
> 
> It's declared an int and is "user visible".
> I think the user interface would be better served
> if the user could just specify "pcix_max_rbc=2048"
> instead of some magic value.

Using an enum confines the set of valid values the user can specify to
the range 0-3, and it is easier to filter out invalid values. If I used
raw values for the parameters, then questions such as how to interpret a
value of 1000 would arise: as an invalid value, so the parameter is ignored?
Or maybe configure rbc to 512? Or 1024?
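For illustration only (this helper is not part of the patch; it assumes the
PCIX_MAX_RBC_* enum values from the patch below, 0..3), the bounded enum maps
onto the byte count with a simple shift, which is also what makes range
checking trivial:

/* Illustrative helper: values 0..3 mean 512 << val bytes; anything outside
 * the range is treated as "not set", so the parameter is ignored rather
 * than misinterpreted. */
static int pcix_rbc_bytes(int val)
{
	if (val < PCIX_MAX_RBC_512 || val >= PCIX_MAX_RBC_INVALID)
		return -1;	/* invalid - ignore the parameter */

	return 512 << val;	/* 0 -> 512, 1 -> 1024, 2 -> 2048, 3 -> 4096 */
}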
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] mthca - read byte count & read request size

2006-03-05 Thread Eli Cohen
Add control of the PCIX max read byte count and PCI Express max read
request size parameters. This is needed because some chipsets may not work,
or may not work optimally, with the default parameters.
The logic is as follows: if the user sets a valid value for a parameter,
that value wins. Otherwise the value read from the device is compared to
the default value defined by the spec. If the value read does not equal
the default, it is assumed the BIOS configured it and the BIOS value is
retained. Otherwise the former mthca-configured default of 4K is set.
To summarize, it assumes that if the BIOS configures a value it knows
best and we keep this value.
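As a sketch of the selection rule described above (illustrative only; the
real code below operates directly on the PCI-X command register), the
decision for the read byte count boils down to:

/* Sketch of the selection logic for the PCI-X max read byte count.
 * 'user_val' is the module parameter, 'hw_val' the value currently
 * programmed in the device (possibly by the BIOS); both are encoded
 * as 0..3 meaning 512 << val bytes.  The PCI Express max read request
 * size path follows the same pattern with its own encoding. */
static int choose_max_rbc(int user_val, int hw_val)
{
	if (user_val >= PCIX_MAX_RBC_512 && user_val < PCIX_MAX_RBC_INVALID)
		return user_val;	/* explicit user setting wins */

	if (hw_val != PCIX_MAX_RBC_512)
		return hw_val;		/* differs from the spec default - assume the BIOS set it */

	return PCIX_MAX_RBC_4096;	/* spec default - fall back to the old mthca default of 4K */
}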


Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
@@ -69,6 +69,41 @@ MODULE_PARM_DESC(msi, "attempt to use MS
 
 #endif /* CONFIG_PCI_MSI */
 
+/* the PCIX max read byte count and PCI Express max read request size
+   parameters have default values as defined by the PRM which should
+   give best performance in most cases. However there are some
+   chipsets that work better with other values or that do not work at
+   all with the default values */
+enum {
+   PCIX_MAX_RBC_512,
+   PCIX_MAX_RBC_1024,
+   PCIX_MAX_RBC_2048,
+   PCIX_MAX_RBC_4096,
+   PCIX_MAX_RBC_INVALID
+};
+
+#define CAP_PRMS_DESC " - some chipsets do not work optimally or do not work at " \
+   "all with the max value and this parameter can be used to tune "  \
+   "the optimal value"
+
+static int pcix_max_rbc = PCIX_MAX_RBC_INVALID;
+module_param(pcix_max_rbc, int, 0444);
+MODULE_PARM_DESC(pcix_max_rbc, "PCIX max read byte count" CAP_PRMS_DESC);
+
+enum {
+   PCIE_MAX_RRS_128,
+   PCIE_MAX_RRS_256,
+   PCIE_MAX_RRS_512,
+   PCIE_MAX_RRS_1024,
+   PCIE_MAX_RRS_2048,
+   PCIE_MAX_RRS_4096,
+   PCIE_MAX_RRS_INVALID
+};
+
+static int pcie_max_rrs = PCIE_MAX_RRS_INVALID;
+module_param(pcie_max_rrs, int, 0444);
+MODULE_PARM_DESC(pcie_max_rrs, "PCI Express max read request size" CAP_PRMS_DESC);
+
 static const char mthca_version[] __devinitdata =
DRV_NAME ": Mellanox InfiniBand HCA driver v"
DRV_VERSION " (" DRV_RELDATE ")\n";
@@ -89,8 +124,9 @@ static int __devinit mthca_tune_pci(stru
 {
int cap;
u16 val;
+   int orig_val, set_val;
 
-   /* First try to max out Read Byte Count */
+   /* First determine Max Read Byte Count */
cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX);
if (cap) {
if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) {
@@ -98,12 +134,25 @@ static int __devinit mthca_tune_pci(stru
  "aborting.\n");
return -ENODEV;
}
-   val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2);
+
+   if (pcix_max_rbc < PCIX_MAX_RBC_INVALID)
+set_val = pcix_max_rbc;
+   else {
+   orig_val = (val & PCI_X_CMD_MAX_READ) >> 2;
+   if (orig_val != PCIX_MAX_RBC_512)
+   /* retain value set by the bios */
+   set_val = orig_val;
+   else
+   set_val = PCIX_MAX_RBC_4096;
+   }
+   val = (val & ~PCI_X_CMD_MAX_READ) | (set_val << 2);
if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) {
mthca_err(mdev, "Couldn't write PCI-X command register, 
"
  "aborting.\n");
return -ENODEV;
}
+   mthca_dbg(mdev, "setting read byte count to %d bytes\n",
+ 512 << set_val);
} else if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE))
mthca_info(mdev, "No PCI-X capability, not setting RBC.\n");
 
@@ -114,12 +163,26 @@ static int __devinit mthca_tune_pci(stru
  "register, aborting.\n");
return -ENODEV;
}
-   val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12);
+
+   if (pcie_max_rrs < PCIE_MAX_RRS_INVALID)
+   set_val = pcie_max_rrs;
+   else {
+   orig_val = (val & PCI_EXP_DEVCTL_READRQ) >> 12;
+   if (orig_val != PCIE_MAX_RRS_512)
+   /* retain value set by the bios */
+   set_val = 

[openib-general] [PATCH] mthca - optimize sinai large message

2006-03-02 Thread Eli Cohen
Memory key generation modified to optimize large message transfers
on Sinai.
The performance gain takes effect for messages > 80 KByte on Sinai DDR.
The enhancement works when the configured HCA memory key table holds up to
and including 2^23 entries. For larger tables the enhancement is turned off.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael Tsirkin <[EMAIL PROTECTED]>

Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_profile.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -153,8 +153,8 @@ u64 mthca_make_profile(struct mthca_dev 
  "won't in 0x%llx bytes of context memory.\n",
  (unsigned long long) total_size,
  (unsigned long long) mem_avail);
-   kfree(profile);
-   return -ENOMEM;
+   total_size = -ENOMEM;
+   goto exit;
}
 
if (profile[i].size)
@@ -260,6 +260,12 @@ u64 mthca_make_profile(struct mthca_dev 
 */
dev->limits.num_pds = MTHCA_NUM_PDS;
 
+   if ((dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) && init_hca->log_mpt_sz > 
23) {
+   mthca_warn(dev, "MPT table too large - disabling mkey 
optimization "
+  "for Sinai\n");
+   dev->mthca_flags &= ~MTHCA_FLAG_SINAI_OPT;
+   }
+
/*
 * For Tavor, FMRs use ioremapped PCI memory. For 32 bit
 * systems it may use too much vmalloc space to map all MTT
@@ -272,6 +278,7 @@ u64 mthca_make_profile(struct mthca_dev 
else
dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;
 
+exit:
kfree(profile);
return total_size;
 }
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
@@ -935,13 +935,16 @@ enum {
 
 static struct {
u64 latest_fw;
-   int is_memfree;
-   int is_pcie;
+   u32 flags;
 } mthca_hca_table[] = {
-   [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, 
.is_pcie = 0 },
-   [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, 
.is_pcie = 1 },
-   [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, 
.is_pcie = 1 },
-   [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, 
.is_pcie = 1 }
+   [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3),
+   .flags = 0 },
+   [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0),
+   .flags = MTHCA_FLAG_PCIE },
+   [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0),
+   .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE },
+   [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1),
+   .flags = MTHCA_FLAG_MEMFREE | MTHCA_FLAG_PCIE | 
MTHCA_FLAG_SINAI_OPT }
 };
 
 static int __devinit mthca_init_one(struct pci_dev *pdev,
@@ -1031,12 +1034,9 @@ static int __devinit mthca_init_one(stru
 
mdev->pdev = pdev;
 
+   mdev->mthca_flags = mthca_hca_table[id->driver_data].flags;
if (ddr_hidden)
mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN;
-   if (mthca_hca_table[id->driver_data].is_memfree)
-   mdev->mthca_flags |= MTHCA_FLAG_MEMFREE;
-   if (mthca_hca_table[id->driver_data].is_pcie)
-   mdev->mthca_flags |= MTHCA_FLAG_PCIE;
 
/*
 * Now reset the HCA before we touch the PCI capabilities or
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -71,7 +71,8 @@ enum {
MTHCA_FLAG_NO_LAM = 1 << 5,
MTHCA_FLAG_FMR= 1 << 6,
MTHCA_FLAG_MEMFREE= 1 << 7,
-   MTHCA_FLAG_PCIE   = 1 << 8
+   MTHCA_FLAG_PCIE   = 1 << 8,
+   MTHCA_FLAG_SINAI_OPT  = 1 << 9
 };
 
 enum {
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -76,6 +76,8 @@ struct mthca_mpt_entry {
 #define MTHCA_MPT_STATUS_SW 0xF0
 #define MTHCA_MPT_STATUS_HW 0x00
 
+#define SINAI_FMR_KEY_INC 0x100
+
 /*
  * Buddy allocator for MTT segments (currently not very efficient
  * since it doesn't keep a free list 

Re: [openib-general] [PATCH] mthca - optimize sinai large message

2006-02-28 Thread Eli Cohen
On Tue, 2006-02-28 at 13:15 +0200, Or Gerlitz wrote:
> Can you elaborate a little what are the implications of not using this 
> optimizations and what IB transport/size/type of messages (or better
> which ULPs) are effected?

The improvement will be noticeable for protocol messages larger than ~80
KByte: RDMA Write/Read and Send.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] mthca - optimize sinai large message

2006-02-28 Thread Eli Cohen
Memory key generation modified to optimize large message transfers
on Sinai.
This implementation restricts the MPT table size for Sinai to a
maximum of 2^23 entries.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_profile.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -153,8 +153,8 @@ u64 mthca_make_profile(struct mthca_dev 
  "won't in 0x%llx bytes of context memory.\n",
  (unsigned long long) total_size,
  (unsigned long long) mem_avail);
-   kfree(profile);
-   return -ENOMEM;
+   total_size = -ENOMEM;
+   goto exit;
}
 
if (profile[i].size)
@@ -260,6 +260,13 @@ u64 mthca_make_profile(struct mthca_dev 
 */
dev->limits.num_pds = MTHCA_NUM_PDS;
 
+   /* for Sinai the MPT table must be smaller than 2^24 for optimized operation */
+   if ((dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) && init_hca->log_mpt_sz > 
23) {
+   total_size = -ENOSYS;
+   mthca_err(dev, "MPT table too large\n");
+   goto exit;
+   }
+
/*
 * For Tavor, FMRs use ioremapped PCI memory. For 32 bit
 * systems it may use too much vmalloc space to map all MTT
@@ -272,6 +279,7 @@ u64 mthca_make_profile(struct mthca_dev 
else
dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;
 
+exit:
kfree(profile);
return total_size;
 }
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c
@@ -937,11 +937,12 @@ static struct {
u64 latest_fw;
int is_memfree;
int is_pcie;
+   int mkey_opt;
 } mthca_hca_table[] = {
-   [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, 
.is_pcie = 0 },
-   [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, 
.is_pcie = 1 },
-   [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, 
.is_pcie = 1 },
-   [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, 
.is_pcie = 1 }
+   [TAVOR]= { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, 
.is_pcie = 0, .mkey_opt = 0 },
+   [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, 
.is_pcie = 1, .mkey_opt = 0 },
+   [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, 
.is_pcie = 1, .mkey_opt = 0 },
+   [SINAI]= { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, 
.is_pcie = 1, .mkey_opt = 1 }
 };
 
 static int __devinit mthca_init_one(struct pci_dev *pdev,
@@ -1037,6 +1038,9 @@ static int __devinit mthca_init_one(stru
mdev->mthca_flags |= MTHCA_FLAG_MEMFREE;
if (mthca_hca_table[id->driver_data].is_pcie)
mdev->mthca_flags |= MTHCA_FLAG_PCIE;
+   if (mthca_hca_table[id->driver_data].mkey_opt)
+   mdev->mthca_flags |= MTHCA_FLAG_SINAI_OPT;
+
 
/*
 * Now reset the HCA before we touch the PCI capabilities or
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -71,7 +71,8 @@ enum {
MTHCA_FLAG_NO_LAM = 1 << 5,
MTHCA_FLAG_FMR= 1 << 6,
MTHCA_FLAG_MEMFREE= 1 << 7,
-   MTHCA_FLAG_PCIE   = 1 << 8
+   MTHCA_FLAG_PCIE   = 1 << 8,
+   MTHCA_FLAG_SINAI_OPT  = 1 << 9
 };
 
 enum {
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -76,6 +76,8 @@ struct mthca_mpt_entry {
 #define MTHCA_MPT_STATUS_SW 0xF0
 #define MTHCA_MPT_STATUS_HW 0x00
 
+#define SINAI_FMR_KEY_INC 0x100
+
 /*
  * Buddy allocator for MTT segments (currently not very efficient
  * since it doesn't keep a free list and just searches linearly
@@ -330,6 +332,14 @@ static inline u32 key_to_hw_index(struct
return tavor_key_to_hw_index(key);
 }
 
+static inline u32 adjust_key(struct mthca_dev *dev, u32 key)
+{
+   if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)
+   return ((key << 20) & 0x80) | (k

[openib-general] [PATCH] mthca - command interface

2006-02-20 Thread Eli Cohen
This patch checks whether the HCA supports posting commands through
doorbells and, if it does, configures the driver accordingly. This
functionality can be controlled by the value of a read/write parameter,
post_cmd2dbell. When it is 0, commands are posted through the HCR;
otherwise, if the HCA is capable, commands go through the doorbell. The
value of the parameter acts collectively on all available HCAs.
Using UAR0 to post commands eliminates the need to poll the go
bit prior to posting a new command. Since reading from a PCI device is
much more expensive than issuing a posted write, it is expected that
this change will provide better CPU utilization. We are currently
developing such tests and once we have results I will post them to this
list.
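Schematically, the dispatch decision is simply the following (this helper is
illustrative and not part of the patch; the function name is made up, while
the flag and parameter names are taken from the code below):

/* Post through UAR0 doorbells only when the module parameter asks for it
 * and the firmware reported the capability; otherwise fall back to the
 * ordinary HCR path. */
static int mthca_use_cmd_doorbells(struct mthca_dev *dev)
{
	return post_cmd2dbell &&
	       (dev->cmd.flags & MTHCA_CMD_CAN_POST_DOORBELLS);
}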

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: source/drivers/infiniband/hw/mthca/mthca_dev.h
===
--- source.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ source/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -117,9 +117,18 @@ enum {
MTHCA_OPCODE_INVALID= 0xff
 };
 
+enum {
+   MTHCA_CMD_USE_EVENTS = (1 << 0),
+   MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1),
+   MTHCA_CMD_POST_DOORBELLS = (1 << 2)
+};
+
+enum {
+   MTHCA_CMD_NUM_DBELL_DWORDS = 8
+};
+
 struct mthca_cmd {
struct pci_pool  *pool;
-   int   use_events;
struct mutex  hcr_mutex;
struct semaphore  poll_sem;
struct semaphore  event_sem;
@@ -128,6 +137,10 @@ struct mthca_cmd {
int   free_head;
struct mthca_cmd_context *context;
u16   token_mask;
+   u32   flags;
+   void __iomem *dbell_map;
+   u64   dbell_base;
+   u16   dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS];
 };
 
 struct mthca_limits {
Index: source/drivers/infiniband/hw/mthca/mthca_cmd.c
===
--- source.orig/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ source/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -182,25 +182,68 @@ struct mthca_cmd_context {
u8status;
 };
 
+
+static int post_cmd2dbell = 1;
+module_param(post_cmd2dbell, int, 0666);
+MODULE_PARM_DESC(post_cmd2dbell, "post commands through doorbell");
+
 static inline int go_bit(struct mthca_dev *dev)
 {
return readl(dev->hcr + HCR_STATUS_OFFSET) &
swab32(1 << HCR_GO_BIT);
 }
 
-static int mthca_cmd_post(struct mthca_dev *dev,
- u64 in_param,
- u64 out_param,
- u32 in_modifier,
- u8 op_modifier,
- u16 op,
- u16 token,
- int event)
+
+static void mthca_cmd_post_dbell(struct mthca_dev *dev,
+u64 in_param,
+u64 out_param,
+u32 in_modifier,
+u8 op_modifier,
+u16 op,
+u16 token)
 {
-   int err = 0;
+   void __iomem *ptr = dev->cmd.dbell_map;
+   u16 *offs = dev->cmd.dbell_offsets;
 
-   mutex_lock(&dev->cmd.hcr_mutex);
+   __raw_writel((__force u32) cpu_to_be32(in_param >> 32),
+ptr + offs[0]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(in_param & 0xul),
+ptr + offs[1]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(in_modifier),
+ptr + offs[2]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(out_param >> 32),
+ptr + offs[3]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(out_param & 0xul),
+ptr + offs[4]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(token << 16),
+ptr + offs[5]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) |
+  (1 << HCA_E_BIT) |
+  (op_modifier << HCR_OPMOD_SHIFT) 
|
+   op),
+ptr + offs[6]);
+   wmb();
+   __raw_writel((__force u32) 0,
+ptr + offs[7]);
+   wmb();
+}
 
+
+static int mthca_cmd_post_hcr(struct mthca_dev *dev,
+ u64 in_param,
+ u64 out_param,
+ u32 in_modifier,
+ u8 op_modifier,
+ u16 op,
+ u16 token,
+ int e

Re: [openib-general] Re: PATCH] mthca - command interface - revised

2006-02-16 Thread Eli Cohen
Looks like we all agree that it would be better to post commands through
doorbells and save the polling of the go bit. I can then change the patch
so that there will be a r/w module parameter to mthca, ON by default, to
enable this feature, and the user can turn this functionality on or off on
the fly by changing the value of this parameter. If we agree on this
I'll post a new patch early next week.
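A minimal sketch of such a knob, assuming the parameter keeps the
post_cmd2dbell name used in the earlier patch (the exact default and
permissions are open to discussion):

static int post_cmd2dbell = 1;			/* ON by default */
module_param(post_cmd2dbell, int, 0644);	/* writable at runtime via sysfs */
MODULE_PARM_DESC(post_cmd2dbell, "post commands through doorbells when the HCA supports it");

With a writable parameter the user could then flip it on a live system, e.g.
echo 0 > /sys/module/ib_mthca/parameters/post_cmd2dbell, without reloading
the driver.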

Eli


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] PATCH] mthca - command interface - revised

2006-02-15 Thread Eli Cohen
Roland,
this patch is modified according to your comments. It also adds a kernel
configuration option which selects whether commands are posted
through doorbells. The option is off by default.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>

Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -117,9 +117,18 @@ enum {
MTHCA_OPCODE_INVALID= 0xff
 };
 
+enum {
+   MTHCA_CMD_USE_EVENTS = (1 << 0),
+   MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1),
+   MTHCA_CMD_POST_DOORBELLS = (1 << 2)
+};
+
+enum {
+   MTHCA_CMD_NUM_DBELL_DWORDS = 8
+};
+
 struct mthca_cmd {
struct pci_pool  *pool;
-   int   use_events;
struct mutex  hcr_mutex;
struct semaphore  poll_sem;
struct semaphore  event_sem;
@@ -128,6 +137,10 @@ struct mthca_cmd {
int   free_head;
struct mthca_cmd_context *context;
u16   token_mask;
+   u32   flags;
+   void __iomem *dbell_map;
+   u64   dbell_base;
+   u16   dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS];
 };
 
 struct mthca_limits {
Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c
===
--- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -188,19 +188,57 @@ static inline int go_bit(struct mthca_de
swab32(1 << HCR_GO_BIT);
 }
 
-static int mthca_cmd_post(struct mthca_dev *dev,
- u64 in_param,
- u64 out_param,
- u32 in_modifier,
- u8 op_modifier,
- u16 op,
- u16 token,
- int event)
+
+static void mthca_cmd_post_dbell(struct mthca_dev *dev,
+u64 in_param,
+u64 out_param,
+u32 in_modifier,
+u8 op_modifier,
+u16 op,
+u16 token)
 {
-   int err = 0;
+   void __iomem *ptr = dev->cmd.dbell_map;
+   u16 *offs = dev->cmd.dbell_offsets;
+
+   __raw_writel((__force u32) cpu_to_be32(in_param >> 32),
+ptr + offs[0]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(in_param & 0xul),
+ptr + offs[1]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(in_modifier),
+ptr + offs[2]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(out_param >> 32),
+ptr + offs[3]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(out_param & 0xul),
+ptr + offs[4]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32(token << 16),
+ptr + offs[5]);
+   wmb();
+   __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) |
+  (1 << HCA_E_BIT) |
+  (op_modifier << HCR_OPMOD_SHIFT) 
|
+   op),
+ptr + offs[6]);
+   wmb();
+   __raw_writel((__force u32) 0,
+ptr + offs[7]);
+   wmb();
+}
 
-   mutex_lock(&dev->cmd.hcr_mutex);
 
+static int mthca_cmd_post_hcr(struct mthca_dev *dev,
+ u64 in_param,
+ u64 out_param,
+ u32 in_modifier,
+ u8 op_modifier,
+ u16 op,
+ u16 token,
+ int event)
+{
if (event) {
unsigned long end = jiffies + GO_BIT_TIMEOUT;
 
@@ -210,10 +248,8 @@ static int mthca_cmd_post(struct mthca_d
}
}
 
-   if (go_bit(dev)) {
-   err = -EAGAIN;
-   goto out;
-   }
+   if (go_bit(dev))
+   return -EAGAIN;
 
/*
 * We use writel (instead of something like memcpy_toio)
@@ -236,7 +272,29 @@ static int mthca_cmd_post(struct mthca_d
   (op_modifier << HCR_OPMOD_SHIFT) 
|
   op),   
dev->hcr + 6 * 4);
 
-out:
+   return 0;
+}
+
+static int mthca_cmd_post(struct mthca_dev *dev,
+ u64 in_param,
+   

Re: [openib-general] [PATCH] mthca - command interface

2006-02-14 Thread Eli Cohen
On Mon, 2006-02-13 at 22:30 -0800, Roland Dreier wrote: 
> Or is the command doorbell stuff inside the UAR, which we already
> request in mthca_request_regions()?

Yes. These are offsets into UAR0 so there is no need to request a
region.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] mthca - command interface

2006-02-13 Thread Eli Cohen
This patch implements posting commands by issuing posted writes in a
manner similar to doorbells. The benefit of this method is that it does
not require polling the go bit before posting the command and thus can
lower CPU utilization.

Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -128,6 +128,13 @@ struct mthca_cmd {
int   free_head;
struct mthca_cmd_context *context;
u16   token_mask;
+   /* post commands like doorbells instead of hcr */
+   int   can_post_doorbells;
+   int   dbell_post;
+   void __iomem *dbell_map;
+   void __iomem *dbell_ptrs[8];
+   u64   dbell_base;
+   u16   dbell_offsets[8];
 };
 
 struct mthca_limits {
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -188,19 +188,40 @@ static inline int go_bit(struct mthca_de
swab32(1 << HCR_GO_BIT);
 }
 
-static int mthca_cmd_post(struct mthca_dev *dev,
- u64 in_param,
- u64 out_param,
- u32 in_modifier,
- u8 op_modifier,
- u16 op,
- u16 token,
- int event)
-{
-   int err = 0;
 
-   mutex_lock(&dev->cmd.hcr_mutex);
+static void mthca_cmd_post_dbell(struct mthca_dev *dev,
+u64 in_param,
+u64 out_param,
+u32 in_modifier,
+u8 op_modifier,
+u16 op,
+u16 token)
+{
+   void __iomem **ptr = dev->cmd.dbell_ptrs;
+
+   writel((__force u32) cpu_to_be32(in_param >> 32),   ptr[0]);
+   writel((__force u32) cpu_to_be32(in_param & 0xul),  ptr[1]);
+   writel((__force u32) cpu_to_be32(in_modifier),  ptr[2]);
+   writel((__force u32) cpu_to_be32(out_param >> 32),  ptr[3]);
+   writel((__force u32) cpu_to_be32(out_param & 0xul), ptr[4]);
+   writel((__force u32) cpu_to_be32(token << 16),  ptr[5]);
+   writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT)  
|
+  (1 << HCA_E_BIT) 
|
+  (op_modifier << HCR_OPMOD_SHIFT) 
|
+  op), ptr[6]);
+   writel((__force u32) 0, ptr[7]);
+}
+
 
+static int mthca_cmd_post_hcr(struct mthca_dev *dev,
+ u64 in_param,
+ u64 out_param,
+ u32 in_modifier,
+ u8 op_modifier,
+ u16 op,
+ u16 token,
+ int event)
+{
if (event) {
unsigned long end = jiffies + GO_BIT_TIMEOUT;
 
@@ -210,10 +231,8 @@ static int mthca_cmd_post(struct mthca_d
}
}
 
-   if (go_bit(dev)) {
-   err = -EAGAIN;
-   goto out;
-   }
+   if (go_bit(dev))
+   return -EAGAIN;
 
/*
 * We use writel (instead of something like memcpy_toio)
@@ -236,7 +255,29 @@ static int mthca_cmd_post(struct mthca_d
   (op_modifier << HCR_OPMOD_SHIFT) 
|
   op),   
dev->hcr + 6 * 4);
 
-out:
+   return 0;
+}
+
+static int mthca_cmd_post(struct mthca_dev *dev,
+ u64 in_param,
+ u64 out_param,
+ u32 in_modifier,
+ u8 op_modifier,
+ u16 op,
+ u16 token,
+ int event)
+{
+   int err = 0;
+
+   mutex_lock(&dev->cmd.hcr_mutex);
+
+   if (dev->cmd.dbell_post && event)
+   mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier,
+  op_modifier, op, token);
+   else
+   err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
+op_modifie

[openib-general] [PATCH] mthca: query_qp and query_srq

2006-02-12 Thread Eli Cohen
Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

implement ib_query_qp() and ib_query_srq() for mthca.

Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h
@@ -303,6 +303,8 @@ int mthca_RESIZE_CQ(struct mthca_dev *de
u8 *status);
 int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox,
int srq_num, u8 *status);
+int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num,
+   struct mthca_mailbox *mailbox, u8 *status);
 int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox,
int srq_num, u8 *status);
 int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status);
Index: openib_gen2/drivers/infiniband/core/verbs.c
===
--- openib_gen2.orig/drivers/infiniband/core/verbs.c
+++ openib_gen2/drivers/infiniband/core/verbs.c
@@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr)
 {
-   return qp->device->query_qp ?
+   int err;
+
+   err = qp->device->query_qp ?
qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) :
-ENOSYS;
+   if (err)
+   return err;
+   qp_init_attr->recv_cq = qp->recv_cq;
+   qp_init_attr->send_cq = qp->send_cq;
+   qp_init_attr->srq = qp->srq;
+
+   return err;
 }
 EXPORT_SYMBOL(ib_query_qp);
 
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_srq.c
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c
@@ -357,6 +357,40 @@ int mthca_modify_srq(struct ib_srq *ibsr
return 0;
 }
 
+
+int mthca_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr)
+{
+   struct mthca_dev *dev = to_mdev(ibsrq->device);
+   struct mthca_srq *srq = to_msrq(ibsrq);
+   struct mthca_mailbox *mailbox;
+   struct mthca_arbel_srq_context *arbel_ctx;
+   u8 status;
+   int err;
+
+   mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
+   if (IS_ERR(mailbox))
+   return PTR_ERR(mailbox);
+
+   err = mthca_QUERY_SRQ(dev, srq->srqn, mailbox, &status);
+   if (err)
+   goto exit;
+   if (mthca_is_memfree(dev)) {
+   arbel_ctx = mailbox->buf;
+   srq_attr->srq_limit = arbel_ctx->limit_watermark;
+   }
+   else
+   srq_attr->srq_limit = 0;
+
+   srq_attr->pd = ibsrq->pd;
+   srq_attr->max_wr = srq->max;
+   srq_attr->max_sge = srq->max_gs;
+
+exit:
+   mthca_free_mailbox(dev, mailbox);
+
+   return err;
+}
+
 void mthca_srq_event(struct mthca_dev *dev, u32 srqn,
 enum ib_event_type event_type)
 {
Index: openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h
===
--- openib_gen2.orig/drivers/infiniband/include/rdma/ib_verbs.h
+++ openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h
@@ -423,9 +423,10 @@ enum ib_srq_attr_mask {
 };
 
 struct ib_srq_attr {
-   u32 max_wr;
-   u32 max_sge;
-   u32 srq_limit;
+   u32max_wr;
+   u32max_sge;
+   u32srq_limit;
+   struct ib_pd  *pd;
 };
 
 struct ib_srq_init_attr {
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1270,7 +1270,10 @@ int mthca_register_device(struct mthca_d
(1ull << IB_USER_VERBS_CMD_DETACH_MCAST)|
(1ull << IB_USER_VERBS_CMD_CREATE_SRQ)  |
(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ)  |
-   (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
+   (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+   (1ull << IB_USER_VERBS_CMD_QUERY_QP)|
+   (1ull << IB_USER_VERBS_CMD_QUERY_SRQ);
+
dev->ib_dev.node_type= RDMA_NODE_IB_CA;
dev->ib_dev.phys_port_cnt= dev->limits.num_ports;
dev->ib_dev.dma_device   = &dev->pdev->dev;
@@ -1291,7 +1294,8 @@ int mthca_register_device(struct mthca_d
 
if (dev->mthca_flags & MTHCA_FLAG_SRQ) {
dev->ib_dev.create_srq   = mthca_create_srq;
-

[openib-general] [PATCH] mthca: query_qp and query_srq

2006-02-12 Thread Eli Cohen
Signed-off-by: Eli Cohen <[EMAIL PROTECTED]>
Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

implement ib_query_qp() and ib_query_srq() for mthca.

Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h
@@ -303,6 +303,8 @@ int mthca_RESIZE_CQ(struct mthca_dev *de
u8 *status);
 int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox
*mailbox,
int srq_num, u8 *status);
+int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num,
+   struct mthca_mailbox *mailbox, u8 *status);
 int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox
*mailbox,
int srq_num, u8 *status);
 int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8
*status);
Index: openib_gen2/drivers/infiniband/core/verbs.c
===
--- openib_gen2.orig/drivers/infiniband/core/verbs.c
+++ openib_gen2/drivers/infiniband/core/verbs.c
@@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr)
 {
-   return qp->device->query_qp ?
+   int err;
+
+   err = qp->device->query_qp ?
qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) :
-ENOSYS;
+   if (err)
+   return err;
+   qp_init_attr->recv_cq = qp->recv_cq;
+   qp_init_attr->send_cq = qp->send_cq;
+   qp_init_attr->srq = qp->srq;
+
+   return err;
 }
 EXPORT_SYMBOL(ib_query_qp);
 
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_srq.c
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c
@@ -357,6 +357,40 @@ int mthca_modify_srq(struct ib_srq *ibsr
return 0;
 }
 
+
+int mthca_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr)
+{
+   struct mthca_dev *dev = to_mdev(ibsrq->device);
+   struct mthca_srq *srq = to_msrq(ibsrq);
+   struct mthca_mailbox *mailbox;
+   struct mthca_arbel_srq_context *arbel_ctx;
+   u8 status;
+   int err;
+
+   mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
+   if (IS_ERR(mailbox))
+   return PTR_ERR(mailbox);
+
+   err = mthca_QUERY_SRQ(dev, srq->srqn, mailbox, &status);
+   if (err)
+   goto exit;
+   if (mthca_is_memfree(dev)) {
+   arbel_ctx = mailbox->buf;
+   srq_attr->srq_limit = arbel_ctx->limit_watermark;
+   }
+   else
+   srq_attr->srq_limit = 0;
+
+   srq_attr->pd = ibsrq->pd;
+   srq_attr->max_wr = srq->max;
+   srq_attr->max_sge = srq->max_gs;
+
+exit:
+   mthca_free_mailbox(dev, mailbox);
+
+   return err;
+}
+
 void mthca_srq_event(struct mthca_dev *dev, u32 srqn,
 enum ib_event_type event_type)
 {
Index: openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h
===
--- openib_gen2.orig/drivers/infiniband/include/rdma/ib_verbs.h
+++ openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h
@@ -423,9 +423,10 @@ enum ib_srq_attr_mask {
 };
 
 struct ib_srq_attr {
-   u32 max_wr;
-   u32 max_sge;
-   u32 srq_limit;
+   u32max_wr;
+   u32max_sge;
+   u32srq_limit;
+   struct ib_pd  *pd;
 };
 
 struct ib_srq_init_attr {
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c
===
--- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1270,7 +1270,10 @@ int mthca_register_device(struct mthca_d
(1ull << IB_USER_VERBS_CMD_DETACH_MCAST)|
(1ull << IB_USER_VERBS_CMD_CREATE_SRQ)  |
(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ)  |
-   (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
+   (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+   (1ull << IB_USER_VERBS_CMD_QUERY_QP)|
+   (1ull << IB_USER_VERBS_CMD_QUERY_SRQ);
+
dev->ib_dev.node_type= RDMA_NODE_IB_CA;
dev->ib_dev.phys_port_cnt= dev->limits.num_ports;
dev->ib_dev.dma_device   = &dev->pdev->dev;
@@ -1291,7 +1294,8 @@ int mthca_register_device(struct mthca_d
 
if (dev->mthca_flags & MTHCA_FLAG_SRQ) {
dev->ib_dev.create_srq   = mthca_create_srq;
-

RE: [openib-general] DHCP over Infiniband

2005-11-03 Thread Eli Cohen
The client is Etherboot's client, used to configure a node at boot time. The server is ISC. Can you explain what you mean by "the QPN issue"?

-Original Message-
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 03, 2005 6:07 PM
To: Eli Cohen
Cc: 'openib-general@openib.org'
Subject: Re: [openib-general] DHCP over Infiniband



On Thu, 2005-11-03 at 11:01, Eli Cohen wrote:
> Hi,
> has anyone had the chance to run a DHCP server on an Infiniband
> interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do
> not get responses from the server to DHCP requests. When running
> tcpdump on ib0 interface I can see the requests but the server does
> not respond. The server's version is isc-dhcpd-V3.0.3. I also tried
> version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with
> Mellanox's IBGD1.8 and the server responds to requests. I still had a
> problem that the server does not set the client identifier option in
> its responses although the client does set this option. If you have
> any experience with this please let me know.


What DHCP server and what client are you using ? This has been done with
the ISC ones. It requires modifications due to the difference in
hardware addresses (and there is the QPN issue).


-- Hal


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general


To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] DHCP over Infiniband

2005-11-03 Thread Eli Cohen
Hi,
has anyone had the chance to run a DHCP server on an Infiniband interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do not get responses from the server to DHCP requests. When running tcpdump on ib0 interface I can see the requests but the server does not respond. The server's version is isc-dhcpd-V3.0.3. I also tried version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with Mellanox's IBGD1.8 and the server responds to requests. I still had a problem that the server does not set the client identifier option in its responses although the client does set this option. If you have any experience with this please let me know.

Thanks
Eli



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster

2005-10-10 Thread Eli Cohen
David,
IBGD 1.7 does not support kernel 2.6.11 so I assume you have made changes to IBGD to make it compile.
In the files you sent I can't see a call to ptrace with PTRACE_PEEKTEXT, but I can see a call to PTRACE_PEEKDATA. Note that in the IBGD stack, registered buffers are not inherited by a child process when the parent forks. This is accomplished by setting the VM_DONTCOPY flag on the vma. It is done this way to retain the virtual-to-physical translation of the pages in the parent by disabling COW on them. So the child may not even have these buffers in its address space, and this could be the reason why ptrace fails.
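A minimal sketch of that mechanism, assuming this is roughly how IBGD marks a
registered region (the helper name is illustrative, not actual IBGD code):

#include <linux/mm.h>

/* Mark the vma covering a registered buffer so fork() does not copy the
 * mapping into the child; COW is never applied to these pages, so the
 * parent's virtual-to-physical translation stays valid. */
static void mark_region_dontcopy(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	down_write(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (vma)
		vma->vm_flags |= VM_DONTCOPY;
	up_write(&mm->mmap_sem);
}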

Note also that IBGD 1.8 is the latest release and it does support kernel 2.6.11, so you may consider using it, though the description above also holds for IBGD 1.8.

Eli


-Original Message-
From: David Lecomber [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 10, 2005 11:23 AM
To: openib-general@openib.org
Subject: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster



Dear all,


I'm having a kernel problem which I believe to be caused by the
infiniband drivers on the system I am using.


Kernel 2.6.11, Mellanox software stack IBGD 1.7.0.


Essentially, once an MPI code is started, the kernel refuses to allow
ptrace() access to the text segment (ie. where the program instructions
lie), although it is possible to access the data segment.


This means debugging is impossible (gdb, idb, ddt, etc.).


The attached code demonstrates the problem.


Untar, and then make.  Run the 'mpi' program, and pick a line of it's
output, paste into another shell.  On the standard, non MPI test code,
the ptrace reads are all successful.  On the MPI test, it gives an error
for the text segment reads..


Is this a known issue - are there any upgrades/fixes which should have
been applied?  I would appreciate if someone could run the test
suggested on a really new setup, and see if the error happens.



Regards
David
-- 
David Lecomber, CTO, Allinea Software
tel: +44 1926 623231  fax: +44 1926 623232



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] RE: registering read-only memory

2005-05-30 Thread Eli Cohen
OK thanks. I had another problem in MVAPI.


-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 27, 2005 7:49 PM
To: Eli Cohen
Cc: Michael S. Tsirkin; openib-general@openib.org
Subject: Re: registering read-only memory



    Eli> I saw this this on Suse 9.3 and the memory was like: static
    Eli> const char *my_string = "Hello world"; defined locally in a
    Eli> function. This produced a VMA with VM_IO set.


What kernel version/architecture?


I'm having a hard time seeing how it's possible to get a VMA with
VM_IO set except by using remap_pfn_range(), and I can't see how
ordinary const data from userspace could cause that.


 - R.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] RE: registering read-only memory

2005-05-26 Thread Eli Cohen
R> Are you sure?  It seems to work for me if I just pass write=0 to
R> get_user_pages().


I saw this on Suse 9.3 and the memory was like:
static const char *my_string = "Hello world";
defined locally in a function. This produced a VMA with VM_IO set.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] RE: registering read-only memory

2005-05-26 Thread Eli Cohen
I'm afraid that won't solve the problem. For some reason the vma of such a region is flagged with VM_IO, and that's why get_user_pages() will fail. We need to find another way to get the physical address of such memory. One thing I can think of is implementing something similar to follow_page(). Or better, if the kernel would export follow_page().

-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 26, 2005 10:09 PM
To: Eli Cohen
Cc: Michael S. Tsirkin; openib-general@openib.org
Subject: Re: registering read-only memory



OK, I think I understand the problem.  The compiler puts a string
defined with "const char *foo" on a read-only page, but in
uverbs_mem.c we always pass write=1 to get_user_pages().  I need to
fix the code so that it only asks for writeable pages if we're
registering with writable permissions.


 - R.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] RE: registering read-only memory

2005-05-24 Thread Eli Cohen
Please try:
const char *foo="Michael Tsirkin";


as opposed to
const char foo[]="Michael Tsirkin";


-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 24, 2005 10:07 PM
To: Michael S. Tsirkin
Cc: openib-general@openib.org; Eli Cohen
Subject: Re: registering read-only memory



    > const char foo[]="Michael Tsirkin";
    > ibv_reg_mr(ctx->pd, foo, strlen(foo), 0);


This seemed to work fine on my system (I got a non-null ibv_mr back
with valid-looking L_Key/R_Key).


 - R.



___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] [PATCH] cache.c fixes

2004-11-25 Thread Eli Cohen
Looks like the allocation size is buggy, and so is the cleanup.


Index: cache.c
===
--- cache.c (revision 1292)
+++ cache.c (working copy)
@@ -249,10 +249,10 @@


    device->cache.pkey_cache =
    kmalloc(sizeof *device->cache.pkey_cache *
-   end_port(device) - start_port(device), GFP_KERNEL);
+   (end_port(device) - start_port(device) + 1), GFP_KERNEL);
    device->cache.gid_cache =
    kmalloc(sizeof *device->cache.pkey_cache *
-   end_port(device) - start_port(device), GFP_KERNEL);
+   (end_port(device) - start_port(device) + 1), GFP_KERNEL);


    if (!device->cache.pkey_cache || !device->cache.gid_cache) {
    printk(KERN_WARNING "Couldn't allocate cache "
@@ -280,8 +280,14 @@
    }


 err:
-   kfree(device->cache.pkey_cache);
-   kfree(device->cache.gid_cache);
+   if (device->cache.pkey_cache) {
+    kfree(device->cache.pkey_cache);
+    device->cache.pkey_cache = NULL;
+  }
+   if (device->cache.gid_cache) {
+    kfree(device->cache.gid_cache);
+    device->cache.gid_cache = NULL;
+  }
 }


 void ib_cache_cleanup_one(struct ib_device *device)
@@ -296,8 +302,12 @@
    kfree(device->cache.gid_cache[p]);
    }


-   kfree(device->cache.pkey_cache);
-   kfree(device->cache.gid_cache);
+  if (device->cache.pkey_cache) {
+    kfree(device->cache.pkey_cache);
+  }
+  if (device->cache.gid_cache) {
+    kfree(device->cache.gid_cache);
+  }
 }


 struct ib_client cache_client = {



___
openib-general mailing list
[EMAIL PROTECTED]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general