Re: [PATCH] infiniband-diags/ibportstate: allow changes to CA portinfo parameters

2010-01-11 Thread Ralph Campbell
It's been 2 months since I posted this.
What is the current status? Last comment I saw was from Ira
saying either add this functionality to ibportstate or
rename ibportstate but don't split it off into a new command.
My preference is to go with the original patch. I haven't
seen any strong objection to it.
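
For readers skimming the thread: the patch description quoted in this message says several fields can be changed in one invocation, such as "lid 23 arm". A minimal sketch of that style of keyword parsing follows. The struct, the function name, and the table layout are hypothetical and are not taken from ibportstate; the real patch uses its own port_args[] array.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical option table entry; not the layout used by ibportstate. */
struct port_arg {
    const char *name;
    int takes_value; /* 1 if the keyword is followed by a number */
    int set;         /* 1 once the keyword has been seen */
    int val;         /* parsed value, valid only if takes_value and set */
};

/*
 * Parse a keyword sequence such as {"lid", "23", "arm"}.
 * Returns 0 on success, -1 on an unknown keyword or a missing/bad value.
 */
static int parse_port_args(struct port_arg *tbl, size_t ntbl,
                           int argc, char **argv)
{
    int i;
    size_t j;

    for (i = 0; i < argc; i++) {
        /* Look the keyword up in the table. */
        for (j = 0; j < ntbl; j++)
            if (strcmp(argv[i], tbl[j].name) == 0)
                break;
        if (j == ntbl)
            return -1;              /* unknown keyword */

        if (tbl[j].takes_value) {
            char *end;

            if (++i >= argc)
                return -1;          /* value missing */
            tbl[j].val = (int)strtol(argv[i], &end, 0);
            if (end == argv[i] || *end != '\0')
                return -1;          /* not a number */
        }
        tbl[j].set = 1;
    }
    return 0;
}
```

Keeping a per-keyword "set" marker lets the caller apply every requested change to a single PortInfo buffer before sending one Set, which is the behaviour the description implies.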

On Thu, 2009-11-05 at 11:56 -0800, Ralph Campbell wrote:
> This patch adds new commands to ibportstate to support initializing
> the link for CAs connected back-to-back. It also allows more than
> one field to be changed at the same time such as "lid 23 arm" or
> "width 1 speed 3 enable".
> 
> Signed-off-by: Ralph Campbell 
> ---
> 
> diff --git a/infiniband-diags/src/ibportstate.c 
> b/infiniband-diags/src/ibportstate.c
> index e631bfd..192b14e 100644
> --- a/infiniband-diags/src/ibportstate.c
> +++ b/infiniband-diags/src/ibportstate.c
> @@ -46,93 +46,133 @@
> 
>  #include "ibdiag_common.h"
> 
> +enum port_ops {
> +   QUERY,
> +   ENABLE,
> +   RESET,
> +   DISABLE,
> +   SPEED,
> +   WIDTH,
> +   DOWN,
> +   ARM,
> +   ACTIVE,
> +   VLS,
> +   MTU,
> +   LID,
> +   SMLID,
> +   LMC,
> +};
> +
>  struct ibmad_port *srcport;
> +int speed = 15;
> +int width = 255;
> +int lid;
> +int smlid;
> +int lmc;
> +int mtu;
> +int vls;
> +
> +struct {
> +   const char *name;
> +   int *val;
> +   int set;
> +} port_args[] = {
> +   { "query", NULL, 0 },   /* QUERY */
> +   { "enable", NULL, 0 },  /* ENABLE */
> +   { "reset", NULL, 0 },   /* RESET */
> +   { "disable", NULL, 0 }, /* DISABLE */
> +   { "speed", &speed, 0 }, /* SPEED */
> +   { "width", &width, 0 }, /* WIDTH */
> +   { "down", NULL, 0 },/* DOWN */
> +   { "arm", NULL, 0 }, /* ARM */
> +   { "active", NULL, 0 },  /* ACTIVE */
> +   { "vls", &vls, 0 }, /* VLS */
> +   { "mtu", &mtu, 0 }, /* MTU */
> +   { "lid", &lid, 0 }, /* LID */
> +   { "smlid", &smlid, 0 }, /* SMLID */
> +   { "lmc", &lmc, 0 }, /* LMC */
> +};
> +
> +#define NPORT_ARGS (sizeof(port_args) / sizeof(port_args[0]))
> 
>  /***/
> 
> +/*
> + * Return 1 if port is a switch, else zero.
> + */
>  static int get_node_info(ib_portid_t * dest, uint8_t * data)
>  {
> int node_type;
> 
> if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport))
> -   return -1;
> +   IBERROR("smp query nodeinfo failed");
> 
> node_type = mad_get_field(data, 0, IB_NODE_TYPE_F);
> if (node_type == IB_NODE_SWITCH)/* Switch NodeType ? */
> -   return 0;
> -   else
> return 1;
> +   else
> +   return 0;
>  }
> 
> -static int get_port_info(ib_portid_t * dest, uint8_t * data, int portnum,
> -int port_op)
> +static void get_port_info(ib_portid_t * dest, uint8_t * data, int portnum)
> +{
> +   if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, 
> srcport))
> +   IBERROR("smp query portinfo failed");
> +}
> +
> +static void show_port_info(ib_portid_t * dest, uint8_t * data, int portnum)
>  {
> char buf[2048];
> char val[64];
> 
> -   if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, 
> srcport))
> -   return -1;
> -
> -   if (port_op != 4) {
> -   mad_dump_portstates(buf, sizeof buf, data, sizeof data);
> -   mad_decode_field(data, IB_PORT_LINK_WIDTH_SUPPORTED_F, val);
> -   mad_dump_field(IB_PORT_LINK_WIDTH_SUPPORTED_F,
> -  buf + strlen(buf), sizeof buf - strlen(buf),
> -  val);
> -   sprintf(buf + strlen(buf), "%s", "\n");
> -   mad_decode_field(data, IB_PORT_LINK_WIDTH_ENABLED_F, val);
> -   mad_dump_field(IB_PORT_LINK_WIDTH_ENABLED_F, buf + 
> strlen(buf),
> -  sizeof buf - strlen(buf), val);
> -   sprintf(buf + strlen(buf), "%s", "\n");
> -   mad_decode_field(data, IB_PORT_LINK_WIDTH_ACTIVE_F, val);
> -   mad_dump_field(IB_PORT_LINK_WIDTH_ACTIVE_F, buf + strlen(buf),
> -  sizeof buf - strlen(buf), val);
> -   sprintf(buf + strlen(buf), "%s", "\n");
> -   mad_decode_field(data, IB_PORT_LINK_SPEED_SUPPORTED_F, val);
> -   mad_dump_field(IB_PORT_LINK_SPEED_SUPPORTED_F,
> -  buf + strlen(buf), sizeof buf - strlen(buf),
> -  val);
> -   sprintf(buf + strlen(buf), "%s", "\n");
> -   mad_decode_field(data, IB_PORT_LINK_SPEED_ENABLED_F, val);
> -   mad_dump_field(IB_PORT_LINK_SPEED_ENABLED_F, buf + 
> strlen(buf),
> -  sizeof buf - strlen(buf), val);
> -   sprintf(buf + strlen(buf), "%s", "\n");
> -   mad_decode_field(data, IB_PORT_

RE: setting responder_resources and initiator_depth

2010-01-11 Thread Sean Hefty
>Currently, the RDS code sets both of these to 1 when calling
>rdma_accept(). Using an arbitrary greater value (e.g. 64) doesn't seem
>to work right.
>
>I can call ib_query_qp(). Am I to assume I can set initiator_depth up to
>ib_qp_attr.max_rd_atomic, and responder_resources to
>ib_qp_attr.max_dest_rd_atomic and this will work?

You need to coordinate the values with those supported by the remote end of the
connection.  See the rdma_connect and rdma_accept man pages for additional
details.  The librdmacm has checks in rdma_connect and rdma_accept to verify
that the values provided by the user at least make sense for the local HW.
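
As a rough illustration of "coordinating" the two values, the sketch below clamps each side's request to what the local device supports and what the peer advertised. The struct and function names are hypothetical, and the rule shown is an assumption about a sensible negotiation, not a copy of what librdmacm does; the man pages remain the authority.

```c
#include <stdint.h>

/* Hypothetical container for the two connection parameters. */
struct rd_atomic_params {
    uint8_t responder_resources; /* inbound RDMA reads/atomics serviced */
    uint8_t initiator_depth;     /* outbound RDMA reads/atomics issued */
};

static uint8_t min_u8(uint8_t a, uint8_t b)
{
    return a < b ? a : b;
}

/*
 * local_max: limits of the local HCA (assumed to come from a device query).
 * peer:      values advertised by the remote side, from the peer's own
 *            point of view.
 */
static struct rd_atomic_params
negotiate_rd_atomic(struct rd_atomic_params local_max,
                    struct rd_atomic_params peer)
{
    struct rd_atomic_params out;

    /* No need to service more inbound reads than the peer will issue. */
    out.responder_resources = min_u8(local_max.responder_resources,
                                     peer.initiator_depth);
    /* Never issue more reads than the peer is willing to service. */
    out.initiator_depth = min_u8(local_max.initiator_depth,
                                 peer.responder_resources);
    return out;
}
```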

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


setting responder_resources and initiator_depth

2010-01-11 Thread Andy Grover
Hi all,

Currently, the RDS code sets both of these to 1 when calling
rdma_accept(). Using an arbitrary greater value (e.g. 64) doesn't seem
to work right.

I can call ib_query_qp(). Am I to assume I can set initiator_depth up to
ib_qp_attr.max_rd_atomic, and responder_resources to
ib_qp_attr.max_dest_rd_atomic and this will work?

Thanks -- Regards -- Andy


[ANNOUNCE] uDAPL v2 - dapl-2.0.26 release

2010-01-11 Thread Davis, Arlin R
 

New release for uDAPL 2.0 available on the OFA download page and in my git tree.

Bug fixes for RDMAoE.

md5sum: 32924de6b4f1fa351aa1468b19501981 dapl-2.0.26.tar.gz 

Summary of changes since last release: 

Release 2.0.26 fixes (tested on OFED-RDMAoE-rc6, Mellanox MHJH29-XTC, v2.7.0): 

v2 - openib_common: add check for both gid and global routing in RTR 
v2 - openib_common: remote memory read priviledge set multiple times 
v2 - scm, ucm: DAPL_GLOBAL_ROUTING enable setting to use GID's, causes segv 

Vlad, please pull new package into latest OFED 1.5 RDMAoE build and install the 
following:
 
dapl-2.0.26-1 
dapl-utils-2.0.26-1 
dapl-devel-2.0.26-1 
dapl-debuginfo-2.0.26-1 
compat-dapl-1.2.15-1 
compat-dapl-devel-1.2.15-1 

See http://www.openfabrics.org/downloads/dapl/ for more details.

Thanks,

-arlin





Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances

2010-01-11 Thread Vladislav Bolkhovitin

Chris Worley, on 01/09/2010 01:39 AM wrote:

I thought if the device was opened with the O_DIRECT flag, then the
scheduler should have nothing to coalesce.

Depends on how many I/Os your application has in flight at once,
assuming it is using AIO or threads. If you have more requests in flight
than can be queued, the block layer will coalesce if possible.


I do use AIO, always 64 threads, each w/ 64 outstanding I/O's.  Local
or iSER initiator based, I never see any coalescing.  Only w/ SRP.


The SRP initiator does not seem to be well optimized for performance.
The iSER initiator is noticeably better in this area.



There is the scst_vdisk "Direct I/O" option that's been commented out
of the code, as it's not supposed to work... maybe direct I/O doesn't
work... but that would be the target side.


O_DIRECT for vdisk is supposed to work. It's a matter of a small patch 
for the kernel, see http://scst.sourceforge.net/contributing.html#O_DIRECT.


Meanwhile, you can use fileio_tgt handler, with which O_DIRECT works well.

Vlad


Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-11 Thread Roland Dreier

 > I wonder if the overflow check could be removed altogether and be
 > left to the ULP (kernel is trusted environment...) is there any risk
 > in doing so? this way the WR posting code will not experience
 > contention with the poll WC code on the CQ lock.

We could do that I guess if it's a really big performance gain.  I do
think it is quite common to see this WQ overflow check trigger, even for
kernel code, and so we would have to be sure the performance gain is
worth making these bugs much harder to find and fix.
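
For context, a work queue overflow check of the kind being discussed can be sketched as below. The struct and field names are illustrative only and this is not the mlx4 code; in a real driver the tail is advanced from the completion polling path, which is why an accurate check may need the CQ lock mentioned in the quoted question.

```c
#include <stdbool.h>

/* Hypothetical work queue bookkeeping; field names are illustrative. */
struct wq_state {
    unsigned head;     /* incremented for each posted work request */
    unsigned tail;     /* incremented as completions are polled */
    unsigned max_post; /* number of work requests the queue can hold */
};

/* Return true if posting nreq more work requests would overflow the queue. */
static bool wq_would_overflow(const struct wq_state *wq, unsigned nreq)
{
    /* Unsigned subtraction stays correct when head wraps past zero. */
    unsigned in_flight = wq->head - wq->tail;

    return in_flight + nreq > wq->max_post;
}
```

Dropping the check would move this arithmetic into every ULP, or leave overflows undetected until the hardware misbehaves.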

 - R.


Re: upstream mlx4/ib/4K mtu support

2010-01-11 Thread Vladimir Sokolovsky

Or Gerlitz wrote:

Hi Vlad, I came across this ofed patch which isn't upstream. Is it a must
for making mlx4/ib/4K MTU work? Was it rejected upstream? If so, why?

Or.



Hi Or,
Yes, it is a must for mlx4/ib/4K mtu.
I forgot to send it upstream.
I'll do this shortly.

Thanks,
Vladimir



mlx4/IB: Add set_4k_mtu module parameter.

It controls the InfiniBand link MTU for all IB ports in a host.

Signed-off-by: Vladimir Sokolovsky 
---
Index: ofed_kernel-fixes/drivers/net/mlx4/port.c
===
--- ofed_kernel-fixes.orig/drivers/net/mlx4/port.c  2009-11-09 
02:20:06.0 +0200
+++ ofed_kernel-fixes/drivers/net/mlx4/port.c   2009-11-09 02:21:46.0 
+0200
@@ -37,6 +37,10 @@

 #include "mlx4.h"

+int mlx4_ib_set_4k_mtu = 0;
+module_param_named(set_4k_mtu, mlx4_ib_set_4k_mtu, int, 0444);
+MODULE_PARM_DESC(set_4k_mtu, "attempt to set 4K MTU to all ConnectX ports");
+
 #define MLX4_MAC_VALID (1ull << 63)
 #define MLX4_MAC_MASK  0xULL

@@ -308,6 +312,9 @@

memset(mailbox->buf, 0, 256);

+   if (mlx4_ib_set_4k_mtu)
+   ((__be32 *) mailbox->buf)[0] |= cpu_to_be32((1 << 22) | (1 << 21) | (5 
<< 12) | (2 << 4));
+
((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
   MLX4_CMD_TIME_CLASS_B);





Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-11 Thread Hal Rosenstock
On Tue, Jan 5, 2010 at 5:32 AM, Eli Cohen  wrote:
> Add support for RoCEE device binding and IP --> GID resolution. Path resolving
> and multicast joining are implemented within cma.c by filling the responses 
> and
> pushing the callbacks to the cma work queue. IP->GID resolution always yields
> IPv6 link local addresses - remote GIDs are derived from the destination MAC
> address of the remote port. Multicast GIDs are always mapped to multicast MACs
> as is done in IPv6. Some helper functions are added to ib_addr.h.  IPv4
> multicast is enabled by translating IPv4 multicast addresses to IPv6 multicast
> as described in
> http://www.mail-archive.com/i...@sunroof.eng.sun.com/msg02134.html.
>
> Signed-off-by: Eli Cohen 
> ---
>  drivers/infiniband/core/cma.c  |  261 
> ++--
>  drivers/infiniband/core/ucma.c |   45 ++-
>  include/rdma/ib_addr.h         |   98 +++-
>  3 files changed, 385 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index fbdd731..e8e28ae 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c

[snip...]

> @@ -1707,6 +1740,78 @@ static int cma_resolve_iw_route(struct rdma_id_private 
> *id_priv, int timeout_ms)
>        return 0;
>  }
>
> +static int cma_resolve_rocee_route(struct rdma_id_private *id_priv)
> +{
> +       struct rdma_route *route = &id_priv->id.route;
> +       struct rdma_addr *addr = &route->addr;
> +       struct cma_work *work;
> +       int ret;
> +       struct sockaddr_in *src_addr = (struct sockaddr_in 
> *)&route->addr.src_addr;
> +       struct sockaddr_in *dst_addr = (struct sockaddr_in 
> *)&route->addr.dst_addr;
> +       struct net_device *ndev = NULL;
> +
> +       if (src_addr->sin_family != dst_addr->sin_family)
> +               return -EINVAL;
> +
> +       work = kzalloc(sizeof *work, GFP_KERNEL);
> +       if (!work)
> +               return -ENOMEM;
> +
> +       work->id = id_priv;
> +       INIT_WORK(&work->work, cma_work_handler);
> +
> +       route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL);
> +       if (!route->path_rec) {
> +               ret = -ENOMEM;
> +               goto err1;
> +       }
> +
> +       route->num_paths = 1;
> +
> +       rocee_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr);
> +       rocee_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr);
> +
> +       route->path_rec->hop_limit = 2;
> +       route->path_rec->reversible = 1;
> +       route->path_rec->pkey = cpu_to_be16(0x);
> +       route->path_rec->mtu_selector = 2;
> +
> +       if (addr->dev_addr.bound_dev_if) {
> +               ndev = dev_get_by_index(&init_net, 
> addr->dev_addr.bound_dev_if);
> +               if (!ndev)
> +                       return -ENODEV;
> +       }
> +
> +       if (ndev)
> +               route->path_rec->mtu = rocee_get_mtu(ndev->mtu);
> +       route->path_rec->rate_selector = 2;
> +       if (ndev)
> +               route->path_rec->rate = rocee_get_rate(ndev);

The rocee_get_rate routine seems to merely get the (local) device
rate. So this seems to me to only work in a homogeneous (single speed
subnet) but what about a heterogeneous one (either different speed
links or links negotiated down) ? What happens if a link internal to
the subnet is slower ? Isn't this important for setting a proper
static rate control ?

A similar thing may also be true for other path related parameters if
they can vary along the path.
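
To make the concern concrete: if per-link rates along a path were known, the usable static rate would be the slowest of them, not the local port rate. A minimal sketch, assuming the link rates are available as plain numbers (which the quoted patch does not provide):

```c
#include <stddef.h>

/*
 * Return the slowest link rate along a path, or 0 for an empty path.
 * Rates are plain numbers in any consistent unit (for example Gb/s);
 * this is not the encoded IB rate enumeration.
 */
static unsigned path_bottleneck_rate(const unsigned *link_rates, size_t nlinks)
{
    unsigned min_rate;
    size_t i;

    if (nlinks == 0)
        return 0;

    min_rate = link_rates[0];
    for (i = 1; i < nlinks; i++)
        if (link_rates[i] < min_rate)
            min_rate = link_rates[i];
    return min_rate;
}
```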

-- Hal

[snip...]


Re: [PATCH] tests/subnet_discover: discover test utility

2010-01-11 Thread Hal Rosenstock
Hi Sasha,

On Mon, Dec 28, 2009 at 4:22 AM, Sasha Khapyorsky  wrote:
> Hi Ira,
>
> On 09:35 Mon 21 Dec     , Sasha Khapyorsky wrote:
>>
>> The errors are response timeouts. I guess that most of them are due
>> to switches' VL15 overflow (could be verified by VL15Dropped counter
>> evaluation). I will look into this more deeply.
>
> I made a couple of modifications to the code (the exact log is listed below).
> In particular, there is now a default limit on the number of outstanding MADs
> on the wire, and proper tracking of failed (timed-out) MADs. I tested
> this where possible. Could you re-run this? Thanks.
>
> Sasha
>

[snip...]

> commit da6aa19840cb2d37e8cd3daa3874b87657a76ddc
> Author: Sasha Khapyorsky 
> Date:   Fri Dec 25 16:24:13 2009 +0200
>
>    tests/subnet_discover: --maxsmps (-n) option
>
>    This implements the limitation of outstanding SMPs on a wire at any
>    one time. --maxsmps=0 means - no limit.
>
>    Signed-off-by: Sasha Khapyorsky 
>
> diff --git a/tests/subnet_discover.c b/tests/subnet_discover.c
> index 7f8a85c..42e7aee 100644
> --- a/tests/subnet_discover.c
> +++ b/tests/subnet_discover.c
> @@ -40,6 +40,7 @@ static struct node *node_array[32 * 1024];
>  static unsigned node_count = 0;
>  static unsigned trid_cnt = 0;
>  static unsigned outstanding = 0;
> +static unsigned max_outstanding = 8;

Any reason why this default is different from the one which OpenSM
uses ? Seems to me it should be the same (or less).

-- Hal

>  static unsigned timeout = 100;
>  static unsigned retries = 3;
>  static unsigned verbose = 0;
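
The throttling described in the quoted commit message (a cap on outstanding SMPs, with 0 meaning no limit) can be sketched as follows. The struct and function names are hypothetical and only loosely follow the variables in the quoted hunk; this is not the subnet_discover code.

```c
#include <stdbool.h>

/* Hypothetical throttle state; names loosely follow the quoted patch. */
struct smp_throttle {
    unsigned outstanding;     /* SMPs sent and not yet answered or expired */
    unsigned max_outstanding; /* 0 means no limit */
};

static bool smp_can_send(const struct smp_throttle *t)
{
    return t->max_outstanding == 0 || t->outstanding < t->max_outstanding;
}

/* Record a send if allowed; return false if the caller must wait. */
static bool smp_try_send(struct smp_throttle *t)
{
    if (!smp_can_send(t))
        return false;
    t->outstanding++;
    return true;
}

/* Call once per SMP, on its response or on its final timeout. */
static void smp_done(struct smp_throttle *t)
{
    if (t->outstanding > 0)
        t->outstanding--;
}
```

Counting timed-out SMPs as done is what keeps the window from shrinking permanently when a switch drops VL15 traffic.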

[snip...]


Re: RDMA Read sge errors

2010-01-11 Thread Or Gerlitz
Jack, I see now that commit cd155c1 "IB/mlx4: Fix creation of kernel QP with 
max number of send s/g entries" is in mainline but not in OFED 1.4.x, and that 
mlx4_0090_fix_sq_wrs.patch (below) is in OFED but not in mainline. Was it 
rejected from the mainline kernel? If so, why?

Or.


1. Limit qp resources accepted for ib_create_qp() to the limits reported
   in ib_query_device(). In kernel space, make sure that the limits
   returned to the caller following qp creation also lie within the
   reported device limits. For userspace, report as before, and
   do adjustment in libmlx4 (so as not to break ABI).

2. Limit max number of wqes per QP reported when querying the device,
   so that ib_create_qp will never fail due to any additional headroom WQEs 
allocated.
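
The headroom arithmetic behind item 2 can be checked in isolation. The macros below are copied from the patch that follows; the helper function and the hardware limit used with it are hypothetical.

```c
/* Macros copied from the quoted patch. */
enum { MLX4_IB_SQ_MIN_WQE_SHIFT = 6 };
#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))

/*
 * Hypothetical helper: the largest queue size to advertise, given the raw
 * hardware WQE limit, so that the spare WQEs always fit.
 */
static int advertised_max_qp_wr(int hw_max_wqes)
{
    return hw_max_wqes - MLX4_IB_SQ_MAX_SPARE;
}
```

With the minimum WQE shift of 6 (64-byte WQEs), 2048 >> 6 is 32, so the spare count is 33; a device limit of 16384 WQEs would then be reported as 16351.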

Signed-off-by: Jack Morgenstein 

---
 drivers/infiniband/hw/mlx4/main.c|2 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |7 +++
 drivers/infiniband/hw/mlx4/qp.c  |   25 +++--
 3 files changed, 27 insertions(+), 7 deletions(-)

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c
@@ -122,7 +122,7 @@ static int mlx4_ib_query_device(struct i
props->max_mr_size = ~0ull;
props->page_size_cap   = dev->dev->caps.page_size_cap;
props->max_qp  = dev->dev->caps.num_qps - 
dev->dev->caps.reserved_qps;
-   props->max_qp_wr   = dev->dev->caps.max_wqes;
+   props->max_qp_wr   = dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE;
props->max_sge = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
props->max_cq  = dev->dev->caps.num_cqs - 
dev->dev->caps.reserved_cqs;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -44,6 +44,13 @@
 #include 
 #include 
 
+enum {
+   MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_ucontext {
struct ib_ucontext  ibucontext;
struct mlx4_uar uar;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
@@ -289,8 +289,9 @@ static int set_rq_size(struct mlx4_ib_de
   int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
/* Sanity check RQ size before proceeding */
-   if (cap->max_recv_wr  > dev->dev->caps.max_wqes  ||
-   cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+   if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+   cap->max_recv_sge >
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
return -EINVAL;
 
if (has_srq) {
@@ -309,8 +310,19 @@ static int set_rq_size(struct mlx4_ib_de
qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct 
mlx4_wqe_data_seg));
}
 
-   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
-   cap->max_recv_sge = qp->rq.max_gs;
+   /* leave userspace return values as they were, so as not to break ABI */
+   if (is_user) {
+   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
+   cap->max_recv_sge = qp->rq.max_gs;
+   } else {
+   cap->max_recv_wr  = qp->rq.max_post =
+   min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, 
qp->rq.wqe_cnt);
+   cap->max_recv_sge = min(qp->rq.max_gs,
+   min(dev->dev->caps.max_sq_sg,
+   dev->dev->caps.max_rq_sg));
+   }
+   /* We don't support inline sends for kernel QPs (yet) */
+
 
return 0;
 }
@@ -321,8 +333,9 @@ static int set_kernel_sq_size(struct mlx
int s;
 
/* Sanity check SQ size before proceeding */
-   if (cap->max_send_wr > dev->dev->caps.max_wqes  ||
-   cap->max_send_sge> dev->dev->caps.max_sq_sg ||
+   if (cap->max_send_wr > (dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE) ||
+   cap->max_send_sge>
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
cap->max_inline_data + send_wqe_overhead(type, qp->flags) +
sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
return -EINVAL;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel

upstream mlx4/ib/4K mtu support

2010-01-11 Thread Or Gerlitz
Hi Vlad, I came across this ofed patch which isn't upstream. Is it a must
for making mlx4/ib/4K MTU work? Was it rejected upstream? If so, why?

Or.


mlx4/IB: Add set_4k_mtu module parameter.

It controls the InfiniBand link MTU for all IB ports in a host.

Signed-off-by: Vladimir Sokolovsky 
---
Index: ofed_kernel-fixes/drivers/net/mlx4/port.c
===
--- ofed_kernel-fixes.orig/drivers/net/mlx4/port.c  2009-11-09 
02:20:06.0 +0200
+++ ofed_kernel-fixes/drivers/net/mlx4/port.c   2009-11-09 02:21:46.0 
+0200
@@ -37,6 +37,10 @@

 #include "mlx4.h"

+int mlx4_ib_set_4k_mtu = 0;
+module_param_named(set_4k_mtu, mlx4_ib_set_4k_mtu, int, 0444);
+MODULE_PARM_DESC(set_4k_mtu, "attempt to set 4K MTU to all ConnectX ports");
+
 #define MLX4_MAC_VALID (1ull << 63)
 #define MLX4_MAC_MASK  0xULL

@@ -308,6 +312,9 @@

memset(mailbox->buf, 0, 256);

+   if (mlx4_ib_set_4k_mtu)
+   ((__be32 *) mailbox->buf)[0] |= cpu_to_be32((1 << 22) | (1 << 
21) | (5 << 12) | (2 << 4));
+
((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
   MLX4_CMD_TIME_CLASS_B);


Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-11 Thread Eli Cohen
On Thu, Jan 07, 2010 at 11:24:51AM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2010 at 08:50:47AM -0800, Sean Hefty wrote:
> > >> > +  route->path_rec->hop_limit = 2;
> > >>
> > >
> > >The reason is that ib_init_ah_from_path() will not set IB_AH_GRH for
> > >hop_limit smaller than 2, and since GRH is required in RoCEE, and
> > >since this is specific to RoCEE, I put 2 to get past this check.
> > 
> > A hop limit of 2 seems wrong though.  Isn't there some change that
> > could be made to ib_init_ah_from_path() that would work?  It seems
> > like the PR itself should indicate somehow whether the path is for
> > IB or Ethernet.  (Special LID values or some flag...)
> 

So, adding a flag to ib_init_ah_from_path() seems a reasonable
solution. This function is called from three places:
cm_init_av_by_path() and cma_sidr_rep_handler() can check the port
link layer and set a flag to request GRH for Ethernet link layers. The
other place is IPoIB:path_rec_completion() where we need not require
GRH since IPoIB over RoCEE is disabled.
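
A rough sketch of the proposed decision, with hypothetical names: the existing hop_limit rule stays, and callers on an Ethernet link layer pass a flag that forces the GRH. The enum and both functions are illustrative and are not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-layer tag; not the kernel's enum. */
enum link_layer { LL_INFINIBAND, LL_ETHERNET };

/* Callers such as the CM paths would derive the flag from the port. */
static bool caller_forces_grh(enum link_layer ll)
{
    return ll == LL_ETHERNET;
}

/*
 * Existing behaviour as described in the thread: GRH only when hop_limit
 * is at least 2. The extra flag lets RoCEE request it regardless.
 */
static bool path_needs_grh(uint8_t hop_limit, bool force_grh)
{
    return force_grh || hop_limit > 1;
}
```

With this shape the path record could carry an honest hop limit on Ethernet instead of the artificial value 2.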


> I agree, on the wire the hop limit in the GRH should be 0 or 0xFF. 0
> is preferred since there is no support for routing.
> 

Why do you think we should use 0 for hop limit? Isn't 1 just as good?


[PATCH] ib/ipoib: remove TX moderation from the ethtool related code

2010-01-11 Thread Or Gerlitz
As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions",
there are no TX interrupts in the main code path. Change the ethtool
related code to comply with this, so that users will not be misled
into assuming they can control TX interrupt moderation. Pointed out by
Alex Vainman 

Signed-off-by: Or Gerlitz 

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index e9795f6..d10b4ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -55,9 +55,7 @@ static int ipoib_get_coalesce(struct net_device *dev,
struct ipoib_dev_priv *priv = netdev_priv(dev);

coal->rx_coalesce_usecs = priv->ethtool.coalesce_usecs;
-   coal->tx_coalesce_usecs = priv->ethtool.coalesce_usecs;
coal->rx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;
-   coal->tx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;

return 0;
 }
@@ -69,10 +67,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
int ret;

/*
-* Since IPoIB uses a single CQ for both rx and tx, we assume
-* that rx params dictate the configuration.  These values are
-* saved in the private data and returned when ipoib_get_coalesce()
-* is called.
+* These values are saved in the private data and returned
+* when ipoib_get_coalesce() is called
 */
if (coal->rx_coalesce_usecs   > 0x ||
coal->rx_max_coalesced_frames > 0x)
@@ -85,8 +81,6 @@ static int ipoib_set_coalesce(struct net_device *dev,
return ret;
}

-   coal->tx_coalesce_usecs   = coal->rx_coalesce_usecs;
-   coal->tx_max_coalesced_frames = coal->rx_max_coalesced_frames;
priv->ethtool.coalesce_usecs   = coal->rx_coalesce_usecs;
priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;



RDMA Read sge errors

2010-01-11 Thread Håkon Bugge
Hi,


I modified the perftest program ib_read_bw to use sge lists. The length of each 
sge is 1 (one) byte. I configured the QP to support up to 32 sges (aligned with 
the capabilities of the device).

Now, everything works correctly up to 30 sges. Using 31 or 32 sges, I receive 29 
bytes in both cases, but the completion gives a local length error.

I am running on CentOS 5.3, Intel E5540, OFED 1.4.1, and ConnectX w/fw 2.6.000.



Thanks, Håkon

--
Håkon Bugge
haakon.bu...@sun.com
+47 924 84 514





[PATCH] opensm SA DB: dump only if modified

2010-01-11 Thread Yevgeny Kliteynik
Currently, if dumping SA DB is enabled, the file is dumped
at every sweep. This patch adds a "dirty" flag that denotes
whether the SA DB needs to be dumped or not.

SA DB contains 3 types of data:
 - informinfo records
 - service records
 - mcmember records

So the flag must be set in 6 places:
 - add/remove informinfo record
 - add/remove service records
 - add/remove mcmember records

Another possible place for setting this flag is mcgroup
create/delete (this is sometimes duplicated by the add/remove
mcmember records), but I think that this is not needed:

The mcgroups w/o mcmembers can be created in two ways:

1. By creating well-known mcgroup for a partition.
   In this case when new instance of OSM is launched,
   the same mcgroup should be created anyway from the
   partition configuration, so no need to force its
   duplication in the SA DB file.

2. By loading mcgroup from SA DB file.
   In this case it's either mcgroup w/o members, which
   is what's handled by (1), or it's mcgroup with members,
   but the members just weren't loaded yet.

So in both cases, no need to force SA DB dump.
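
The pattern described above, reduced to a self-contained sketch: every add or remove marks the database dirty, and the periodic dump is skipped when nothing changed. The struct and function names are hypothetical and far simpler than the OpenSM code; note that this sketch clears the flag only after a successful write, which is a design choice of the sketch and differs from the patch below.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical miniature of a database guarded by a dirty flag. */
struct sa_db {
    int nrecords;
    bool dirty;
};

static void sa_db_add(struct sa_db *db)
{
    db->nrecords++;
    db->dirty = true;
}

static void sa_db_remove(struct sa_db *db)
{
    if (db->nrecords > 0) {
        db->nrecords--;
        db->dirty = true;
    }
}

/* Returns 1 if written, 0 if skipped because clean, -1 on I/O error. */
static int sa_db_dump_if_dirty(struct sa_db *db, FILE *out)
{
    if (!db->dirty)
        return 0;
    if (fprintf(out, "records %d\n", db->nrecords) < 0)
        return -1;
    db->dirty = false;
    return 1;
}
```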

Signed-off-by: Yevgeny Kliteynik 
---
 opensm/include/opensm/osm_sa.h |5 +
 opensm/opensm/osm_inform.c |2 ++
 opensm/opensm/osm_multicast.c  |3 +++
 opensm/opensm/osm_sa.c |   11 ++-
 opensm/opensm/osm_service.c|3 +++
 5 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_sa.h b/opensm/include/opensm/osm_sa.h
index c3484f9..9229d1a 100644
--- a/opensm/include/opensm/osm_sa.h
+++ b/opensm/include/opensm/osm_sa.h
@@ -125,6 +125,7 @@ typedef struct osm_sa {
atomic32_t sa_trans_id;
osm_sa_mad_ctrl_t mad_ctrl;
cl_timer_t sr_timer;
+   boolean_t dirty;
cl_disp_reg_handle_t cpi_disp_h;
cl_disp_reg_handle_t nr_disp_h;
cl_disp_reg_handle_t pir_disp_h;
@@ -178,6 +179,10 @@ typedef struct osm_sa {
 *  mad_ctrl
 *  Mad Controller
 *
+*  dirty
+*  A flag that denotes that SA DB is dirty and needs
+*  to be written to the dump file (if dumping is enabled)
+*
 * SEE ALSO
 *  SM object
 */
diff --git a/opensm/opensm/osm_inform.c b/opensm/opensm/osm_inform.c
index acbaf38..8108213 100644
--- a/opensm/opensm/osm_inform.c
+++ b/opensm/opensm/osm_inform.c
@@ -248,6 +248,7 @@ void osm_infr_insert_to_db(IN osm_subn_t * p_subn, IN 
osm_log_t * p_log,
 #endif

cl_qlist_insert_head(&p_subn->sa_infr_list, &p_infr->list_item);
+   p_subn->p_osm->sa.dirty = TRUE;

OSM_LOG(p_log, OSM_LOG_DEBUG, "Dump after insertion (size %d)\n",
cl_qlist_count(&p_subn->sa_infr_list));
@@ -271,6 +272,7 @@ void osm_infr_remove_from_db(IN osm_subn_t * p_subn, IN 
osm_log_t * p_log,
 OSM_LOG_DEBUG);

cl_qlist_remove_item(&p_subn->sa_infr_list, &p_infr->list_item);
+   p_subn->p_osm->sa.dirty = TRUE;

osm_infr_delete(p_infr);

diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
index 89f4b28..843b6ce 100644
--- a/opensm/opensm/osm_multicast.c
+++ b/opensm/opensm/osm_multicast.c
@@ -243,6 +243,7 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t * subn, 
osm_log_t * log,
++mgrp->full_members == 1)
mgrp_send_notice(subn, log, mgrp, 66);

+   subn->p_osm->sa.dirty = TRUE;
return mcm_port;
 }

@@ -297,6 +298,8 @@ void osm_mgrp_remove_port(osm_subn_t * subn, osm_log_t * 
log, osm_mgrp_t * mgrp,
mgrp_send_notice(subn, log, mgrp, 67);
osm_mgrp_cleanup(subn, mgrp);
}
+
+   subn->p_osm->sa.dirty = TRUE;
 }

 void osm_mgrp_delete_port(osm_subn_t * subn, osm_log_t * log, osm_mgrp_t * 
mgrp,
diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 6fbea8d..0d203ad 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -704,7 +704,13 @@ static void sa_dump_all_sa(osm_opensm_t * p_osm, FILE * 
file)

 int osm_sa_db_file_dump(osm_opensm_t * p_osm)
 {
-   return opensm_dump_to_file(p_osm, "opensm-sa.dump", sa_dump_all_sa);
+   int res = 0;
+   if (p_osm->sa.dirty) {
+   res = opensm_dump_to_file(
+   p_osm, "opensm-sa.dump", sa_dump_all_sa);
+   p_osm->sa.dirty = FALSE;
+   }
+   return res;
 }

 /*
@@ -1110,6 +1116,9 @@ int osm_sa_db_file_load(osm_opensm_t * p_osm)
if (rereg_clients)
p_osm->subn.opt.no_clients_rereg = FALSE;

+   /* We've just finished loading SA DB file - clear the "dirty" flag */
+   p_osm->sa.dirty = FALSE;
+
 _error:
fclose(file);
return ret;
diff --git a/opensm/opensm/osm_service.c b/opensm/opensm/osm_service.c
index ceb8aad..91715e6 100644
--- a/opensm/opensm/osm_service.c
+++ b/opensm/opensm/osm_service.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 

 void osm_svcr_delete(IN osm_svcr_t * p_svcr)
 {
@@ -122,6 +123,7 @@ void osm_svcr_insert

Re: [PATCH 3/3 v2] opensm SA DB dump/restore: dump SA DB only if modified

2010-01-11 Thread Yevgeny Kliteynik

Hi Sasha,

On 26/Nov/09 16:15, Sasha Khapyorsky wrote:

On 13:01 Wed 04 Nov , Yevgeny Kliteynik wrote:

Optimizing SA DB dump - added "dirty" flag to denote
that the SA DB was modified, so that the DB will be
dumped only when the flag is on.

[v2 - no changes, just rebased and resolved conflicts]


[snip]


+
+   subn->p_osm->sa.dirty = TRUE;
  }


In general I don't like an idea of spreading this global "dirty" flag
over various OpenSM areas (it makes the code dirty). But even if it is
needed couldn't we minimize number of such occurrences?


See below


For example those specific ones in osm_multicast.c are duplicated in
osm_sa_mcmember_record.c


This is fixed in the new patch


(and also will cause 'dirty' flag setup on the
SA DB from file loading).


Fixed - the flag is cleared at the end of loading SA DB


Could we consolidate all multicast related
cases with re-routing requesting for example?


SA DB contains 3 types of data:
 - informinfo records
 - service records
 - mcmember records

So the "dirty" flag (or "modified", if you wish) must be set
in 6 places:
 - add/remove informinfo record
 - add/remove service records
 - add/remove mcmember records

Another possible place for setting this flag is mcgroup
create/delete (this is what's sometimes duplicated by the
add/remove mcmember records), but I think that this is not
needed:

The mcgroups w/o mcmembers can be created in two ways:

1. By creating well-known mcgroup for a partition.
   In this case when new instance of OSM is launched,
   the same mcgroup should be created anyway from the
   partition configuration, so no need to force its
   duplication in the SA DB file.

2. By loading mcgroup from SA DB file.
   In this case it's either mcgroup w/o members, which
   is what's handled by (1), or it's mcgroup with members,
   but the members just weren't loaded yet.

So in both cases, no need to force SA DB dump.

I'll post a new patch shortly - please review and let me
know if you have better ideas.

-- Yevgeny
