RE: InfiniBand/RDMA merger plans for 2.6.36

2010-08-06 Thread Tziporet Koren
 It's not personal -- it was a combination of when things were ready and
 when I had time to work on things, and unfortunately it all lined up so
 I wasn't able to get anything big queued up this cycle.

Can you confirm it will be in 2.6.37?


Tziporet


[RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Walukiewicz, Miroslaw
The ibv_post_send()/ibv_post_recv() path through the kernel
(using /dev/infiniband/rdmacm) could be optimized by removing the dynamic memory
allocations on that path.

Currently the transmit/receive path works the following way:
The user calls ibv_post_send(), where a vendor-specific function is invoked.
When the path should go through the kernel, ibv_cmd_post_send() is called.
That function creates the POST_SEND message body that is passed to the kernel.
As the number of SGEs is unknown, a dynamic allocation for the message body is
performed
(see libibverbs/src/cmd.c).

In the kernel the message body is parsed and the structure of WRs and SGEs is
recreated, again using dynamic allocations.
The goal of this operation is to end up with a structure similar to the one in user space.

The proposed path optimization is to remove the dynamic allocations
by redefining the structure passed to the kernel,
from:

struct ibv_post_send {
	__u32 command;
	__u16 in_words;
	__u16 out_words;
	__u64 response;
	__u32 qp_handle;
	__u32 wr_count;
	__u32 sge_count;
	__u32 wqe_size;
	struct ibv_kern_send_wr send_wr[0];
};
to:

struct ibv_post_send {
	__u32 command;
	__u16 in_words;
	__u16 out_words;
	__u64 response;
	__u32 qp_handle;
	__u32 wr_count;
	__u32 sge_count;
	__u32 wqe_size;
	struct ibv_kern_send_wr send_wr[512];
};

A similar change is required in the kernel for struct ib_uverbs_post_send, defined in
/ofa_kernel/include/rdma/ib_uverbs.h.

This change limits the number of send_wr entries passed from unlimited (as made
possible by dynamic allocation) to a reasonable 512.
I think this number should be the maximum number of QP entries available to send.
As all IB/iWARP applications are low-latency applications, the number of
WRs passed is never unbounded.

As a result, instead of performing a dynamic allocation, ibv_cmd_post_send() fills
the proposed structure directly and passes it to the kernel. Whenever the number
of send_wr entries exceeds the limit, ENOMEM is returned.

In the kernel, in ib_uverbs_post_send(), instead of dynamically allocating the
ib_send_wr structures, a table of 512 ib_send_wr structures will be defined and
all entries will be linked into a singly linked list, so the
qp->device->post_send(qp, wr, bad_wr) API is not changed.
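
For illustration, a minimal sketch of that kernel-side linking; the table size and
the helper name are taken from this proposal only and are not existing uverbs code,
and callers would still have to serialize access to the table:

#include <rdma/ib_verbs.h>

#define UVERBS_MAX_SEND_WR 512	/* limit proposed above */

/*
 * Chain the first wr_count entries of a pre-allocated table into the
 * singly linked list that qp->device->post_send() already expects.
 * No kmalloc() is needed on this path.
 */
static struct ib_send_wr *link_fixed_wr_table(struct ib_send_wr *table,
					      unsigned int wr_count)
{
	unsigned int i;

	if (!wr_count || wr_count > UVERBS_MAX_SEND_WR)
		return NULL;

	for (i = 0; i < wr_count - 1; ++i)
		table[i].next = &table[i + 1];
	table[wr_count - 1].next = NULL;

	return &table[0];
}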

As far as I know, no driver currently uses that kernel path for posting buffers,
so the iWARP multicast acceleration implemented in the NES driver would be the
first application that could utilize the optimized path.

Regards,

Mirek

Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com




RE: [PATCH 4/42 v2] drivers/infiniband/hw/nes: Adjust confusing if indentation

2010-08-06 Thread Tung, Chien Tin
 I think the white space is meant to look like this.  I did look at
 whether the sq_cqes = 0; should only be done if netif_queue_stopped().
 In the end I decided this was what was intended, but it would be
 better if someone more familiar with the code reviewed it.
 
 Reported-by: Julia Lawall ju...@diku.dk
 Signed-off-by: Dan Carpenter erro...@gmail.com
 ---
 v2: changed different indents

Looks good to me.  Thanks for the patch.

Acked-by: Chien Tung chien.tin.t...@intel.com

Chien






Re: yet again the atomic operations

2010-08-06 Thread Rui Machado
Hi there,

 There are two kinds supported. QLogic's driver does them in
 the host driver so they are atomic with respect to all the CPUs
 in the host. Mellanox uses HCA wide atomic which means the
 HCA will do a memory read/write without allowing other reads
 or writes from different QP operations passing through that
 HCA to get in between. The CPUs on the host won't see
 atomic operations since from their perspective, it looks
 like a normal read and write from the PCIe bus.

So if the CPU writes/reads to/from the same address, even atomically
(lock), there might be room for some inconsistency on the values? It
is not really atomic from the whole system point of view, just for the
HCA? If so, is there any possibility to make the whole operation
'system-wide' atomic?


 You can see what type the HCA supports with ibv_devinfo -v
 and look for atomic_cap: ATOMIC_HCA (1) or
 atomic_cap: ATOMIC_GLOB (2).

ATOMIC_HCA (1) is what I see on my Mellanox hardware. This is the case you
mentioned: "without allowing other reads or writes from different QP operations
passing through that HCA to get in between".
Does ATOMIC_GLOB (2) then mean atomic with respect to all HCAs and even the CPU?
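
(For reference, a minimal sketch of checking this capability from code rather
than with ibv_devinfo, using the standard ibv_query_device() verb:)

#include <stdio.h>
#include <infiniband/verbs.h>

/* Print the atomic capability reported by the HCA. */
static void print_atomic_cap(struct ibv_context *ctx)
{
	struct ibv_device_attr attr;

	if (ibv_query_device(ctx, &attr)) {
		fprintf(stderr, "ibv_query_device failed\n");
		return;
	}

	switch (attr.atomic_cap) {
	case IBV_ATOMIC_NONE:
		printf("atomic_cap: ATOMIC_NONE\n");
		break;
	case IBV_ATOMIC_HCA:
		printf("atomic_cap: ATOMIC_HCA\n");	/* atomic within one HCA */
		break;
	case IBV_ATOMIC_GLOB:
		printf("atomic_cap: ATOMIC_GLOB\n");	/* atomic system-wide */
		break;
	}
}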

Cheers,
Rui


[PATCHv9 0/11] IBoE support to Infiniband

2010-08-06 Thread Eli Cohen
IBoE allows running the IB transport protocol using Ethernet frames, enabling
the deployment of IB semantics on lossless Ethernet fabrics.

IBoE packets are standard Ethernet frames with an IEEE assigned Ethertype, a
GRH, unmodified IB transport headers and payload.  IB subnet management and SA
services are not required for IBoE operation; Ethernet management practices are
used instead. IBoE resolves MAC addresses using the host IP stack. For
multicast GIDs, standard IP to MAC mappings apply.

The OFA RDMA Verbs API is syntactically unmodified. The CMA is adapted to
support IBoE ports allowing existing RDMA applications to run over IBoE with no
changes.

Address handles for IBoE are required to contain valid L3 addresses (GIDs) and
the IB L2 address fields become reserved. The complementary Ethernet L2 address
information is subsequently derived below the API (currently, the Eth L2
information is encoded in the GID).

As there is no SA in IBoE, the CMA code is adapted to locally fill-in
corresponding path record attributes for IBoE address handles. Also, the CMA
provides the required address handle attributes for SIDR requests and joining
of multicast groups.

With this patch set, each IBoE port is assigned a GID equal to the link local
address of its corresponding net device, and one more GID for each of the VLAN
devices derived from it. IBoE packets are tagged with the VLAN ID of the
corresponding netdevice through which they are generated.

The priority field in the 802.1q header of IBoE packets is derived from the SL
field in the address vector. rdma_cm applications can set the TOS value of the
rdma_cm_id object through the rdma_set_option() API which then maps to SL.
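
As a concrete illustration, a minimal sketch of the librdmacm call an application
would use; the mapping to SL and to the 802.1q priority then happens below the API
as described above:

#include <stdint.h>
#include <rdma/rdma_cma.h>

/* Request a type of service on an rdma_cm_id; with IBoE the TOS value is
 * mapped to the SL and hence to the 802.1q priority of generated frames. */
static int set_service_type(struct rdma_cm_id *id, uint8_t tos)
{
	return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
			       &tos, sizeof tos);
}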

With these patches, IBoE multicast frames may be broadcast as there is
currently no use of a L2 multicast group membership protocol.

To enable IBoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib drivers
must be loaded, and the netdevice for the corresponding IBoE port must be
running. Individual ports of a multi port HCA can be independently configured
as Ethernet (with support for IBoE) or as IB, as it was already the case.

We have successfully tested MPI, SDP, RDS, and native Verbs applications over
IBoE.

Following is a series of 12 patches based on Roland's iboe branch.  This new
series reflects changes based on feedback from the community on the previous
patch set.

Changes from v8
1. Rebase on Roland's iboe branch
2. For userspace consumers, move resolving of GID to MAC from kernel to
userspace (using Link Local GIDs that encode the Eth L2 information).
3. Bug fixes and improvements (see in the patches changelog).

Signed-off-by: Eli Cohen e...@mellanox.co.il
---

 drivers/infiniband/core/cma.c  |  282 
 drivers/infiniband/core/sa_query.c |5 
 drivers/infiniband/core/ucma.c |   54 ++-
 drivers/infiniband/core/ud_header.c|  158 +++--
 drivers/infiniband/core/user_mad.c |   11 
 drivers/infiniband/core/uverbs_cmd.c   |1 
 drivers/infiniband/hw/mlx4/ah.c|  170 --
 drivers/infiniband/hw/mlx4/mad.c   |   32 +
 drivers/infiniband/hw/mlx4/main.c  |  548 ++---
 drivers/infiniband/hw/mlx4/mlx4_ib.h   |   32 +
 drivers/infiniband/hw/mlx4/qp.c|  201 +---
 drivers/infiniband/hw/mthca/mthca_qp.c |2 
 drivers/net/mlx4/en_main.c |   15 
 drivers/net/mlx4/en_netdev.c   |   10 
 drivers/net/mlx4/en_port.c |4 
 drivers/net/mlx4/en_port.h |3 
 drivers/net/mlx4/fw.c  |3 
 drivers/net/mlx4/intf.c|   21 +
 drivers/net/mlx4/mlx4.h|1 
 drivers/net/mlx4/mlx4_en.h |1 
 drivers/net/mlx4/port.c|   19 +
 include/linux/mlx4/cmd.h   |1 
 include/linux/mlx4/device.h|   31 +
 include/linux/mlx4/driver.h|   16 
 include/linux/mlx4/qp.h|9 
 include/rdma/ib_addr.h |  134 
 include/rdma/ib_pack.h |   29 +
 include/rdma/ib_user_verbs.h   |3 
 28 files changed, 1599 insertions(+), 197 deletions(-)


[PATCHv9 01/12] IB/umad: Enable support only for IB ports

2010-08-06 Thread Eli Cohen
Initialize umad context only for Infiniband (as opposed to Ethernet) ports.
Since the only ULP using QP1 is the CM and since all protocol traffic is done
in kernel, we do not expose QP1 to userspace.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/core/user_mad.c |   11 +++
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 6babb72..8b6e8a2 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1149,8 +1149,9 @@ static void ib_umad_add_one(struct ib_device *device)
 	for (i = s; i <= e; ++i) {
 		umad_dev->port[i - s].umad_dev = umad_dev;
 
-		if (ib_umad_init_port(device, i, &umad_dev->port[i - s]))
-			goto err;
+		if (rdma_port_get_link_layer(device, i) == IB_LINK_LAYER_INFINIBAND)
+			if (ib_umad_init_port(device, i, &umad_dev->port[i - s]))
+				goto err;
 	}
 
 	ib_set_client_data(device, &umad_client, umad_dev);
@@ -1159,7 +1160,8 @@ static void ib_umad_add_one(struct ib_device *device)
 
 err:
 	while (--i >= s)
-		ib_umad_kill_port(&umad_dev->port[i - s]);
+		if (rdma_port_get_link_layer(device, i) == IB_LINK_LAYER_INFINIBAND)
+			ib_umad_kill_port(&umad_dev->port[i - s]);
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
@@ -1173,7 +1175,8 @@ static void ib_umad_remove_one(struct ib_device *device)
 		return;
 
 	for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i)
-		ib_umad_kill_port(&umad_dev->port[i]);
+		if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND)
+			ib_umad_kill_port(&umad_dev->port[i]);
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
-- 
1.7.2



[PATCHv9 03/12] ib_core: IBoE UD packet packing support

2010-08-06 Thread Eli Cohen
Add support functions to aid in packing IBoE packets.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/core/ud_header.c |  128 ++-
 include/rdma/ib_pack.h  |   30 +++-
 2 files changed, 124 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 650b501..58b5537 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -80,6 +80,29 @@ static const struct ib_field lrh_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field eth_table[]  = {
+   { STRUCT_FIELD(eth, dmac_h),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(eth, dmac_l),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(eth, smac_h),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(eth, smac_l),
+ .offset_words = 2,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(eth, type),
+ .offset_words = 3,
+ .offset_bits  = 0,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -180,17 +203,15 @@ static const struct ib_field deth_table[] = {
 /**
  * ib_ud_header_init - Initialize UD header structure
  * @payload_bytes:Length of packet payload
+ * @lrh_present: specify if LRH is present
+ * @eth_present: specify if Eth header is present
  * @grh_present:GRH flag (if non-zero, GRH will be included)
- * @immediate_present: specify if immediate data should be used
+ * @immediate_present: specify if immediate data is present
  * @header:Structure to initialize
- *
- * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header,
- * lrh.packet_length, grh.ip_version, grh.payload_length,
- * grh.next_header, bth.opcode, bth.pad_count and
- * bth.transport_header_version fields of a struct ib_ud_header given
- * the payload length and whether a GRH will be included.
  */
 void ib_ud_header_init(int payload_bytes,
+  int  lrh_present,
+  int  eth_present,
   int  grh_present,
   int  immediate_present,
   struct ib_ud_header *header)
@@ -199,42 +220,79 @@ void ib_ud_header_init(int
payload_bytes,
 
memset(header, 0, sizeof *header);
 
-   header-lrh.link_version = 0;
-   header-lrh.link_next_header =
-   grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL;
-   packet_length= (IB_LRH_BYTES +
-   IB_BTH_BYTES +
-   IB_DETH_BYTES+
-   payload_bytes+
-   4+ /* ICRC */
-   3) / 4;/* round up */
-
-   header-grh_present  = grh_present;
+   if (lrh_present) {
+   header-lrh.link_version = 0;
+   header-lrh.link_next_header =
+   grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL;
+   packet_length = IB_LRH_BYTES;
+   }
+
+   if (eth_present)
+   packet_length += IB_ETH_BYTES;
+   packet_length += IB_BTH_BYTES + IB_DETH_BYTES + payload_bytes +
+   4   + /* ICRC */
+   3;/* round up */
+   packet_length /= 4;
if (grh_present) {
-   packet_length  += IB_GRH_BYTES / 4;
-   header-grh.ip_version  = 6;
-   header-grh.payload_length  =
-   cpu_to_be16((IB_BTH_BYTES +
-IB_DETH_BYTES+
-payload_bytes+
-4+ /* ICRC */
-3)  ~3);  /* round up */
+   packet_length += IB_GRH_BYTES / 4;
+   header-grh.ip_version = 6;
+   header-grh.payload_length =
+   cpu_to_be16((IB_BTH_BYTES  +
+IB_DETH_BYTES +
+payload_bytes +
+4 + /* ICRC */
+3)  ~3);   /* round up */
header-grh.next_header = 0x1b;
}
 
-   header-lrh.packet_length = cpu_to_be16(packet_length);
+   if (lrh_present)
+   header-lrh.packet_length = cpu_to_be16(packet_length);
 
-   header-immediate_present= 

[PATCHv9 04/12] mlx4: Adapt to API changes in the previous commit

2010-08-06 Thread Eli Cohen
This is done to synchronize mlx4 with the API changes.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/hw/mlx4/qp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 6a60827..bb1277c 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1231,7 +1231,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 	for (i = 0; i < wr->num_sge; ++i)
 		send_size += wr->sg_list[i].length;
 
-	ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), 0, &sqp->ud_header);
+	ib_ud_header_init(send_size, 1, 0, mlx4_ib_ah_grh_present(ah), 0, &sqp->ud_header);
 
 	sqp->ud_header.lrh.service_level   =
 		be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28;
-- 
1.7.2



[PATCHv9 07/12] mlx4: Allow interfaces to correspond to each other

2010-08-06 Thread Eli Cohen
Add a mechanism for mlx4 core interfaces to get a pointer to other interfaces'
device object. For this, an exported function, mlx4_get_prot_dev(), is added,
which allows an interface to get some other interface's device based on the
protocol that interface implements. Two new protocols are added, MLX4_PROT_IB
and MLX4_PROT_EN. This comes as a preparation for IBoE so that mlx4_ib will be
able to refer to the corresponding en device.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/net/mlx4/en_main.c  |   15 ---
 drivers/net/mlx4/intf.c |   21 +
 drivers/net/mlx4/mlx4.h |1 +
 include/linux/mlx4/driver.h |   16 
 4 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c
index cbabf14..93e0239 100644
--- a/drivers/net/mlx4/en_main.c
+++ b/drivers/net/mlx4/en_main.c
@@ -101,6 +101,13 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
return 0;
 }
 
+static void *get_netdev(struct mlx4_dev *dev, void *ctx, u8 port)
+{
+   struct mlx4_en_dev *endev = ctx;
+
+	return endev->pndev[port];
+}
+
 static void mlx4_en_event(struct mlx4_dev *dev, void *endev_ptr,
  enum mlx4_dev_event event, int port)
 {
@@ -263,9 +270,11 @@ err_free_res:
 }
 
 static struct mlx4_interface mlx4_en_interface = {
-   .add= mlx4_en_add,
-   .remove = mlx4_en_remove,
-   .event  = mlx4_en_event,
+   .add= mlx4_en_add,
+   .remove = mlx4_en_remove,
+   .event  = mlx4_en_event,
+   .get_prot_dev   = get_netdev,
+   .protocol   = MLX4_PROT_EN,
 };
 
 static int __init mlx4_en_init(void)
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 5550678..70d67bc 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -161,3 +161,24 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 
mutex_unlock(intf_mutex);
 }
+
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+   struct mlx4_priv *priv = mlx4_priv(dev);
+   struct mlx4_device_context *dev_ctx;
+   unsigned long flags;
+   void *result = NULL;
+
+	spin_lock_irqsave(&priv->ctx_lock, flags);
+
+	list_for_each_entry(dev_ctx, &priv->ctx_list, list)
+		if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) {
+			result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port);
+			break;
+		}
+
+	spin_unlock_irqrestore(&priv->ctx_lock, flags);
+
+   return result;
+}
+EXPORT_SYMBOL(mlx4_get_prot_dev);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 13343e8..4dcf567 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -363,6 +363,7 @@ int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int 
port);
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int 
port);
 
 struct mlx4_dev_cap;
 struct mlx4_init_hca_param;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 53c5fdb..0083256 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -44,15 +44,23 @@ enum mlx4_dev_event {
MLX4_DEV_EVENT_PORT_REINIT,
 };
 
+enum mlx4_prot {
+   MLX4_PROT_IB,
+   MLX4_PROT_EN,
+};
+
 struct mlx4_interface {
-   void *  (*add)   (struct mlx4_dev *dev);
-   void(*remove)(struct mlx4_dev *dev, void *context);
-   void(*event) (struct mlx4_dev *dev, void *context,
- enum mlx4_dev_event event, int port);
+   void *  (*add)   (struct mlx4_dev *dev);
+   void(*remove)(struct mlx4_dev *dev, void *context);
+   void(*event) (struct mlx4_dev *dev, void *context,
+ enum mlx4_dev_event event, int port);
+   void *  (*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port);
+   enum mlx4_prot  protocol;
struct list_headlist;
 };
 
 int mlx4_register_interface(struct mlx4_interface *intf);
 void mlx4_unregister_interface(struct mlx4_interface *intf);
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 #endif /* MLX4_DRIVER_H */
-- 
1.7.2



[PATCHv9 09/12] ib_core: Add VLAN support to IBoE

2010-08-06 Thread Eli Cohen
Add 802.1q vlan support to IBoE. The vlan tag is encoded within the GID
derived from a link local address in the following way:

GID[11] and GID[12] contain the VLAN ID.
The 3 bit user priority field is identical to the 3 bits of the SL.

For rdma_cm applications, the TOS field is used to generate the SL field by
shifting it right by 5 bits, effectively taking the 3 MS bits of the TOS field.
In order to support userspace verbs consumers, ib_uverbs_get_mac has been
changed into ib_uverbs_get_eth_l2_addr and now returns both MAC and VLAN
information.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/core/cma.c   |   20 +--
 drivers/infiniband/core/ucma.c  |   13 -
 drivers/infiniband/core/ud_header.c |   30 +++-
 include/rdma/ib_addr.h  |   44 +++---
 include/rdma/ib_pack.h  |   19 +++
 5 files changed, 101 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 1d97882..4ff28b7 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1764,6 +1764,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
struct sockaddr_in *src_addr = (struct sockaddr_in 
*)route-addr.src_addr;
struct sockaddr_in *dst_addr = (struct sockaddr_in 
*)route-addr.dst_addr;
struct net_device *ndev = NULL;
+   u16 vid;
 
if (src_addr-sin_family != dst_addr-sin_family)
return -EINVAL;
@@ -1783,14 +1784,6 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 
route-num_paths = 1;
 
-   iboe_mac_to_ll(route-path_rec-sgid, addr-dev_addr.src_dev_addr);
-   iboe_mac_to_ll(route-path_rec-dgid, addr-dev_addr.dst_dev_addr);
-
-   route-path_rec-hop_limit = 1;
-   route-path_rec-reversible = 1;
-   route-path_rec-pkey = cpu_to_be16(0x);
-   route-path_rec-mtu_selector = IB_SA_EQ;
-
if (addr-dev_addr.bound_dev_if)
ndev = dev_get_by_index(init_net, addr-dev_addr.bound_dev_if);
if (!ndev) {
@@ -1798,6 +1791,17 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
goto err2;
}
 
+   vid = rdma_vlan_dev_vlan_id(ndev);
+
+   iboe_mac_vlan_to_ll(route-path_rec-sgid, 
addr-dev_addr.src_dev_addr, vid);
+   iboe_mac_vlan_to_ll(route-path_rec-dgid, 
addr-dev_addr.dst_dev_addr, vid);
+
+   route-path_rec-hop_limit = 1;
+   route-path_rec-reversible = 1;
+   route-path_rec-pkey = cpu_to_be16(0x);
+   route-path_rec-mtu_selector = IB_SA_EQ;
+   route-path_rec-sl = id_priv-tos  5;
+
route-path_rec-mtu = iboe_get_mtu(ndev-mtu);
route-path_rec-rate_selector = IB_SA_EQ;
route-path_rec-rate = iboe_get_rate(ndev);
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 3d3c926..a1f998c 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -587,13 +587,22 @@ static void ucma_copy_iboe_route(struct 
rdma_ucm_query_route_resp *resp,
 struct rdma_route *route)
 {
struct rdma_dev_addr *dev_addr;
+   struct net_device *dev;
+   u16 vid = 0;
 
resp-num_paths = route-num_paths;
switch (route-num_paths) {
case 0:
dev_addr = route-addr.dev_addr;
-   iboe_mac_to_ll((union ib_gid *) resp-ib_route[0].dgid,
-  dev_addr-dst_dev_addr);
+   dev = dev_get_by_index(init_net, dev_addr-bound_dev_if);
+   if (dev) {
+   vid = rdma_vlan_dev_vlan_id(dev);
+   dev_put(dev);
+   }
+
+
+   iboe_mac_vlan_to_ll((union ib_gid *) resp-ib_route[0].dgid,
+   dev_addr-dst_dev_addr, vid);
iboe_addr_get_sgid(dev_addr,
   (union ib_gid *) resp-ib_route[0].sgid);
resp-ib_route[0].pkey = cpu_to_be16(0x);
diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 58b5537..6ac2572 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -33,6 +33,7 @@
 
 #include linux/errno.h
 #include linux/string.h
+#include linux/if_ether.h
 
 #include rdma/ib_pack.h
 
@@ -103,6 +104,17 @@ static const struct ib_field eth_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field vlan_table[]  = {
+   { STRUCT_FIELD(vlan, tag),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(vlan, type),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -205,6 

[PATCHv9 08/12] mlx4: Add support for IBoE - address resolution

2010-08-06 Thread Eli Cohen
The following patch handles address vector creation for IBoE ports. mlx4 needs
the MAC address of the remote node to include it in the WQE of a UD QP or in
the QP context of connected QPs. Address resolution is done atomically in the
case of a link local address or a multicast GID; otherwise -EINVAL is
returned.  mlx4 transport packets were changed too to accommodate IBoE.
Multicast group attach/detach calls dev_mc_add/remove to update the NIC's
multicast filters. Since attaching a QP to a multicast group does not require
the QP to be in a state different than INIT, this is fine for IB. For IBoE,
however, we need the port assigned to the QP in order to call dev_mc_add() for
the correct netdevice, while the port is assigned only when moving from INIT to
RTR. Hence, we must keep track of all the multicast groups attached to a QP and
call dev_mc_add() when the port becomes available.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
Changes from V8:
1. Limit max MTU of IBoE port to 2K.
2. Remove implementation of mlx4_ib_get_eth_l2_addr().
3. Fix failure to initialize port number in add_gid_entry().


 drivers/infiniband/hw/mlx4/ah.c  |  161 +---
 drivers/infiniband/hw/mlx4/mad.c |   32 ++-
 drivers/infiniband/hw/mlx4/main.c|  474 +++---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   32 +++-
 drivers/infiniband/hw/mlx4/qp.c  |  140 --
 drivers/net/mlx4/en_port.c   |4 +-
 drivers/net/mlx4/en_port.h   |3 +-
 drivers/net/mlx4/fw.c|3 +-
 include/linux/mlx4/cmd.h |1 +
 include/linux/mlx4/device.h  |   30 ++-
 include/linux/mlx4/qp.h  |7 +-
 11 files changed, 771 insertions(+), 116 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index 11a236f..57d99d2 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -33,63 +33,157 @@
 #include linux/slab.h
 
 #include mlx4_ib.h
+#include rdma/ib_addr.h
+#include linux/inet.h
+#include linux/string.h
 
-struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr 
*ah_attr,
+   u8 *mac, int *is_mcast, u8 port)
 {
-   struct mlx4_dev *dev = to_mdev(pd-device)-dev;
-   struct mlx4_ib_ah *ah;
+   struct mlx4_ib_iboe *iboe = dev-iboe;
+   struct in6_addr in6;
 
-   ah = kmalloc(sizeof *ah, GFP_ATOMIC);
-   if (!ah)
-   return ERR_PTR(-ENOMEM);
+   *is_mcast = 0;
+   spin_lock(iboe-lock);
+   if (!iboe-netdevs[port - 1]) {
+   spin_unlock(iboe-lock);
+   return -EINVAL;
+   }
+   spin_unlock(iboe-lock);
 
-   memset(ah-av, 0, sizeof ah-av);
+   memcpy(in6, ah_attr-grh.dgid.raw, sizeof in6);
+   if (rdma_link_local_addr(in6))
+   rdma_get_ll_mac(in6, mac);
+   else if (rdma_is_multicast_addr(in6)) {
+   rdma_get_mcast_mac(in6, mac);
+   *is_mcast = 1;
+   } else
+   return -EINVAL;
 
-   ah-av.port_pd = cpu_to_be32(to_mpd(pd)-pdn | (ah_attr-port_num  
24));
-   ah-av.g_slid  = ah_attr-src_path_bits;
-   ah-av.dlid= cpu_to_be16(ah_attr-dlid);
-   if (ah_attr-static_rate) {
-   ah-av.stat_rate = ah_attr-static_rate + MLX4_STAT_RATE_OFFSET;
-   while (ah-av.stat_rate  IB_RATE_2_5_GBPS + 
MLX4_STAT_RATE_OFFSET 
-  !(1  ah-av.stat_rate  dev-caps.stat_rate_support))
-   --ah-av.stat_rate;
-   }
-   ah-av.sl_tclass_flowlabel = cpu_to_be32(ah_attr-sl  28);
+   return 0;
+}
+
+static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
+ struct mlx4_ib_ah *ah)
+{
+   struct mlx4_dev *dev = to_mdev(pd-device)-dev;
+
+   ah-av.ib.port_pd = cpu_to_be32(to_mpd(pd)-pdn | (ah_attr-port_num  
24));
+   ah-av.ib.g_slid  = ah_attr-src_path_bits;
if (ah_attr-ah_flags  IB_AH_GRH) {
-   ah-av.g_slid   |= 0x80;
-   ah-av.gid_index = ah_attr-grh.sgid_index;
-   ah-av.hop_limit = ah_attr-grh.hop_limit;
-   ah-av.sl_tclass_flowlabel |=
+   ah-av.ib.g_slid   |= 0x80;
+   ah-av.ib.gid_index = ah_attr-grh.sgid_index;
+   ah-av.ib.hop_limit = ah_attr-grh.hop_limit;
+   ah-av.ib.sl_tclass_flowlabel |=
cpu_to_be32((ah_attr-grh.traffic_class  20) |
ah_attr-grh.flow_label);
-   memcpy(ah-av.dgid, ah_attr-grh.dgid.raw, 16);
+   memcpy(ah-av.ib.dgid, ah_attr-grh.dgid.raw, 16);
+   }
+
+   ah-av.ib.dlid= cpu_to_be16(ah_attr-dlid);
+   if (ah_attr-static_rate) {
+   ah-av.ib.stat_rate = ah_attr-static_rate + 
MLX4_STAT_RATE_OFFSET;
+   while (ah-av.ib.stat_rate  

[PATCHv9 10/12] mlx4: Adapt to API changes in the previous commit

2010-08-06 Thread Eli Cohen
This is done to synchronize mlx4 with the API changes.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/hw/mlx4/qp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 837a612..8041966 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1296,7 +1296,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 				ah->av.ib.gid_index, &sgid);
 	if (err)
 		return err;
-	ib_ud_header_init(send_size, !is_eth, is_eth, is_grh, 0, &sqp->ud_header);
+	ib_ud_header_init(send_size, !is_eth, is_eth, 0, is_grh, 0, &sqp->ud_header);
 
 	if (!is_eth) {
 		sqp->ud_header.lrh.service_level =
-- 
1.7.2



[PATCHv9 11/12] mthca: Adapt to API changes in ib_core for UD packing

2010-08-06 Thread Eli Cohen
This is done to synchronize mthca with the API changes.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/hw/mthca/mthca_qp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 1a1c55f..a34c9d3 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1493,7 +1493,7 @@ static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp,
 	int err;
 	u16 pkey;
 
-	ib_ud_header_init(256, /* assume a MAD */ 1, 0,
+	ib_ud_header_init(256, /* assume a MAD */ 1, 0, 0,
 			  mthca_ah_grh_present(to_mah(wr->wr.ud.ah)), 0,
 			  &sqp->ud_header);
 
-- 
1.7.2



[PATCHv9 12/12] mlx4: Add vlan support to IBoE

2010-08-06 Thread Eli Cohen
This patch allows IBoE traffic to be encapsulated in 802.1q tagged VLAN
frames. The VLAN tag is encoded in the GID and derived from it by a simple
computation. The netdev notifier callback is modified to catch the addition and
removal of VLAN devices, and the port's GID table is updated to reflect the
change, such that for each netdevice there is an entry in the GID table. When
the port's GID table is exhausted, GID entries will not be added. Only children
of the main interface can add entries to the GID table. If a VLAN interface is
added on top of another VLAN interface (e.g. vconfig add eth2.6 8), that
interface will not add an entry to the GID table.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
Changes from V8:
1. Bug fix in build_mlx_header failing to initialize is_vlan.
2. Fix mapping of SL to 802.1q priority
3. Change allocation in update_ipv6_gids() from GFP_KERNEL to GFP_ATOMIC since
   it is called with spinlock acquired.
4. Fix bug in populating the GID table after VLANs are added on the
   corresponding Ethernet interface.
5. Fix bug in mlx4_ib_netdev_event() accessing NULL device pointer.



 drivers/infiniband/hw/mlx4/ah.c   |9 +++
 drivers/infiniband/hw/mlx4/main.c |   98 -
 drivers/infiniband/hw/mlx4/qp.c   |   69 +-
 drivers/net/mlx4/en_netdev.c  |   10 
 drivers/net/mlx4/mlx4_en.h|1 +
 drivers/net/mlx4/port.c   |   19 +++
 include/linux/mlx4/device.h   |1 +
 include/linux/mlx4/qp.h   |2 +-
 8 files changed, 184 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index 57d99d2..9677ed8 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -36,6 +36,7 @@
 #include rdma/ib_addr.h
 #include linux/inet.h
 #include linux/string.h
+#include rdma/ib_cache.h
 
 int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr 
*ah_attr,
u8 *mac, int *is_mcast, u8 port)
@@ -100,14 +101,22 @@ static struct ib_ah *create_iboe_ah(struct ib_pd *pd, 
struct ib_ah_attr *ah_attr
u8 mac[6];
int err;
int is_mcast;
+   u16 vlan_tag;
+   union ib_gid sgid;
 
err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, is_mcast, 
ah_attr-port_num);
if (err)
return ERR_PTR(err);
 
memcpy(ah-av.eth.mac, mac, 6);
+   err = ib_get_cached_gid(pd-device, ah_attr-port_num, 
ah_attr-grh.sgid_index, sgid);
+   if (err)
+   return ERR_PTR(err);
+   vlan_tag = rdma_get_vlan_id(sgid);
+   vlan_tag |= (ah_attr-sl  7)  13;
ah-av.eth.port_pd = cpu_to_be32(to_mpd(pd)-pdn | (ah_attr-port_num 
 24));
ah-av.eth.gid_index = ah_attr-grh.sgid_index;
+   ah-av.eth.vlan = cpu_to_be16(vlan_tag);
if (ah_attr-static_rate) {
ah-av.eth.stat_rate = ah_attr-static_rate + 
MLX4_STAT_RATE_OFFSET;
while (ah-av.eth.stat_rate  IB_RATE_2_5_GBPS + 
MLX4_STAT_RATE_OFFSET 
diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 8c0e447..50882ad 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -38,6 +38,7 @@
 #include linux/netdevice.h
 #include linux/inetdevice.h
 #include linux/rtnetlink.h
+#include linux/if_vlan.h
 
 #include rdma/ib_smi.h
 #include rdma/ib_user_verbs.h
@@ -79,6 +80,8 @@ static void init_query_mad(struct ib_smp *mad)
mad-method= IB_MGMT_METHOD_GET;
 }
 
+static union ib_gid zgid;
+
 static int mlx4_ib_query_device(struct ib_device *ibdev,
struct ib_device_attr *props)
 {
@@ -786,12 +789,17 @@ static struct device_attribute *mlx4_class_attributes[] = 
{
dev_attr_board_id
 };
 
-static void mlx4_addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+static void mlx4_addrconf_ifid_eui48(u8 *eui, int is_vlan, u16 vlan_id, struct 
net_device *dev)
 {
memcpy(eui, dev-dev_addr, 3);
memcpy(eui + 5, dev-dev_addr + 3, 3);
-   eui[3] = 0xFF;
-   eui[4] = 0xFE;
+   if (is_vlan) {
+   eui[3] = vlan_id  8;
+   eui[4] = vlan_id  0xff;
+   } else {
+   eui[3] = 0xff;
+   eui[4] = 0xfe;
+   }
eui[0] ^= 2;
 }
 
@@ -833,28 +841,92 @@ static int update_ipv6_gids(struct mlx4_ib_dev *dev, int 
port, int clear)
 {
struct net_device *ndev = dev-iboe.netdevs[port - 1];
struct update_gid_work *work;
+   struct net_device *tmp;
+   int i;
+   u8 *hits;
+   int ret;
+   union ib_gid gid;
+   int free;
+   int found;
+   int need_update = 0;
+   int is_vlan;
+   u16 vid;
 
work = kzalloc(sizeof *work, GFP_ATOMIC);
if (!work)
return -ENOMEM;
 
-   if (!clear) {
-   mlx4_addrconf_ifid_eui48(work-gids[0].raw[8], ndev);
-   work-gids[0].global.subnet_prefix = 

[PATCHv9 1/4] libibverbs: Add link layer field to ibv_port_attr

2010-08-06 Thread Eli Cohen
This field can have one of the values - IBV_LINK_LAYER_UNSPECIFIED,
IBV_LINK_LAYER_INFINIBAND, IBV_LINK_LAYER_ETHERNET. It can be used by
applications to know the link layer used by the port, which can be either
Infiniband or Ethernet. The addition of the new field does not change the size
of struct ibv_port_attr due to alignment of the preceding field. Binary
compatibility is not compromised either since new apps with old libraries will
determine the link layer as IB, while old applications with a new library do
not read this field.
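
For example, a minimal sketch of how an application could consume the new field;
treating IBV_LINK_LAYER_UNSPECIFIED as IB follows the compatibility argument
above:

#include <infiniband/verbs.h>

/* Returns 1 if the port runs over Ethernet (IBoE), 0 for InfiniBand.
 * IBV_LINK_LAYER_UNSPECIFIED is treated as InfiniBand, since old
 * libraries never set the field. */
static int port_is_iboe(struct ibv_context *ctx, uint8_t port_num)
{
	struct ibv_port_attr port_attr;

	if (ibv_query_port(ctx, port_num, &port_attr))
		return 0;

	return port_attr.link_layer == IBV_LINK_LAYER_ETHERNET;
}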

Solution suggested by:
   Roland Dreier rola...@cisco.com
   Jason Gunthorpe jguntho...@obsidianresearch.com
Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 include/infiniband/verbs.h |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 0f1cb2e..17df3ff 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -161,6 +161,12 @@ enum ibv_port_state {
IBV_PORT_ACTIVE_DEFER   = 5
 };
 
+enum {
+   IBV_LINK_LAYER_UNSPECIFIED,
+   IBV_LINK_LAYER_INFINIBAND,
+   IBV_LINK_LAYER_ETHERNET,
+};
+
 struct ibv_port_attr {
enum ibv_port_state state;
enum ibv_mtumax_mtu;
@@ -181,6 +187,8 @@ struct ibv_port_attr {
uint8_t active_width;
uint8_t active_speed;
uint8_t phys_state;
+   uint8_t link_layer;
+   uint8_t pad;
 };
 
 enum ibv_event_type {
@@ -693,6 +701,16 @@ struct ibv_context {
void   *abi_compat;
 };
 
+static inline int ___ibv_query_port(struct ibv_context *context,
+				    uint8_t port_num,
+				    struct ibv_port_attr *port_attr)
+{
+	port_attr->link_layer = IBV_LINK_LAYER_UNSPECIFIED;
+	port_attr->pad = 0;
+
+	return context->ops.query_port(context, port_num, port_attr);
+}
+
 /**
  * ibv_get_device_list - Get list of IB devices currently available
  * @num_devices: optional.  if non-NULL, set to the number of devices
@@ -1097,4 +1115,7 @@ END_C_DECLS
 
 #  undef __attribute_const
 
+#define ibv_query_port(context, port_num, port_attr) \
+   ___ibv_query_port(context, port_num, port_attr)
+
 #endif /* INFINIBAND_VERBS_H */
-- 
1.7.2



[PATCHv9 2/4] libibverbs: change kernel API to accept link layer

2010-08-06 Thread Eli Cohen
Modify the code to allow passing the link layer of a port from kernel to user.
Update ibv_query_port.3 man page with the change.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 include/infiniband/kern-abi.h |3 ++-
 man/ibv_query_port.3  |1 +
 src/cmd.c |1 +
 3 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 0db083a..619ea7e 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -223,7 +223,8 @@ struct ibv_query_port_resp {
__u8  active_width;
__u8  active_speed;
__u8  phys_state;
-   __u8  reserved[3];
+   __u8  link_layer;
+   __u8  reserved[2];
 };
 
 struct ibv_alloc_pd {
diff --git a/man/ibv_query_port.3 b/man/ibv_query_port.3
index 882470d..6d8b873 100644
--- a/man/ibv_query_port.3
+++ b/man/ibv_query_port.3
@@ -44,6 +44,7 @@ uint8_t init_type_reply;/* Type of 
initialization performed by S
 uint8_t active_width;   /* Currently active link width */
 uint8_t active_speed;   /* Currently active link speed */
 uint8_t phys_state; /* Physical port state */
+uint8_t link_layer; /* link layer protocol of the port */  
 
 .in -8
 };
 .sp
diff --git a/src/cmd.c b/src/cmd.c
index cbd5288..39af833 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -196,6 +196,7 @@ int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num,
 	port_attr->active_width  = resp.active_width;
 	port_attr->active_speed  = resp.active_speed;
 	port_attr->phys_state    = resp.phys_state;
+	port_attr->link_layer    = resp.link_layer;
 
 	return 0;
 }
-- 
1.7.2



[PATCHv9 3/4] libibverbs: Add API to resolve GID to L2 address

2010-08-06 Thread Eli Cohen
Add a new API, resolve_eth_gid(), which resolves a GID to layer 2 address
information. A GID resembles an IPv6 link local address and encodes the MAC
address and the VLAN tag within it. The function accepts the destination GID,
port number and source GID index, and returns the MAC, the VLAN, and
indications of whether the remote address is VLAN tagged and whether it is
multicast.

VLAN encoding is done as follows:
gid[11] is MS byte
gid[12] is LS byte

Signed-off-by: Eli Cohen e...@mellanox.co.il
---

Changes from V8:
1. Move resolving of GID to MAC from kernel space to user space.

 include/infiniband/driver.h |7 +++
 src/verbs.c |   97 +++
 2 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 9a81416..be4ec98 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -37,6 +37,8 @@
 
 #include infiniband/verbs.h
 #include infiniband/kern-abi.h
+#include arpa/inet.h
+#include string.h
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern C {
@@ -143,4 +145,9 @@ const char *ibv_get_sysfs_path(void);
 int ibv_read_sysfs_file(const char *dir, const char *file,
char *buf, size_t size);
 
+int resolve_eth_gid(struct ibv_pd *pd, uint8_t port_num,
+   union ibv_gid *dgid, uint8_t sgid_index,
+   uint8_t mac[], uint16_t *vlan, uint8_t *tagged,
+   uint8_t *is_mcast);
+
 #endif /* INFINIBAND_DRIVER_H */
diff --git a/src/verbs.c b/src/verbs.c
index ba3c0a4..272feac 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -148,6 +148,7 @@ struct ibv_pd *__ibv_alloc_pd(struct ibv_context *context)
 }
 default_symver(__ibv_alloc_pd, ibv_alloc_pd);
 
+
 int __ibv_dealloc_pd(struct ibv_pd *pd)
 {
return pd-context-ops.dealloc_pd(pd);
@@ -543,3 +544,99 @@ int __ibv_detach_mcast(struct ibv_qp *qp, const union 
ibv_gid *gid, uint16_t lid
return qp-context-ops.detach_mcast(qp, gid, lid);
 }
 default_symver(__ibv_detach_mcast, ibv_detach_mcast);
+
+static uint16_t get_vlan_id(const union ibv_gid *dgid)
+{
+   return dgid-raw[11]  8 | dgid-raw[12];
+}
+
+static void get_ll_mac(const union ibv_gid *gid, uint8_t *mac)
+{
+   memcpy(mac, gid-raw[8], 3);
+   memcpy(mac + 3, gid-raw[13], 3);
+   mac[0] ^= 2;
+}
+
+static int is_multicast_gid(const union ibv_gid *gid)
+{
+   return gid-raw[0] == 0xff;
+}
+
+static void get_mcast_mac(const union ibv_gid *gid, uint8_t *mac)
+{
+   int i;
+
+   mac[0] = 0x33;
+   mac[1] = 0x33;
+   for (i = 2; i  6; ++i)
+   mac[i] = gid-raw[i + 10];
+}
+
+static int is_link_local_gid(const union ibv_gid *gid)
+{
+   uint32_t hi = *(uint32_t *)(gid-raw);
+   uint32_t lo = *(uint32_t *)(gid-raw + 4);
+   if (hi == htonl(0xfe80)  lo == 0)
+   return 1;
+
+   return 0;
+}
+
+static int resolve_gid(const union ibv_gid *dgid, uint8_t *mac, uint8_t 
*is_mcast)
+{
+   if (is_link_local_gid(dgid)) {
+   get_ll_mac(dgid, mac);
+   *is_mcast = 0;
+   } else if (is_multicast_gid(dgid)) {
+   get_mcast_mac(dgid, mac);
+   *is_mcast = 1;
+   } else
+   return -EINVAL;
+
+   return 0;
+}
+
+static int is_tagged_vlan(const union ibv_gid *gid)
+{
+   uint16_t tag;
+
+   tag = gid-raw[11]  8 |  gid-raw[12];
+
+   return tag  0x1000;
+}
+
+int __resolve_eth_gid(struct ibv_pd *pd, uint8_t port_num,
+ const union ibv_gid *dgid, uint8_t sgid_index,
+ uint8_t mac[], uint16_t *vlan, uint8_t *tagged,
+ uint8_t *is_mcast)
+{
+   int err;
+   union ibv_gid sgid;
+   int stagged, svlan;
+
+   err = resolve_gid(dgid, mac, is_mcast);
+   if (err)
+   return err;
+
+   err = ibv_query_gid(pd-context, port_num, sgid_index, sgid);
+   if (err)
+   return err;
+
+   stagged = is_tagged_vlan(sgid);
+   if (stagged) {
+   if (!is_tagged_vlan(dgid))
+   return -1;
+
+   svlan = get_vlan_id(sgid);
+   if (svlan != get_vlan_id(dgid))
+   return -1;
+
+   *tagged = 1;
+   *vlan = svlan;
+   } else
+   *tagged = 0;
+
+   return 0;
+}
+default_symver(__resolve_eth_gid, resolve_eth_gid);
+
-- 
1.7.2



[PATCHv9 4/4] libibverbs: Update examples for IBoE

2010-08-06 Thread Eli Cohen
Since IBoE requires usage of a GRH, update the ibv_*_pingpong examples to accept
GIDs. GIDs are given as an index into the local port's GID table and are
exchanged between the client and the server through the socket connection. The
examples are also modified to pass the GID index to the code that creates the
address vector, as a preparation for using GIDs other than the one at index 0.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 examples/devinfo.c  |   14 +++
 examples/pingpong.c |   31 
 examples/pingpong.h |4 ++
 examples/rc_pingpong.c  |   91 ++
 examples/srq_pingpong.c |   84 ---
 examples/uc_pingpong.c  |   82 +++---
 examples/ud_pingpong.c  |   81 ++
 7 files changed, 297 insertions(+), 90 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index 84f95c7..393ec04 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -184,6 +184,19 @@ static int print_all_port_gids(struct ibv_context *ctx, 
uint8_t port_num, int tb
return rc;
 }
 
+static const char *link_layer_str(uint8_t link_layer)
+{
+   switch (link_layer) {
+   case IBV_LINK_LAYER_UNSPECIFIED:
+   case IBV_LINK_LAYER_INFINIBAND:
+   return IB;
+   case IBV_LINK_LAYER_ETHERNET:
+   return Ethernet;
+   default:
+   return Unknown;
+   }
+}
+
 static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 {
struct ibv_context *ctx;
@@ -284,6 +297,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t 
ib_port)
printf(\t\t\tsm_lid:\t\t\t%d\n, port_attr.sm_lid);
printf(\t\t\tport_lid:\t\t%d\n, port_attr.lid);
printf(\t\t\tport_lmc:\t\t0x%02x\n, port_attr.lmc);
+   printf(\t\t\tlink_layer:\t\t%s\n, 
link_layer_str(port_attr.link_layer));
 
if (verbose) {
printf(\t\t\tmax_msg_sz:\t\t0x%x\n, 
port_attr.max_msg_sz);
diff --git a/examples/pingpong.c b/examples/pingpong.c
index b916f59..806f446 100644
--- a/examples/pingpong.c
+++ b/examples/pingpong.c
@@ -31,6 +31,10 @@
  */
 
 #include pingpong.h
+#include arpa/inet.h
+#include stdlib.h
+#include stdio.h
+#include string.h
 
 enum ibv_mtu pp_mtu_to_enum(int mtu)
 {
@@ -53,3 +57,30 @@ uint16_t pp_get_local_lid(struct ibv_context *context, int 
port)
 
return attr.lid;
 }
+
+int pp_get_port_info(struct ibv_context *context, int port,
+struct ibv_port_attr *attr)
+{
+   return ibv_query_port(context, port, attr);
+}
+
+void wire_gid_to_gid(const char *wgid, union ibv_gid *gid)
+{
+   char tmp[9];
+   uint32_t v32;
+   int i;
+
+   for (tmp[8] = 0, i = 0; i  4; ++i) {
+   memcpy(tmp, wgid + i * 8, 8);
+   sscanf(tmp, %x, v32);
+   *(uint32_t *)(gid-raw[i * 4]) = ntohl(v32);
+   }
+}
+
+void gid_to_wire_gid(const union ibv_gid *gid, char wgid[])
+{
+   int i;
+
+   for (i = 0; i  4; ++i)
+   sprintf(wgid[i * 8], %08x, htonl(*(uint32_t *)(gid-raw + i 
* 4)));
+}
diff --git a/examples/pingpong.h b/examples/pingpong.h
index 71d7c3f..9cdc03e 100644
--- a/examples/pingpong.h
+++ b/examples/pingpong.h
@@ -37,5 +37,9 @@
 
 enum ibv_mtu pp_mtu_to_enum(int mtu);
 uint16_t pp_get_local_lid(struct ibv_context *context, int port);
+int pp_get_port_info(struct ibv_context *context, int port,
+struct ibv_port_attr *attr);
+void wire_gid_to_gid(const char *wgid, union ibv_gid *gid);
+void gid_to_wire_gid(const union ibv_gid *gid, char wgid[]);
 
 #endif /* IBV_PINGPONG_H */
diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index fa969e0..a63905d 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -67,17 +67,19 @@ struct pingpong_context {
int  size;
int  rx_depth;
int  pending;
+   struct ibv_port_attr portinfo;
 };
 
 struct pingpong_dest {
int lid;
int qpn;
int psn;
+   union ibv_gid gid;
 };
 
 static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
  enum ibv_mtu mtu, int sl,
- struct pingpong_dest *dest)
+ struct pingpong_dest *dest, int sgid_idx)
 {
struct ibv_qp_attr attr = {
.qp_state   = IBV_QPS_RTR,
@@ -94,6 +96,13 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int 
port, int my_psn,
.port_num   = port
}
};
+
+   if (dest-gid.global.interface_id) {
+   attr.ah_attr.is_global = 1;
+   attr.ah_attr.grh.hop_limit = 1;
+   attr.ah_attr.grh.dgid = dest-gid;
+   attr.ah_attr.grh.sgid_index = sgid_idx;
+   }
   

[PATCHv9 1/2] libmlx4: Add IBoE support

2010-08-06 Thread Eli Cohen
Modify libmlx4 to support IBoE. The change involves retrieving the ethernet
layer 2 address of a port based on its GID and source index through a new
userspace call, resolve_eth_gid(), and embedding the layer 2 information in
the address vector representation of mlx4.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---

Changes from V8:
1. Move resolving of GID to MAC from kernel space to user space.

 src/mlx4.h  |4 
 src/qp.c|8 +++-
 src/verbs.c |   28 
 src/wqe.h   |6 --
 4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/src/mlx4.h b/src/mlx4.h
index 4445998..4b12456 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -236,11 +236,15 @@ struct mlx4_av {
uint8_t hop_limit;
uint32_tsl_tclass_flowlabel;
uint8_t dgid[16];
+   uint8_t mac[8];
 };
 
 struct mlx4_ah {
struct ibv_ah   ibv_ah;
struct mlx4_av  av;
+   uint16_tvlan;
+   uint8_t mac[6];
+   uint8_t tagged;
 };
 
 static inline unsigned long align(unsigned long val, unsigned long align)
diff --git a/src/qp.c b/src/qp.c
index d194ae3..fa70889 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -143,6 +143,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg 
*dseg,
memcpy(dseg-av, to_mah(wr-wr.ud.ah)-av, sizeof (struct mlx4_av));
dseg-dqpn = htonl(wr-wr.ud.remote_qpn);
dseg-qkey = htonl(wr-wr.ud.remote_qkey);
+   dseg-vlan = htons(to_mah(wr-wr.ud.ah)-vlan);
+   memcpy(dseg-mac, to_mah(wr-wr.ud.ah)-mac, 6);
 }
 
 static void __set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg)
@@ -281,6 +283,10 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr 
*wr,
set_datagram_seg(wqe, wr);
wqe  += sizeof (struct mlx4_wqe_datagram_seg);
size += sizeof (struct mlx4_wqe_datagram_seg) / 16;
+   if (to_mah(wr-wr.ud.ah)-tagged) {
+   ctrl-ins_vlan = 1  6;
+   ctrl-vlan_tag = 
htons(to_mah(wr-wr.ud.ah)-vlan);
+   }
break;
 
default:
@@ -393,7 +399,7 @@ out:
 
if (nreq == 1  inl  size  1  size  ctx-bf_buf_size / 16) {
ctrl-owner_opcode |= htonl((qp-sq.head  0x)  8);
-   *(uint32_t *) ctrl-reserved |= qp-doorbell_qpn;
+   *(uint32_t *) (ctrl-vlan_tag) |= qp-doorbell_qpn;
/*
 * Make sure that descriptor is written to memory
 * before writing to BlueFlame page.
diff --git a/src/verbs.c b/src/verbs.c
index 1ac1362..756796f 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -40,6 +40,7 @@
 #include pthread.h
 #include errno.h
 #include netinet/in.h
+#include infiniband/driver.h
 
 #include mlx4.h
 #include mlx4-abi.h
@@ -617,6 +618,8 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 {
struct mlx4_ah *ah;
+   struct ibv_port_attr port_attr;
+   uint8_t is_mcast;
 
ah = malloc(sizeof *ah);
if (!ah)
@@ -642,7 +645,32 @@ struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct 
ibv_ah_attr *attr)
memcpy(ah-av.dgid, attr-grh.dgid.raw, 16);
}
 
+   if (ibv_query_port(pd-context, attr-port_num, port_attr))
+   goto err;
+
+   if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
+   if (resolve_eth_gid(pd, attr-port_num,
+   (union ibv_gid *)ah-av.dgid,
+   attr-grh.sgid_index,
+   ah-mac, ah-vlan,
+   ah-tagged, is_mcast))
+   goto err;
+
+   if (is_mcast) {
+   ah-av.dlid = htons(0xc000);
+   ah-av.port_pd |= htonl(1  31);
+   }
+   if (ah-tagged) {
+   ah-av.port_pd |= htonl(1  29);
+   ah-vlan |= (attr-sl  7)  13;
+   }
+   }
+
+
return ah-ibv_ah;
+err:
+   free(ah);
+   return NULL;
 }
 
 int mlx4_destroy_ah(struct ibv_ah *ah)
diff --git a/src/wqe.h b/src/wqe.h
index 6f7f309..1e6159c 100644
--- a/src/wqe.h
+++ b/src/wqe.h
@@ -54,7 +54,8 @@ enum {
 
 struct mlx4_wqe_ctrl_seg {
uint32_towner_opcode;
-   uint8_t reserved[3];
+   uint16_tvlan_tag;
+   uint8_t ins_vlan;
uint8_t fence_size;
/*
 * High 24 bits are SRC remote buffer; low 8 bits are flags:
@@ -78,7 +79,8 @@ struct mlx4_wqe_datagram_seg {
uint32_tav[8];

Re: yet again the atomic operations

2010-08-06 Thread Roland Dreier
  So if the CPU writes/reads to/from the same address, even atomically
  (lock), there might be room for some inconsistency on the values? It
  is not really atomic from the whole system point of view, just for the
  HCA? If so, is there any possibility to make the whole operation
  'system-wide' atomic?

PCI does not have any capability for atomic operations until PCI Express
3.0 (not available in any real devices yet).  So any current HCA
performing atomic operations across a PCI bus will always have to do
read-modify-write which leaves a window for the CPU to mess things up if
it accesses the same location.

You can work around this by creating a loopback connection (ie an RC
connection from the local HCA to itself) and post atomic operations to
that QP instead of accessing the memory directly with the CPU.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Roland Dreier
  The proposed path optimization is removing of dynamic allocations 
  by redefining a structure definition passed to kernel. 

  To 
  
  struct ibv_post_send {
  __u32 command;
  __u16 in_words;
  __u16 out_words;
  __u64 response;
  __u32 qp_handle;
  __u32 wr_count;
  __u32 sge_count;
  __u32 wqe_size;
  struct ibv_kern_send_wr send_wr[512];
  };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?  (ie
allocate memory but don't free the first time into post_send, and only
reallocate if a bigger send request comes, and only free when destroying
the context)
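
A minimal sketch of that caching idea; the structure and helper names below are
hypothetical illustration only, not existing uverbs code, and access would have
to be serialized per context:

#include <linux/slab.h>

/* Hypothetical per-context cache: reuse one buffer across post_send calls,
 * grow it only when a larger request arrives, free it at context teardown. */
struct cached_wr_buf {
	void	*buf;
	size_t	 len;
};

static void *get_cached_wr_buf(struct cached_wr_buf *c, size_t len)
{
	void *n;

	if (len <= c->len)
		return c->buf;		/* cached buffer is already big enough */

	n = kmalloc(len, GFP_KERNEL);
	if (!n)
		return NULL;

	kfree(c->buf);			/* drop the smaller cached buffer */
	c->buf = n;
	c->len = len;
	return n;
}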

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Jason Gunthorpe
On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

 Currently the transmit/receive path works following way: User calls
 ibv_post_send() where vendor specific function is called.  When the
 path should go through kernel the ibv_cmd_post_send() is called.
 The function creates the POST_SEND message body that is passed to
 kernel.  As the number of sges is unknown the dynamic allocation for
 message body is performed.  (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead?  I'm pretty skeptical... alloca will generally boil down to
one or two assembly instructions adjusting the stack pointer, and not
even that if you are lucky and it can be merged into the function
prologue.

 In the kernel the message body is parsed and a structure of wr and
 sges is recreated using dynamic allocations in kernel The goal of
 this operation is having a similar structure like in user space.

.. the kmalloc call(s) on the other hand definitely seem worth
looking at ..

 In kernel in ib_uverbs_post_send() instead of dynamic allocation of
 the ib_send_wr structures the table of 512 ib_send_wr structures
 will be defined and all entries will be linked to unidirectional
 list so qp->device->post_send(qp, wr, bad_wr) API will be not
 changed.

Isn't there a kernel API already for managing a pool of
pre-allocated fixed-size allocations?

It isn't clear to me that is even necessary, Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.
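
One existing facility along those lines is the slab allocator's kmem_cache; a
minimal, purely illustrative sketch (cache name and helpers are hypothetical):

#include <linux/errno.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* A slab cache of fixed-size ib_send_wr objects. */
static struct kmem_cache *send_wr_cache;

static int wr_cache_init(void)
{
	send_wr_cache = kmem_cache_create("uverbs_send_wr",
					  sizeof(struct ib_send_wr),
					  0, 0, NULL);
	return send_wr_cache ? 0 : -ENOMEM;
}

static struct ib_send_wr *wr_alloc(void)
{
	return kmem_cache_alloc(send_wr_cache, GFP_KERNEL);
}

static void wr_free(struct ib_send_wr *wr)
{
	kmem_cache_free(send_wr_cache, wr);
}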

 As I know no driver uses that kernel path to posting buffers so
 iWARP multicast acceleration implemented in NES driver Would be a
 first application that can utilize the optimized path.

??

Jason


Re: [PATCHv9 3/4] libibverbs: Add API to resolve GID to L2 address

2010-08-06 Thread Jason Gunthorpe
On Fri, Aug 06, 2010 at 05:43:37PM +0300, Eli Cohen wrote:

 +int resolve_eth_gid(struct ibv_pd *pd, uint8_t port_num,
 + union ibv_gid *dgid, uint8_t sgid_index,
 + uint8_t mac[], uint16_t *vlan, uint8_t *tagged,
 + uint8_t *is_mcast);

Still missing the const on dgid

It also just occurred to me that this should be ibv_resolve_eth_gid to
keep within the symbol namespace prefix used by libibverbs.
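
Folding both comments in, the declaration would presumably read (a sketch of
the review feedback, not a merged API):

int ibv_resolve_eth_gid(struct ibv_pd *pd, uint8_t port_num,
			const union ibv_gid *dgid, uint8_t sgid_index,
			uint8_t mac[], uint16_t *vlan, uint8_t *tagged,
			uint8_t *is_mcast);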

Jason


Re: [PATCHv9 09/12] ib_core: Add VLAN support to IBoE

2010-08-06 Thread Jason Gunthorpe
On Fri, Aug 06, 2010 at 05:41:53PM +0300, Eli Cohen wrote:
 -static inline void iboe_mac_to_ll(union ib_gid *gid, u8 *mac)
 +static inline void iboe_mac_vlan_to_ll(union ib_gid *gid, u8 *mac, u16 vid)
  {
  	memset(gid->raw, 0, 16);
  	*((u32 *)gid->raw) = cpu_to_be32(0xfe800000);
 -	gid->raw[12] = 0xfe;
 -	gid->raw[11] = 0xff;
 +	if (vid) {
 +		gid->raw[12] = vid & 0xff;
 +		gid->raw[11] = vid >> 8;
 +	} else {
 +		gid->raw[12] = 0xfe;
 +		gid->raw[11] = 0xff;
 +	}
  	memcpy(gid->raw + 13, mac + 3, 3);
  	memcpy(gid->raw + 8, mac, 3);
  	gid->raw[8] ^= 2;

My general comment on this would be the same I made for userspace:
Don't assume VID == 0 means no vlan tag. Use 0x or something that
is actually invalid.
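
For reference, the layout the hunk builds (fe80::/64 prefix, EUI-64 derived
from the MAC, VID folded into bytes 11-12) can be written out in user space
roughly as below; this is a sketch of the patch's scheme that also follows
the review comment by treating 0xffff, not 0, as the "no VLAN" value:

#include <stdint.h>
#include <string.h>

static void mac_vlan_to_gid(const uint8_t mac[6], uint16_t vid, uint8_t gid[16])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80::/64 link-local prefix */
	gid[1] = 0x80;
	memcpy(gid + 8, mac, 3);
	gid[8] ^= 2;			/* flip the universal/local bit */
	if (vid != 0xffff) {		/* 0xffff taken to mean "untagged" */
		gid[11] = vid >> 8;
		gid[12] = vid & 0xff;
	} else {
		gid[11] = 0xff;		/* plain EUI-64 filler bytes */
		gid[12] = 0xfe;
	}
	memcpy(gid + 13, mac + 3, 3);
}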

Jason


Re: yet again the atomic operations

2010-08-06 Thread Ralph Campbell
On Fri, 2010-08-06 at 04:43 -0700, Rui Machado wrote:
 Hi there,
 
  There are two kinds supported. QLogic's driver does them in
  the host driver so they are atomic with respect to all the CPUs
  in the host. Mellanox uses HCA wide atomic which means the
  HCA will do a memory read/write without allowing other reads
  or writes from different QP operations passing through that
  HCA to get in between. The CPUs on the host won't see
  atomic operations since from their perspective, it looks
  like a normal read and write from the PCIe bus.
 
 So if the CPU writes/reads to/from the same address, even atomically
 (lock), there might be room for some inconsistency on the values? It
 is not really atomic from the whole system point of view, just for the
 HCA? If so, is there any possibility to make the whole operation
 'system-wide' atomic?

Correct.
It won't be consistent from the HCA's point of view if other HCAs
or CPUs are modifying the memory - even if they do it atomically.
It is only consistent if a single HCA is doing atomic ops to the
memory.

There is no possibility to change this unless PCIe atomic
operations are used by the HCA and if the root complex supports
atomic operations. I don't know of any HCAs or root complex
chips which have this support yet.

  You can see what type the HCA supports with ibv_devinfo -v
  and look for atomic_cap: ATOMIC_HCA (1) or
  atomic_cap: ATOMIC_GLOB (2).
 
 ATOMIC_HCA (1) is what I see in my Mellanox hardware. This is the case
 you mentioned, without allowing other reads or writes from different
 QP operations passing through that HCA to get in between
 ATOMIC_GLOB (2) means with respect to all HCAs and even the CPU?

Correct.
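
The same information ibv_devinfo prints can also be read programmatically;
a minimal check:

#include <stdio.h>
#include <infiniband/verbs.h>

static void print_atomic_cap(struct ibv_context *ctx)
{
	struct ibv_device_attr attr;

	if (ibv_query_device(ctx, &attr))
		return;

	if (attr.atomic_cap == IBV_ATOMIC_GLOB)
		printf("atomics are global (coherent with other HCAs and the CPUs)\n");
	else if (attr.atomic_cap == IBV_ATOMIC_HCA)
		printf("atomics are only atomic within this HCA\n");
	else
		printf("atomic operations not supported\n");
}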

 Cheers,
 Rui
 




Re: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Ralph Campbell
On Fri, 2010-08-06 at 03:03 -0700, Walukiewicz, Miroslaw wrote:
 Currently the ibv_post_send()/ibv_post_recv() path through kernel 
 (using /dev/infiniband/rdmacm) could be optimized by removing dynamic memory 
 allocations on the path. 
 
 Currently the transmit/receive path works following way:
 User calls ibv_post_send() where vendor specific function is called. 
 When the path should go through kernel the ibv_cmd_post_send() is called.
  The function creates the POST_SEND message body that is passed to kernel. 
 As the number of sges is unknown the dynamic allocation for message body is 
 performed. 
 (see libibverbs/src/cmd.c)
 
 In the kernel the message body is parsed and a structure of wr and sges is 
 recreated using dynamic allocations in kernel 
 The goal of this operation is having a similar structure like in user space. 
 
 The proposed path optimization is removing of dynamic allocations 
 by redefining a structure definition passed to kernel. 
 From 
 
 struct ibv_post_send {
 __u32 command;
 __u16 in_words;
 __u16 out_words;
 __u64 response;
 __u32 qp_handle;
 __u32 wr_count;
 __u32 sge_count;
 __u32 wqe_size;
 struct ibv_kern_send_wr send_wr[0];
 };
 To 
 
 struct ibv_post_send {
 __u32 command;
 __u16 in_words;
 __u16 out_words;
 __u64 response;
 __u32 qp_handle;
 __u32 wr_count;
 __u32 sge_count;
 __u32 wqe_size;
 struct ibv_kern_send_wr send_wr[512];
 };
 
 Similar change is required in kernel  struct ib_uverbs_post_send defined in 
 /ofa_kernel/include/rdma/ib_uverbs.h
 
 This change limits a number of send_wr passed from unlimited (assured by 
 dynamic allocation) to reasonable number of 512. 
 I think this number should be a max number of QP entries available to send. 
 As the all iB/iWARP applications are low latency applications so the number 
 of WRs passed are never unlimited.
 
 As the result instead of dynamic allocation the ibv_cmd_post_send() fills the 
 proposed structure 
 directly and passes it to kernel. Whenever the number of send_wr number 
 exceeds the limit the ENOMEM error is returned.
 
 In kernel  in ib_uverbs_post_send() instead of dynamic allocation of the 
 ib_send_wr structures 
 the table of 512  ib_send_wr structures  will be defined and 
 all entries will be linked to unidirectional list so 
 qp->device->post_send(qp, wr, &bad_wr) API will be not changed. 
 
 As I know no driver uses that kernel path to posting buffers so iWARP 
 multicast acceleration implemented in NES driver 
 Would be a first application that can utilize the optimized path. 
 
 Regards,
 
 Mirek
 
 Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com

The libipathverbs.so plug-in for libibverbs and
the ib_ipath and ib_qib kernel modules use this path for
ibv_post_send().



[PATCH 01/10] pre-reserve MPTs for FC

2010-08-06 Thread Vu Pham

From: Yevgeny Petrilin yevge...@mellanox.co.il
Date: Sun, 16 Nov 2008 10:25:59 +0200
Subject: [PATCH] mlx4: Fibre Channel support

As we did with QPs, some of the MPTs are pre-reserved
(the MPTs that are mapped for FEXCHs, 2*64K of them).
So needed to split the operation of allocating an MPT to two:
The allocation of a bit from the bitmap
The actual creation of the entry (and it's MTT).
So, mr_alloc_reserved() is the second part, where you know which MPT number was allocated.
mr_alloc() is the one that allocates a number from the bitmap.
Normal users keep using the original mr_alloc().
For FEXCH, when we know the pre-reserved MPT entry, we call mr_alloc_reserved() directly.

Same with the mr_free() and corresponding mr_free_reserved().
The first will just put back the bit, the later will actually
destroy the entry, but will leave the bit set.

map_phys_fmr_fbo() is very much like the original map_phys_fmr()
 allows setting an FBO (First Byte Offset) for the MPT
 allows setting the data length for the MPT
 does not increase the higher bits of the key after every map.

Signed-off-by: Yevgeny Petrilin yevge...@mellanox.co.il
Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

 drivers/net/mlx4/main.c |4 +-
 drivers/net/mlx4/mr.c   |  128 +-
 include/linux/mlx4/device.h |   21 +++-
 include/linux/mlx4/qp.h |   11 +++-
 4 files changed, 144 insertions(+), 20 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index e3e0d54..38fbf01 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -79,12 +79,12 @@ static char mlx4_version[] __devinitdata =
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
 static struct mlx4_profile default_profile = {
-	.num_qp		= 1 << 17,
+	.num_qp		= 1 << 18,
 	.num_srq	= 1 << 16,
 	.rdmarc_per_qp	= 1 << 4,
 	.num_cq		= 1 << 16,
 	.num_mcg	= 1 << 13,
-	.num_mpt	= 1 << 17,
+	.num_mpt	= 1 << 19,
 	.num_mtt	= 1 << 20,
 };
 
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 3dc69be..7185c17 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -52,7 +52,9 @@ struct mlx4_mpt_entry {
 	__be64 length;
 	__be32 lkey;
 	__be32 win_cnt;
-	u8	reserved1[3];
+	u8	reserved1;
+	u8	flags2;
+	u8	reserved2;
 	u8	mtt_rep;
 	__be64 mtt_seg;
 	__be32 mtt_sz;
@@ -71,6 +73,8 @@ struct mlx4_mpt_entry {
 #define MLX4_MPT_PD_FLAG_RAE	(1 << 28)
 #define MLX4_MPT_PD_FLAG_EN_INV	(3 << 24)
 
+#define MLX4_MPT_FLAG2_FBO_EN	 (1 << 7)
+
 #define MLX4_MPT_STATUS_SW		0xF0
 #define MLX4_MPT_STATUS_HW		0x00
 
@@ -263,6 +267,21 @@ static int mlx4_HW2SW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox
 			!mailbox, MLX4_CMD_HW2SW_MPT, MLX4_CMD_TIME_CLASS_B);
 }
 
+int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd,
+			   u64 iova, u64 size, u32 access, int npages,
+			   int page_shift, struct mlx4_mr *mr)
+{
+	mr->iova   = iova;
+	mr->size   = size;
+	mr->pd	   = pd;
+	mr->access = access;
+	mr->enabled= 0;
+	mr->key	   = hw_index_to_key(mridx);
+
+	return mlx4_mtt_init(dev, npages, page_shift, &mr->mtt);
+}
+EXPORT_SYMBOL_GPL(mlx4_mr_alloc_reserved);
+
 int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 		  int npages, int page_shift, struct mlx4_mr *mr)
 {
@@ -274,14 +293,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 	if (index == -1)
 		return -ENOMEM;
 
-	mr->iova   = iova;
-	mr->size

[PATCH 06/10] enable T11 bit for mlx4 device

2010-08-06 Thread Vu Pham

Enable T11 bit in mlx4 device

Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

 drivers/net/mlx4/fw.c   |   13 +
 include/linux/mlx4/device.h |5 -
 2 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index 04f42ae..1286b72 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -51,6 +51,10 @@ static int enable_qos;
 module_param(enable_qos, bool, 0444);
 MODULE_PARM_DESC(enable_qos, "Enable Quality of Service support in the HCA (default: off)");
 
+static int mlx4_pre_t11_mode = 0;
+module_param_named(enable_pre_t11_mode, mlx4_pre_t11_mode, int, 0644);
+MODULE_PARM_DESC(enable_pre_t11_mode, "For FCoXX, enable pre-t11 mode if non-zero (default: 0)");
+
 #define MLX4_GET(dest, source, offset)  \
 	do {			  \
 		void *__p = (char *) (source) + (offset);	  \
@@ -792,6 +796,8 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, struct mlx4_init_hca_param *param)
 
 	MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET);
 	MLX4_PUT(inbox, param->log_uar_sz,  INIT_HCA_LOG_UAR_SZ_OFFSET);
+	if (!mlx4_pre_t11_mode && dev->caps.flags & (u32) MLX4_DEV_CAP_FLAG_FC_T11)
+		*(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 10);
 
 	err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 10000);
 
@@ -890,3 +896,10 @@ int mlx4_NOP(struct mlx4_dev *dev)
 	/* Input modifier of 0x1f means finish as soon as possible. */
 	return mlx4_cmd(dev, 0, 0x1f, 0, MLX4_CMD_NOP, 100);
 }
+
+void mlx4_get_fc_t11_settings(struct mlx4_dev *dev, int *enable_pre_t11, int *t11_supported)
+{
+	*enable_pre_t11 = mlx4_pre_t11_mode;
+	*t11_supported = dev->caps.flags & MLX4_DEV_CAP_FLAG_FC_T11;
+}
+EXPORT_SYMBOL_GPL(mlx4_get_fc_t11_settings);
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 8afac02..d173008 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -67,7 +67,8 @@ enum {
 	MLX4_DEV_CAP_FLAG_ATOMIC	= 1 << 18,
 	MLX4_DEV_CAP_FLAG_RAW_MCAST	= 1 << 19,
 	MLX4_DEV_CAP_FLAG_UD_AV_PORT	= 1 << 20,
-	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21
+	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1 << 21,
+	MLX4_DEV_CAP_FLAG_FC_T11	= 1 << 31
 };
 
 enum {
@@ -491,4 +492,6 @@ int mlx4_fmr_free_reserved(struct mlx4_dev *dev, struct mlx4_fmr *fmr);
 int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr);
 int mlx4_SYNC_TPT(struct mlx4_dev *dev);
 
+void mlx4_get_fc_t11_settings(struct mlx4_dev *dev, int *enable_pre_t11, int *t11_supported);
+
 #endif /* MLX4_DEVICE_H */


[PATCH 09/10] enable mlx4_fc, mlx4_fcoib in scsi Kconfig, makefile

2010-08-06 Thread Vu Pham


Enable mlx4_fc (fcoe/fcoib offload driver) and mlx4_fcoib (discovery driver)
entries in scsi/Kconfig and Makefile

Signed-off-by: Vu Pham v...@mellanx.com

--- a/drivers/scsi/Makefile	2010-06-28 11:16:37.0 -0700
+++ a/drivers/scsi/Makefile	2010-05-12 09:31:15.0 -0700
@@ -40,6 +40,8 @@
 obj-$(CONFIG_LIBFCOE)		+= fcoe/
 obj-$(CONFIG_FCOE)		+= fcoe/
 obj-$(CONFIG_FCOE_FNIC)		+= fnic/
+obj-$(CONFIG_MLX4_FC)		+= mlx4_fc/
+obj-$(CONFIG_MLX4_FCOIB)	+= mlx4_fc/
 obj-$(CONFIG_ISCSI_TCP) 	+= libiscsi.o	libiscsi_tcp.o iscsi_tcp.o
 obj-$(CONFIG_INFINIBAND_ISER) 	+= libiscsi.o
 obj-$(CONFIG_SCSI_A4000T)	+= 53c700.o	a4000t.o
 
--- a/drivers/scsi/Kconfig	2010-06-28 11:16:37.0 -0700
+++ a/drivers/scsi/Kconfig	2010-05-12 09:36:56.0 -0700
@@ -687,6 +687,20 @@
 	  <file:Documentation/scsi/scsi.txt>.
 	  The module will be called fnic.
 
+config MLX4_FC
+	tristate "Mellanox FC module"
+	select LIBFC
+	select LIBFCOE
+	---help---
+	Mellanox Fibre Channel over Ethernet/Infiniband module
+
+config MLX4_FCOIB
+	tristate "Mellanox FCoIB discovery module"
+	depends on INFINIBAND
+	select MLX4_FC
+	---help---
+	Fibre Channel over Infiniband discovery module
+
 config SCSI_DMX3191D
 	tristate "DMX3191D SCSI support"
 	depends on PCI && SCSI


[PATCH 05/10] query ib device from given mlx4 device

2010-08-06 Thread Vu Pham

Adding API to query ib_device with mlx4_dev

Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

 drivers/infiniband/hw/mlx4/main.c |   10 +-
 drivers/net/mlx4/intf.c   |   20 
 drivers/net/mlx4/main.c   |   10 +++---
 drivers/net/mlx4/mlx4.h   |1 +
 include/linux/mlx4/driver.h   |   10 ++
 7 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 4e94e36..e071229 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -58,6 +58,12 @@ static const char mlx4_ib_version[] =
 	DRV_NAME ": Mellanox ConnectX InfiniBand driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+static void *get_ibdev(struct mlx4_dev *dev, void *ctx, u8 port)
+{
+   struct mlx4_ib_dev *mlxibdev = ctx;
+   return &mlxibdev->ib_dev;
+}
+
 static void init_query_mad(struct ib_smp *mad)
 {
 	mad->base_version  = 1;
@@ -749,7 +755,9 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 static struct mlx4_interface mlx4_ib_interface = {
 	.add	= mlx4_ib_add,
 	.remove	= mlx4_ib_remove,
-	.event	= mlx4_ib_event
+	.event	= mlx4_ib_event,
+	.get_prot_dev = get_ibdev,
+	.protocol = MLX4_PROT_IB
 };
 
 static int __init mlx4_ib_init(void)
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index beeed80..f8f97f9 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -191,3 +191,23 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 
 	mutex_unlock(&intf_mutex);
 }
+
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	struct mlx4_device_context *dev_ctx;
+	unsigned long flags;
+	void *result = NULL;
+
+	spin_lock_irqsave(&priv->ctx_lock, flags);
+
+	list_for_each_entry(dev_ctx, &priv->ctx_list, list)
+		if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) {
+			result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port);
+			break;
+	}
+
+	spin_unlock_irqrestore(&priv->ctx_lock, flags);
+
+	return result;
+}
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 38fbf01..f14f0d6 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -105,6 +105,12 @@ static int log_mtts_per_seg = ilog2(MLX4_MTT_ENTRY_PER_SEG);
 module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444);
 MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment (1-5)");
 
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
+{
+	return mlx4_find_get_prot_dev(dev, proto, port);
+}
+EXPORT_SYMBOL(mlx4_get_prot_dev);
+
 int mlx4_check_port_params(struct mlx4_dev *dev,
 			   enum mlx4_port_type *port_type)
 {
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 416aeca..9c62019 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -364,6 +364,7 @@ int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port);
+void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
 
 struct mlx4_dev_cap;
 struct mlx4_init_hca_param;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 55b45a6..94c9617 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -49,17 +49,27 @@ enum mlx4_query_reply {
 	MLX4_QUERY_MINE_NOPORT 	= 0
 };
 
+enum mlx4_prot {
+	MLX4_PROT_IB,
+	MLX4_PROT_EN,
+};
+
 struct mlx4_interface {
 	void *			(*add)	 (struct mlx4_dev *dev);
 	void			(*remove)(struct mlx4_dev *dev, void *context);
 	void			(*event) (struct mlx4_dev *dev, void *context,
 	  enum mlx4_dev_event event, int port);
+	void *  (*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port);
+	enum mlx4_prot  protocol;
+
 	enum mlx4_query_reply	(*query) (void *context, void *);
 	struct list_head	list;
 };
 
 int mlx4_register_interface(struct mlx4_interface *intf);
 void mlx4_unregister_interface(struct mlx4_interface *intf);
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port);
+
 struct mlx4_dev *mlx4_query_interface(void *, int *port);
 
 #endif /* MLX4_DRIVER_H */


[PATCH 07/10] query the steer capabilities of mlx4 device

2010-08-06 Thread Vu Pham

Add API to query the steer capabilities of mlx4 device

Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 4408b96..1777965 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -396,6 +394,14 @@ struct mlx4_init_port_param {
 	u64			si_guid;
 };
 
+static inline void mlx4_query_steer_cap(struct mlx4_dev *dev, int *log_mac,
+	int *log_vlan, int *log_prio)
+{
+	*log_mac = dev->caps.log_num_macs;
+	*log_vlan = dev->caps.log_num_vlans;
+	*log_prio = dev->caps.log_num_prios;
+}
+
 #define mlx4_foreach_port(port, dev, type)				\
 	for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++)	\
 		if (((type) == MLX4_PORT_TYPE_IB ? (dev)->caps.port_mask : \


[PATCH 08/10] enable vlan support in mlx4 qp path

2010-08-06 Thread Vu Pham

Enable vlan support in qp path, allow traffic to be encapsulated in tagged
vlan frames.

Signed-off-by: Vu Pham v...@mellanox.com


diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 7abe643..1e53d45 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -109,7 +109,7 @@ struct mlx4_qp_path {
 	__be32			tclass_flowlabel;
 	u8			rgid[16];
 	u8			sched_queue;
-	u8			snooper_flags;
+	u8			vlan_index;
 	u8			reserved3[2];
 	u8			counter_index;
 	u8			reserved4[7];


[PATCH 00/10] Add fcoe, fcoib drivers for mlx4 device

2010-08-06 Thread Vu Pham

Hi Roland,

The following series implements fcoe and fcoib offload drivers for the mlx4
device.


mlx4_fc: implements fcoe/fcoib, hooks into the scsi mid-layer to offload scsi
operations, and uses openfc's libfc for ELS/BLS
mlx4_fcoib: implements the fcoib initialization protocol to discover
IB-FC gateways/bridges


Yevgeny Petrilin:
   Pre-reserve MPTs for FC
   Attach cq to the least cqs attached completion vector
   Enable T11 bit support in fw
  Add API to query the steer capabilities of mlx4 device

Oren Duer:
  Add APIs to mlx4_en/mlx4_ib driver to query interfaces for given 
internal device

  Add MPT reserve/release_range APIs

Vu Pham:
   Enable vlan support in qp path
   Enable mlx4_fc/mlx4_fcoib driver in scsi Kconfig/Makefile
   Add mlx4_fc/mlx4_fcoib drivers 


drivers/infiniband/hw/mlx4/cq.c   |4 +-
drivers/infiniband/hw/mlx4/main.c |   10 +-
drivers/net/mlx4/cq.c |   27 +-
drivers/net/mlx4/en_cq.c  |2 +-
drivers/net/mlx4/en_main.c|   14 +
drivers/net/mlx4/fw.c |   13 +
drivers/net/mlx4/intf.c   |   50 +
drivers/net/mlx4/main.c   |   10 +-
drivers/net/mlx4/mlx4.h   |2 +
drivers/net/mlx4/mr.c |   29 +-
drivers/scsi/Kconfig  |   14 +
drivers/scsi/Makefile |2 +
drivers/scsi/mlx4_fc/Makefile |8 +
drivers/scsi/mlx4_fc/fcoib.h  |  561 +
drivers/scsi/mlx4_fc/fcoib_api.h  |  102 ++
drivers/scsi/mlx4_fc/fcoib_discover.c | 2003 +
drivers/scsi/mlx4_fc/fcoib_main.c | 1340 ++
drivers/scsi/mlx4_fc/mfc.c| 1992 
drivers/scsi/mlx4_fc/mfc.h|  662 +++
drivers/scsi/mlx4_fc/mfc_exch.c   | 1502 
drivers/scsi/mlx4_fc/mfc_rfci.c   |  990 
drivers/scsi/mlx4_fc/mfc_sysfs.c  |  243 
include/linux/mlx4/device.h   |   20 +-
include/linux/mlx4/driver.h   |   17 +
include/linux/mlx4/qp.h   |2 +-
include/rdma/ib_verbs.h   |   10 +-
26 files changed, 9606 insertions(+), 23 deletions(-)



[PATCH 02/10] api to query mlx4_en device for given mlx4 device

2010-08-06 Thread Vu Pham

mlx4: Add API to query interfaces for given internal device

Updated mlx4_en interface to provide a query function for its
internal net_device structure.

Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

 drivers/net/mlx4/en_main.c  |   14 ++
 drivers/net/mlx4/intf.c |   30 ++
 include/linux/mlx4/driver.h |7 +++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c
index cbabf14..6fce433 100644
--- a/drivers/net/mlx4/en_main.c
+++ b/drivers/net/mlx4/en_main.c
@@ -262,10 +262,24 @@ err_free_res:
 	return NULL;
 }
 
+enum mlx4_query_reply mlx4_en_query(void *endev_ptr, void *int_dev)
+{
+	struct mlx4_en_dev *mdev = endev_ptr;
+	struct net_device *netdev = int_dev;
+	int p;
+	
+	for (p = 1; p <= MLX4_MAX_PORTS; ++p)
+		if (mdev->pndev[p] == netdev)
+			return p;
+
+	return MLX4_QUERY_NOT_MINE;
+}
+
 static struct mlx4_interface mlx4_en_interface = {
 	.add	= mlx4_en_add,
 	.remove	= mlx4_en_remove,
 	.event	= mlx4_en_event,
+	.query  = mlx4_en_query
 };
 
 static int __init mlx4_en_init(void)
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 5550678..beeed80 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -114,6 +114,36 @@ void mlx4_unregister_interface(struct mlx4_interface *intf)
 }
 EXPORT_SYMBOL_GPL(mlx4_unregister_interface);
 
+struct mlx4_dev *mlx4_query_interface(void *int_dev, int *port)
+{
+	struct mlx4_priv *priv;
+	struct mlx4_device_context *dev_ctx;
+	enum mlx4_query_reply r;
+	unsigned long flags;
+
+	mutex_lock(&intf_mutex);
+
+	list_for_each_entry(priv, &dev_list, dev_list) {
+		spin_lock_irqsave(&priv->ctx_lock, flags);
+		list_for_each_entry(dev_ctx, &priv->ctx_list, list) {
+			if (!dev_ctx->intf->query)
+				continue;
+			r = dev_ctx->intf->query(dev_ctx->context, int_dev);
+			if (r != MLX4_QUERY_NOT_MINE) {
+				*port = r;
+				spin_unlock_irqrestore(&priv->ctx_lock, flags);
+				mutex_unlock(&intf_mutex);
+				return &priv->dev;
+			}
+		}
+		spin_unlock_irqrestore(&priv->ctx_lock, flags);
+	}
+
+	mutex_unlock(&intf_mutex);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(mlx4_query_interface);
+
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 53c5fdb..55b45a6 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -44,15 +44,22 @@ enum mlx4_dev_event {
 	MLX4_DEV_EVENT_PORT_REINIT,
 };
 
+enum mlx4_query_reply {
+	MLX4_QUERY_NOT_MINE	= -1,
+	MLX4_QUERY_MINE_NOPORT 	= 0
+};
+
 struct mlx4_interface {
 	void *			(*add)	 (struct mlx4_dev *dev);
 	void			(*remove)(struct mlx4_dev *dev, void *context);
 	void			(*event) (struct mlx4_dev *dev, void *context,
 	  enum mlx4_dev_event event, int port);
+	enum mlx4_query_reply	(*query) (void *context, void *);
 	struct list_head	list;
 };
 
 int mlx4_register_interface(struct mlx4_interface *intf);
 void mlx4_unregister_interface(struct mlx4_interface *intf);
+struct mlx4_dev *mlx4_query_interface(void *, int *port);
 
 #endif /* MLX4_DRIVER_H */


[PATCH 04/10] remove default reservation of fexch qps and mpts

2010-08-06 Thread Vu Pham

mlx4_core: removed reservation of FEXCH QPs and MPTs

mlx4_fc module will reserve them upon loading.
Added mpt reserve_range and release_range functions.

Signed-off-by: Oren Duer o...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com

 drivers/net/mlx4/main.c |4 +---
 drivers/net/mlx4/mr.c   |   29 +++--
 include/linux/mlx4/device.h |7 ++-
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 38fbf01..bbf773d 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -259,12 +259,10 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		(1 << dev->caps.log_num_vlans) *
 		(1 << dev->caps.log_num_prios) *
 		dev->caps.num_ports;
-	dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FC_EXCH] = MLX4_NUM_FEXCH;
 
 	dev->caps.reserved_qps = dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FW] +
 		dev->caps.reserved_qps_cnt[MLX4_QP_REGION_ETH_ADDR] +
-		dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FC_ADDR] +
-		dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FC_EXCH];
+		dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FC_ADDR];
 
 	return 0;
 }
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 7185c17..5f07e0c 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -267,6 +267,28 @@ static int mlx4_HW2SW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox
 			!mailbox, MLX4_CMD_HW2SW_MPT, MLX4_CMD_TIME_CLASS_B);
 }
 
+int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align, u32 *base_mridx)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	u32 mridx;
+
+	mridx = mlx4_bitmap_alloc_range(&priv->mr_table.mpt_bitmap, cnt, align);
+	if (mridx == -1)
+		return -ENOMEM;
+
+	*base_mridx = mridx;
+	return 0;
+
+}
+EXPORT_SYMBOL_GPL(mlx4_mr_reserve_range);
+
+void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	mlx4_bitmap_free_range(&priv->mr_table.mpt_bitmap, base_mridx, cnt);
+}
+EXPORT_SYMBOL_GPL(mlx4_mr_release_range);
+
 int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd,
 			   u64 iova, u64 size, u32 access, int npages,
 			   int page_shift, struct mlx4_mr *mr)
@@ -486,13 +508,8 @@ int mlx4_init_mr_table(struct mlx4_dev *dev)
 	if (!is_power_of_2(dev->caps.num_mpts))
 		return -EINVAL;
 
-	dev->caps.num_fexch_mpts =
-		2 * dev->caps.reserved_qps_cnt[MLX4_QP_REGION_FC_EXCH];
-	dev->caps.reserved_fexch_mpts_base = dev->caps.num_mpts -
-		dev->caps.num_fexch_mpts;
 	err = mlx4_bitmap_init(&mr_table->mpt_bitmap, dev->caps.num_mpts,
-			   ~0, dev->caps.reserved_mrws,
-			   dev->caps.reserved_fexch_mpts_base);
+			   ~0, dev->caps.reserved_mrws, 0);
 	if (err)
 		return err;
 
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 4664d1d..8afac02 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -151,7 +151,6 @@ enum mlx4_qp_region {
 	MLX4_QP_REGION_FW = 0,
 	MLX4_QP_REGION_ETH_ADDR,
 	MLX4_QP_REGION_FC_ADDR,
-	MLX4_QP_REGION_FC_EXCH,
 	MLX4_NUM_QP_REGION
 };
 
@@ -167,10 +166,6 @@ enum mlx4_special_vlan_idx {
 	MLX4_VLAN_REGULAR
 };
 
-enum {
-	MLX4_NUM_FEXCH  = 64 * 1024,
-};
-
 #define MLX4_LEAST_ATTACHED_VECTOR	0xffffffff
 
 static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor)
@@ -426,6 +421,8 @@ int mlx4_mtt_init(struct mlx4_dev *dev, int npages, int page_shift,
 void mlx4_mtt_cleanup(struct mlx4_dev *dev, struct mlx4_mtt *mtt);
 u64 mlx4_mtt_addr(struct mlx4_dev *dev, struct mlx4_mtt *mtt);
 
+int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align, u32 *base_mridx);
+void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt);
 int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd,
 			   u64 iova, u64 size, u32 access, int npages,
 			   int page_shift, struct mlx4_mr *mr);


[PATCH 03/10] attach cq to least cqs attached completion vector

2010-08-06 Thread Vu Pham

When the vector number passed to mlx4_cq_alloc is MLX4_LEAST_ATTACHED_VECTOR
the driver selects the completion vector that has the least CQs attached
to it and attaches the CQ to the chosen vector.
IB_CQ_VECTOR_LEAST_ATTACHED is defined in rdma/ib_verbs.h; when the mlx4_ib driver
receives this cq vector number, it uses MLX4_LEAST_ATTACHED_VECTOR 

Signed-off-by: Yevgeny Petrilin yevge...@mellanox.co.il
Signed-off-by: Vu Pham v...@mellanx.com
 

 drivers/infiniband/hw/mlx4/cq.c |4 +++-
 drivers/net/mlx4/cq.c   |   27 +++
 drivers/net/mlx4/en_cq.c|2 +-
 drivers/net/mlx4/mlx4.h |1 +
 include/linux/mlx4/device.h |2 ++
 include/rdma/ib_verbs.h |   10 +-
 6 files changed, 39 insertions(+), 7 deletions(-)
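
A consumer would presumably opt in by passing the new constant as the
completion vector when creating its CQ; a sketch against the ib_create_cq()
signature of that era (the handler names and variables are placeholders):

	cq = ib_create_cq(ibdev, my_comp_handler, my_event_handler, ctx,
			  nent, IB_CQ_VECTOR_LEAST_ATTACHED);
	if (IS_ERR(cq))
		return PTR_ERR(cq);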

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 5a219a2..2687970 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -223,7 +223,9 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 	}
 
 	err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar,
-			cq->db.dma, &cq->mcq, vector, 0);
+			cq->db.dma, &cq->mcq,
+			vector == IB_CQ_VECTOR_LEAST_ATTACHED ?
+			MLX4_LEAST_ATTACHED_VECTOR : vector, 0);
 	if (err)
 		goto err_dbmap;
 
diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c
index 7cd34e9..a6f03f9 100644
--- a/drivers/net/mlx4/cq.c
+++ b/drivers/net/mlx4/cq.c
@@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq,
 }
 EXPORT_SYMBOL_GPL(mlx4_cq_resize);
 
+static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv)
+{
+	int i;
+	int index = 0;
+	int min = priv->eq_table.eq[0].load;
+
+	for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) {
+		if (priv->eq_table.eq[i].load < min) {
+			index = i;
+			min = priv->eq_table.eq[i].load;
+		}
+	}
+
+	return index;
+}
+
 int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
 		  struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq,
 		  unsigned vector, int collapsed)
@@ -198,10 +214,11 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
 	u64 mtt_addr;
 	int err;
 
-	if (vector >= dev->caps.num_comp_vectors)
-		return -EINVAL;
+	cq->vector = (vector == MLX4_LEAST_ATTACHED_VECTOR) ?
+		mlx4_find_least_loaded_vector(priv) : vector;
 
-	cq->vector = vector;
+	if (cq->vector >= dev->caps.num_comp_vectors)
+		return -EINVAL;
 
 	cq->cqn = mlx4_bitmap_alloc(&cq_table->bitmap);
 	if (cq->cqn == -1)
@@ -232,7 +249,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
 
 	cq_context->flags	= cpu_to_be32(!!collapsed << 18);
 	cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index);
-	cq_context->comp_eqn	= priv->eq_table.eq[vector].eqn;
+	cq_context->comp_eqn	= priv->eq_table.eq[cq->vector].eqn;
 	cq_context->log_page_size   = mtt->page_shift - MLX4_ICM_PAGE_SHIFT;
 
 	mtt_addr = mlx4_mtt_addr(dev, mtt);
@@ -245,6 +262,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
 	if (err)
 		goto err_radix;
 
+	priv->eq_table.eq[cq->vector].load++;
 	cq->cons_index = 0;
 	cq->arm_sn = 1;
 	cq->uar= uar;
@@ -282,6 +300,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
 		mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
 
 	synchronize_irq(priv->eq_table.eq[cq->vector].irq);
+	priv->eq_table.eq[cq->vector].load--;
 
 	spin_lock_irq(&cq_table->lock);
 	radix_tree_delete(&cq_table->tree, cq->cqn);
diff --git a/drivers/net/mlx4/en_cq.c b/drivers/net/mlx4/en_cq.c
index 21786ad..f3dc8b7 100644
--- a/drivers/net/mlx4/en_cq.c
+++ b/drivers/net/mlx4/en_cq.c
@@ -56,7 +56,7 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv,
 		cq->vector   = ring % mdev->dev->caps.num_comp_vectors;
 	} else {
 		cq->buf_size = sizeof(struct mlx4_cqe);
-		cq->vector   = 0;
+		cq->vector   = MLX4_LEAST_ATTACHED_VECTOR;
 	}
 
 	cq->ring = ring;
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 13343e8..416aeca 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -138,6 +138,7 @@ struct mlx4_eq {
 	u16			irq;
 	u16			have_irq;
 	int			nent;
+	int			load;
 	struct