Re: [PATCH v4 09/14] IB/cm: Expose BTH P_Key in CM and SIDR request events

2015-08-31 Thread Haggai Eran
On 30/08/2015 21:23, Sagi Grimberg wrote:
> 
> Looks like for some reason cm_get_bth_pkey got pkey_index of 0x
> instead of 0 (working on the default pkey 0x at entry 0).

It looks like the mlx5 driver doesn't interpret the completion format
correctly. It takes a field defined in the programmer reference manual
as pkey, and interprets it as pkey_index [1].
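
A minimal userspace sketch of the mix-up described above. It assumes a toy per-port P_Key table and that the completion carried the default P_Key 0xffff; the names `pkey_table`, `get_pkey`, `buggy_index_from_cqe` and the table size are hypothetical, not kernel APIs. Storing the P_Key value as if it were a table index gives index 65535, and a bounds-checked lookup then fails with -EINVAL (-22 on Linux), which is consistent with the "pkey index 65535 ... -22" lines in the quoted log.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_TABLE_LEN 128	/* hypothetical table size */

/* Toy per-port P_Key table; entry 0 holds the default P_Key. */
struct pkey_table {
	uint16_t entry[PKEY_TABLE_LEN];
};

/* Index -> P_Key lookup that rejects out-of-range indices. */
static int get_pkey(const struct pkey_table *t, unsigned int index,
		    uint16_t *pkey)
{
	assert(t && pkey);
	if (index >= PKEY_TABLE_LEN)
		return -EINVAL;
	*pkey = t->entry[index];
	return 0;
}

/*
 * The suspected bug, modelled: the low 16 bits of the completion word
 * are the P_Key value itself, but they are handed on as a table index.
 */
static unsigned int buggy_index_from_cqe(uint32_t imm_inval_pkey)
{
	return imm_inval_pkey & 0xffff;
}
```

With the default P_Key in the completion, `buggy_index_from_cqe()` returns 65535 and `get_pkey()` rejects it, while index 0 would have returned the expected entry.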

> log:
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8584) started
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8585) started
> mlx5_0:dump_cqe:238:(pid 8584): dump error cqe
>    
>    
> 002b   
>  94003004 002c b8e0
> ib_srpt receiving failed for idx 0 with status 4
> :04:00.0:poll_health:151:(pid 0): device's health compromised
> assert_var[0] 0x0094
> assert_var[1] 0x
> assert_var[2] 0x
> assert_var[3] 0x
> assert_var[4] 0x
> assert_exit_ptr 0x0061d35c
> assert_callra 0x0067a5f4
> fw_ver 0xa0641900
> hw_id 0x01ff
> irisc_index 2
> synd 0x1: firmware internal error
> ext_sync 0x
> :04:00.0:health_care:76:(pid 7943): handling bad device here
> ib_srpt Received DREQ and sent DREP for session 
> 0x0002c90300ed0960.
> ib_srpt Received DREQ and sent DREP for session 
> 0x0002c90300ed0960.
> ib_srpt Received IB TimeWait exit for cm_id 88046d1fb200.
> ib_srpt Received IB TimeWait exit for cm_id 880454ffa000.
> ib_srpt Session 0x0002c90300ed0960: kernel thread 
> ib_srpt_compl (PID 8585) stopped

I don't know how that can cause all the other errors though.

Haggai

[1]
http://lxr.free-electrons.com/source/drivers/infiniband/hw/mlx5/cq.c?v=4.1#L230
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mlx4 problems with 4.2-rc8

2015-08-31 Thread Matan Barak



On 8/31/2015 1:38 AM, Doug Ledford wrote:
> On 08/29/2015 09:13 PM, Or Gerlitz wrote:
>> On Fri, Aug 28, 2015 at 10:27 PM, Doug Ledford  wrote:
>>> I'm seeing this with rc8 on a dual port mlx4 adapter set to IB/Eth mode:
>>
>> mmm, both Amir and myself are just finishing vacations... so WB notes
>> are not always lovely as you want them to be, life
>>
>>> [   77.883513] IPv6: ADDRCONF(NETDEV_UP): mlx4_roce: link is not ready
>>> [   77.892044] mlx4_en: mlx4_roce:   frag:0 - size:1518 prefix:0 stride:1536
>>> [   77.903129] genirq: Flags mismatch irq 135. 
>>> (mlx4-65@:05:00.0) vs.  (mlx4-65@:05:00.0)
>>
>> is this strict regression from some known point in the past on this
>> system/config -- i.e 4.1 or 4.2-rc1?!
>
> Yes.  When I was submitting the 4.2-rc changes this machine worked.
> This is one of my IB/Eth SRIOV machines.  I tested with SRIOV disabled
> and it didn't affect things.
>
>> Can you please send the mlx4 driver output when you load it with debug
>> prints on? also do things work if you set the ports type to be ib/ib
>> or eth/eth?
>
> It should work as ib/ib given that in ib/eth mode the ib port works.  I
> doubt eth/eth would work, but I'll try and see.  OK, Eth/Eth mode fails
> too (at least on the second port, I can't say on the first port for
> certain as I can't bring it up, it's still plugged into an IB switch).
> However, now in Eth/Eth mode, attempts to bring up the interface
> manually at the command line have hung, which it didn't do in IB/Eth mode.
>
> I'll try to pin things down further, but that's what I have so far.
>
> And as requested, the config is attached.
>
>> send us your compressed .config
>>
>> Matan, any idea what goes wrong here?
>>
>> Or.




[   77.914965] CPU: 0 PID: 1541 Comm: NetworkManager Not tainted
4.2.0-rc8 #58
[   77.923292] Hardware name: Dell Inc. PowerEdge R820/04K5X5, BIOS
2.2.3 07/09/2014
[   77.932205]   c16e3ce1 8820365ab498
8167e6ff
[   77.941072]   8820339e9a00 8820365ab4f8
810d2b6e
[   77.949938]  0246 881032e67aa4 881035e10ba0
c16e3ce1
[   77.958812] Call Trace:
[   77.962109]  [] dump_stack+0x45/0x57
[   77.968412]  [] __setup_irq+0x51e/0x590
[   77.975018]  [] ? mlx4_interrupt+0x80/0x80 [mlx4_core]
[   77.983072]  [] request_threaded_irq+0xf4/0x1a0
[   77.990468]  [] mlx4_assign_eq+0x135/0x360 [mlx4_core]
[   77.998513]  [] mlx4_en_activate_cq+0x2a7/0x310
[mlx4_en]
[   78.006853]  [] ? alloc_cpumask_var_node+0x28/0x40
[   78.014542]  [] ? find_next_bit+0x19/0x20
[   78.021334]  [] ? cpumask_next_and+0x34/0x50
[   78.028425]  [] mlx4_en_start_port+0x1bb/0xb60
[mlx4_en]
[   78.036689]  [] ? mlx4_free_cmd_mailbox+0x31/0x40
[mlx4_core]
[   78.045435]  [] mlx4_en_open+0x349/0x630 [mlx4_en]
[   78.053107]  [] __dev_open+0xc9/0x140
[   78.059538]  [] __dev_change_flags+0xa1/0x160
[   78.066718]  [] dev_change_flags+0x29/0x60
[   78.073602]  [] do_setlink+0x5be/0xa70
[   78.080097]  [] ? mga_imageblit+0x2f/0x40 [mgag200]
[   78.087859]  [] ? mga_dirty_update+0x1e6/0x2f0
[mgag200]
[   78.096112]  [] ? mga_imageblit+0x2f/0x40 [mgag200]
[   78.103873]  [] rtnl_newlink+0x4f0/0x880
[   78.110586]  [] ? rtnl_newlink+0xf3/0x880
[   78.117372]  [] ? security_capable+0x48/0x60
[   78.124452]  [] ? ns_capable+0x2d/0x60
[   78.130950]  [] rtnetlink_rcv_msg+0xa4/0x250
[   78.138028]  [] ? sock_has_perm+0x70/0x90
[   78.144824]  [] ? rtnetlink_rcv+0x40/0x40
[   78.151615]  [] netlink_rcv_skb+0xaf/0xc0
[   78.158425]  [] rtnetlink_rcv+0x2c/0x40
[   78.164997]  [] netlink_unicast+0x101/0x1f0
[   78.171937]  [] netlink_sendmsg+0x401/0x660
[   78.178867]  [] sock_sendmsg+0x38/0x50
[   78.185335]  [] ___sys_sendmsg+0x275/0x290
[   78.192176]  [] ? sysctl_head_finish+0x46/0x50
[   78.199411]  [] ? proc_sys_call_handler+0x88/0xe0
[   78.206946]  [] ? lockref_put_or_lock+0x4c/0x80
[   78.214296]  [] __sys_sendmsg+0x57/0xa0
[   78.220878]  [] SyS_sendmsg+0x12/0x20
[   78.227283]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[   78.235114] mlx4_en :05:00.0: Failed assigning an EQ to
\xfff\xffb6Z6
\xff88\x\x\xff84\xffa20\xff81\x\x\x\x
[   78.243732] mlx4_en: mlx4_roce: Failed activating Rx CQ
[   78.319027] mlx4_en: mlx4_roce: Failed starting port:2

The interface in question is unusable.

--
Doug Ledford 
   GPG KeyID: 0E572FDD







Actually, this looks like the stack dump we got before [1], which has
already been fixed. It happens when the mlx4 driver is used on setups
where the number of cores is >= 32.

Doug, is that the case?

[1] http://www.spinics.net/lists/netdev/msg341171.html


[PATCH] mlx5: Fix incorrect wc pkey_index assignment for GSI messages

2015-08-31 Thread Sagi Grimberg
Since patch series "Demux IB CM requests in the rdma_cm module" the
P_Key index is taken from the work completion rather than the message
itself (see http://www.spinics.net/lists/netdev/msg335599.html).

The HCA provides us with the message P_Key. In order
to provide the P_Key index, we need to look it up. Given
that this is relevant only for GSI messages (session establishments)
which is less performance critical, micro-optimize against the GSI
(is_qp1) branch.

Signed-off-by: Sagi Grimberg 
Cc: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/cq.c  |   10 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |5 +
 drivers/infiniband/hw/mlx5/qp.c  |5 -
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 640c54e..3dfd287 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 #include "user.h"
 
@@ -227,7 +228,14 @@ static void handle_responder(struct ib_wc *wc, struct 
mlx5_cqe64 *cqe,
wc->dlid_path_bits = cqe->ml_path;
g = (be32_to_cpu(cqe->flags_rqpn) >> 28) & 3;
wc->wc_flags |= g ? IB_WC_GRH : 0;
-   wc->pkey_index = be32_to_cpu(cqe->imm_inval_pkey) & 0x;
+   if (unlikely(is_qp1(qp->ibqp.qp_type))) {
+   u16 pkey = be32_to_cpu(cqe->imm_inval_pkey) & 0x;
+
+   ib_find_cached_pkey(>ib_dev, qp->port, pkey,
+   >pkey_index);
+   } else {
+   wc->pkey_index = 0;
+   }
 }
 
 static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index fc987fe..a4ef6a7 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -656,6 +656,11 @@ static inline u8 convert_access(int acc)
   MLX5_PERM_LOCAL_READ;
 }
 
+static inline int is_qp1(enum ib_qp_type qp_type)
+{
+   return qp_type == IB_QPT_GSI;
+}
+
 #define MLX5_MAX_UMR_SHIFT 16
 #define MLX5_MAX_UMR_PAGES (1 << MLX5_MAX_UMR_SHIFT)
 
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 9380d2d..8c51ea3 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -76,11 +76,6 @@ static int is_qp0(enum ib_qp_type qp_type)
return qp_type == IB_QPT_SMI;
 }
 
-static int is_qp1(enum ib_qp_type qp_type)
-{
-   return qp_type == IB_QPT_GSI;
-}
-
 static int is_sqp(enum ib_qp_type qp_type)
 {
return is_qp0(qp_type) || is_qp1(qp_type);
-- 
1.7.1
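
A rough userspace model of the value-to-index lookup the commit message describes, for readers who want to see the shape of it. This is a sketch under stated assumptions, not the kernel's `ib_find_cached_pkey()`: the names `find_pkey_index`, `wc_pkey_index` and `enum qp_type` are hypothetical, and the matching rule (compare the low 15 bits, prefer a full-member entry with bit 15 set) is an assumption about how such a lookup usually behaves. As in the patch, only the GSI case pays for the table walk and the lookup's return value is not checked; unlike the patch, the sketch falls back to index 0 when nothing matches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u	/* membership bit */
#define PKEY_BASE_MASK   0x7fffu	/* key bits compared during lookup */

/* P_Key value -> table index; prefers a full-member match (assumption). */
static int find_pkey_index(const uint16_t *table, size_t len,
			   uint16_t pkey, uint16_t *index)
{
	int partial = -1;

	assert(table && index);
	for (size_t i = 0; i < len; i++) {
		if ((table[i] & PKEY_BASE_MASK) != (pkey & PKEY_BASE_MASK))
			continue;
		if (table[i] & PKEY_FULL_MEMBER) {
			*index = (uint16_t)i;
			return 0;
		}
		if (partial < 0)
			partial = (int)i;	/* remember first limited-member hit */
	}
	if (partial >= 0) {
		*index = (uint16_t)partial;
		return 0;
	}
	return -ENOENT;
}

enum qp_type { QPT_SMI, QPT_GSI, QPT_RC };	/* hypothetical stand-ins */

/* Only GSI completions do the lookup; everything else reports index 0. */
static uint16_t wc_pkey_index(enum qp_type type, uint32_t imm_inval_pkey,
			      const uint16_t *table, size_t len)
{
	uint16_t index = 0;

	if (type == QPT_GSI)
		(void)find_pkey_index(table, len,
				      (uint16_t)(imm_inval_pkey & 0xffff),
				      &index);
	return index;
}
```

Because session establishment is rare compared with data traffic, a linear walk on the GSI path is a reasonable trade-off in this model.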



Re: [PATCH v3 01/49] IB/core: Add header definitions

2015-08-31 Thread ira.weiny
Hal,

I'm working on a couple of patches to address these comments.  I will be
submitting them in the next day or so.

On Wed, Jun 17, 2015 at 10:12:41AM -0400, Hal Rosenstock wrote:
> On 6/17/2015 8:28 AM, Mike Marciniszyn wrote:
> > From: Ira Weiny 
> > 
> > Add common OPA header definitions for driver
> > build:
> > - opa_port_info.h
> > - opa_smi.h
> > - hfi1_user.sh
> > 
> > Additionally, ib_mad.h, has additional definitions
> > that are common to ib_drivers including:
> > - trap support
> > - cca support
> > 
> > The qib driver has the duplication removed in favor of
> > those in ib_mad.h
> > 
> > Reviewed-by: Mike Marciniszyn 
> > Reviewed-by: John, Jubin 
> > Signed-off-by: Ira Weiny 
> > ---
> >  drivers/infiniband/hw/qib/qib_mad.h |  147 +---
> >  include/rdma/ib_mad.h   |  138 +++
> >  include/rdma/opa_port_info.h|  433 
> > +++
> 
> Should opa_port_info.h be in include/rdma or in drivers/infiniband/hw/hfi1 ?

This file and opa_smi.h were placed here following the pattern of the
corresponding ib_*.h files.  Since there is currently only one OPA driver,
they could indeed be moved to the hfi1 driver directory.  However, I prefer
to keep them in include/rdma.  If Doug prefers, I can send a patch to move them.



> > +
> > +/*
> > + * Generic trap/notice producers
> > + */
> > +#define IB_NOTICE_PROD_CA  cpu_to_be16(1)
> > +#define IB_NOTICE_PROD_SWITCH  cpu_to_be16(2)
> > +#define IB_NOTICE_PROD_ROUTER  cpu_to_be16(3)
> > +#define IB_NOTICE_PROD_CLASS_MGR   cpu_to_be16(4)
> > +
> > +/*
> > + * Generic trap/notice numbers
> 
> SM Class trap/notice numbers
> 
> As such, should they be in ib_smi.h rather than ib_mad.h ?

Fixed in my patch series.

> 
> > + */
> > +#define IB_NOTICE_TRAP_LLI_THRESH  cpu_to_be16(129)
> > +#define IB_NOTICE_TRAP_EBO_THRESH  cpu_to_be16(130)
> > +#define IB_NOTICE_TRAP_FLOW_UPDATE cpu_to_be16(131)
> > +#define IB_NOTICE_TRAP_CAP_MASK_CHG cpu_to_be16(144)
> > +#define IB_NOTICE_TRAP_SYS_GUID_CHG cpu_to_be16(145)
> > +#define IB_NOTICE_TRAP_BAD_MKEY cpu_to_be16(256)
> > +#define IB_NOTICE_TRAP_BAD_PKEY cpu_to_be16(257)
> > +#define IB_NOTICE_TRAP_BAD_QKEY cpu_to_be16(258)
> > +
> > +/*
> > + * Repress trap/notice flags
> > + */
> > +#define IB_NOTICE_REPRESS_LLI_THRESH   (1 << 0)
> > +#define IB_NOTICE_REPRESS_EBO_THRESH   (1 << 1)
> > +#define IB_NOTICE_REPRESS_FLOW_UPDATE  (1 << 2)
> > +#define IB_NOTICE_REPRESS_CAP_MASK_CHG (1 << 3)
> > +#define IB_NOTICE_REPRESS_SYS_GUID_CHG (1 << 4)
> > +#define IB_NOTICE_REPRESS_BAD_MKEY (1 << 5)
> > +#define IB_NOTICE_REPRESS_BAD_PKEY (1 << 6)
> > +#define IB_NOTICE_REPRESS_BAD_QKEY (1 << 7)
> 
> What does this correspond to ? Is this some standard thing or are these
> defines driver specific ?
>

Fixed in my patch series.

> 
> > +
> > +/*
> > + * Generic trap/notice other local changes flags (trap 144).
> 
> SM Class trap/notice other local changes flags (trap 144)
> 
> As such, should they be in ib_smi.h rather than ib_mad.h ?

Fixed in my patch series.

> 
> > + */
> > +#define IB_NOTICE_TRAP_LSE_CHG 0x04    /* Link Speed Enable changed */
> > +#define IB_NOTICE_TRAP_LWE_CHG 0x02    /* Link Width Enable changed */
> > +#define IB_NOTICE_TRAP_NODE_DESC_CHG   0x01
> > +
> > +/*
> > + * Generic trap/notice M_Key volation flags in dr_trunc_hop (trap 256).
> 
> SM Class trap/notice M_Key violation flags in dr_trunc_hop (trap 256)
> 
> As such, should they be in ib_smi.h rather than ib_mad.h ?

Fixed in my patch series.

> 
> > + */
> > +#define IB_NOTICE_TRAP_DR_NOTICE   0x80
> > +#define IB_NOTICE_TRAP_DR_TRUNC    0x40
> > +
> >  enum {
> > IB_MGMT_MAD_HDR = 24,
> > IB_MGMT_MAD_DATA = 232,
> > @@ -240,6 +294,90 @@ struct ib_class_port_info {
> > __be32  trap_qkey;
> >  };
> >  
> > +struct ib_node_info {
> > +   u8 base_version;
> > +   u8 class_version;
> > +   u8 node_type;
> > +   u8 num_ports;
> > +   __be64 sys_guid;
> > +   __be64 node_guid;
> > +   __be64 port_guid;
> > +   __be16 partition_cap;
> > +   __be16 device_id;
> > +   __be32 revision;
> > +   u8 local_port_num;
> > +   u8 vendor_id[3];
> > +} __packed;
> 
> This is SM attribute. Should it go into ib_smi.h like ib_port_info ?

Fixed in my patch series.
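
As a side note on the `__packed` annotation in the quoted struct, here is a small userspace sketch of why it matters for a wire-format attribute. It mirrors the quoted field list with fixed-width types (the `__be64`/`__be16`/`__be32` fields become plain integers here, and the name `struct node_info` is a stand-in), and uses the GCC/Clang `__attribute__((packed))` spelling as an assumption about the toolchain. Without packing, a compiler may pad after the four leading bytes so that `sys_guid` is naturally aligned, which would shift every later field.

```c
#include <assert.h>	/* C11: provides static_assert */
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the quoted layout; big-endian fields are raw ints. */
struct node_info {
	uint8_t  base_version;
	uint8_t  class_version;
	uint8_t  node_type;
	uint8_t  num_ports;
	uint64_t sys_guid;
	uint64_t node_guid;
	uint64_t port_guid;
	uint16_t partition_cap;
	uint16_t device_id;
	uint32_t revision;
	uint8_t  local_port_num;
	uint8_t  vendor_id[3];
} __attribute__((packed));

/* 4 + 3*8 + 2*2 + 4 + 1 + 3 = 40 bytes, with no padding anywhere. */
static_assert(sizeof(struct node_info) == 40, "unexpected padding");
static_assert(offsetof(struct node_info, sys_guid) == 4,
	      "sys_guid must directly follow the four leading bytes");
static_assert(offsetof(struct node_info, vendor_id) == 37,
	      "vendor_id must occupy the last three bytes");
```

The compile-time checks fail the build if the layout ever drifts from the 40-byte format the field list implies.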

> 
> > +
> > +struct ib_mad_notice_attr {
> > +   u8 generic_type;
> > +   u8 prod_type_msb;
> > +   __be16 prod_type_lsb;
> > +   __be16 trap_num;
> > +   __be16 issuer_lid;
> > +   __be16 toggle_count;
> > +
> > +   union {
> > +   struct {
> > +   u8  details[54];
> > +   } raw_data;
> > +
> > +   struct {
> > +   __be16  reserved;
> > +   __be16  lid;    /* where violation happened */
> > +   u8  port_num;

Re: shrink struct ib_send_wr V3

2015-08-31 Thread Christoph Hellwig
On Sun, Aug 30, 2015 at 06:31:35PM +0300, Sagi Grimberg wrote:
>>   - patch 2 now explicitly replaces the weird overloading in the mlx5
>> driver with an explicit embedding of struct ib_send_wr, similar
>> to what we do for all other MRs.
>
> That's nice,
>
> There is one non-trivial spot that was missed in mlx5_ib_post_send
> though:

Oh, that was a weird abuse of the old casts.

I've folded both your fixes and force pushed to the wr-cleanup branch.

I do not plan to resend the series until the merge window for 4.4
is open.  Doug, any chance you could pick up the first patch in the
series for 4.3-rc?  It's marked for stable as well.
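
For readers unfamiliar with the "explicit embedding" pattern mentioned in the quoted changelog, a minimal userspace sketch follows. The types `send_wr` and `rdma_wr`, the opcodes, and the helper `to_rdma_wr` are hypothetical stand-ins and not the definitions from the series; the point is only that the specialised request embeds the generic one as a member and is recovered with `container_of`, instead of casting between unrelated overlays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum wr_opcode { WR_SEND, WR_RDMA_WRITE };	/* hypothetical opcodes */

/* Generic work request: only the fields every opcode needs. */
struct send_wr {
	struct send_wr *next;
	enum wr_opcode opcode;
};

/* Specialised request embeds the generic one instead of overloading it. */
struct rdma_wr {
	struct send_wr wr;
	uint64_t remote_addr;
	uint32_t rkey;
};

static struct rdma_wr *to_rdma_wr(struct send_wr *wr)
{
	assert(wr);
	return container_of(wr, struct rdma_wr, wr);
}

/* A post-send style consumer checks the opcode before downcasting. */
static uint64_t post_send_remote_addr(struct send_wr *wr)
{
	if (wr->opcode != WR_RDMA_WRITE)
		return 0;
	return to_rdma_wr(wr)->remote_addr;
}
```

Keeping the downcast behind one helper makes spots like the one missed in mlx5_ib_post_send easier to find, since every conversion goes through a named function rather than an ad hoc cast.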


Re: [PATCH] mlx5: Fix incorrect wc pkey_index assignment for GSI messages

2015-08-31 Thread Or Gerlitz
On Mon, Aug 31, 2015 at 6:24 PM, Sagi Grimberg  wrote:
> Since patch series "Demux IB CM requests in the rdma_cm module" the
> P_Key index is taken from the work completion rather than the message itself

so prior to this series nobody in the IB core (and maybe across the
whole upstream kernel) uses ib_wc->pkey_index?!

>  (see http://www.spinics.net/lists/netdev/msg335599.html).

better to have pointer here to upstream commit and not to an archive
URL which is possibly gonna die some day


> The HCA provides us with the message P_Key. In order
> to provide the P_Key index, we need to look it up. Given
> that this is relevant only for GSI messages (session establishments)
> which is less performance critical, micro-optimize against the GSI
> (is_qp1) branch.


Re: mlx4 problems with 4.2-rc8

2015-08-31 Thread Or Gerlitz
On Mon, Aug 31, 2015 at 4:02 PM, Doug Ledford  wrote:
> On 08/31/2015 03:09 AM, Matan Barak wrote:

>> Actually, it looks like the dump stack we've got before [1] was fixed.
>> This happens when the mlx4 driver is used in setups where number of
>> cores >= 32.

>> Doug, is that the case?

> Indeed, 48 cores on this machine.

so do we have bingo here? the patch is in the net-next tree (and we
can get it into 4.2 only through -stable, since 4.2 is released by
now), does it solve the problem?

Or.

>> [1] http://www.spinics.net/lists/netdev/msg341171.html


Re: mlx4 problems with 4.2-rc8

2015-08-31 Thread Doug Ledford
On 08/31/2015 04:21 PM, Or Gerlitz wrote:
> On Mon, Aug 31, 2015 at 4:02 PM, Doug Ledford  wrote:
>> On 08/31/2015 03:09 AM, Matan Barak wrote:
> 
>>> Actually, it looks like the dump stack we've got before [1] was fixed.
>>> This happens when the mlx4 driver is used in setups where number of
>>> cores >= 32.
> 
>>> Doug, is that the case?
> 
>> Indeed, 48 cores on this machine.
> 
> so do we have bingo here? the patch is in the net-next tree (and we
> can get it into 4.2 only through -stable, since 4.2 is released by
> now), does it solve the problem?

Yes, it solved the problem.  I pulled the patch into my testing branch
to confirm.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH] IB/cma: Fix net_dev reference leak with failed requests

2015-08-31 Thread Doug Ledford
On 08/31/2015 11:20 AM, Weiny, Ira wrote:
>>
>> On 08/30/2015 02:12 AM, Or Gerlitz wrote:
>>> On Thu, Aug 27, 2015 at 5:55 AM, Haggai Eran wrote:
>>>> When no matching listening ID is found for a given request, the
>>>> net_dev that was used to find the request isn't released.
>>>>
>>>> Fixes: 20c36836ecad ("IB/cma: Use found net_dev for passive
>>>> connections")
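
The leak described in the quoted changelog is the usual "reference taken, error path forgets to drop it" pattern. A toy userspace sketch, with all names (`net_dev`, `net_dev_get`, `net_dev_put`, `handle_request`) and the -ENOENT return value hypothetical and unrelated to the real cma code: the lookup takes a reference, and the no-listener path has to give it back before returning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy reference-counted device. */
struct net_dev {
	int refcnt;
};

static struct net_dev *net_dev_get(struct net_dev *dev)
{
	assert(dev);
	dev->refcnt++;
	return dev;
}

static void net_dev_put(struct net_dev *dev)
{
	assert(dev && dev->refcnt > 0);
	dev->refcnt--;
}

/*
 * The lookup takes a reference on the device. When no listener matches,
 * the reference must be dropped before returning; on success, ownership
 * of the reference passes to the caller through *owner.
 */
static int handle_request(struct net_dev *candidate, int listener_found,
			  struct net_dev **owner)
{
	struct net_dev *dev = net_dev_get(candidate);

	if (!listener_found) {
		net_dev_put(dev);	/* the release a leaky version omits */
		return -ENOENT;
	}
	*owner = dev;
	return 0;
}
```

In this model a failed request leaves the count where it started, and a successful one leaves exactly one extra reference for the new owner to drop later.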
>>>
>>> same here, Doug,
>>
>> Same as the last email, I have the commit ID now, and I fixed up the commit
>> message.
>>
> 
> Doug I'm working on the clean up Hal suggested to the ib_mad.h file which was 
> updated in your to-be-rebased-4.3 branch via this patch.
> 
> Fixes: abde0260e47b ("IB/core: Add header definitions")
> 
> It looks like this is the patch destined for 4.3 on this branch k.o/for-4.3?
> 
> Fixes: d4ab347005fb ("IB/core: Add core header changes needed for OPA")

Correct.  I squashed the first two patches (which both touched core
files) down to just one.

> I personally did not mind the rebasing except for this issue.
> 
> Let me know which branch I should base these changes off of.

Base it off of k.o/for-4.3.  That's where it will go.  I'll just end
up applying this to the top of the stack.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: shrink struct ib_send_wr V3

2015-08-31 Thread Doug Ledford
On 08/31/2015 12:11 PM, Christoph Hellwig wrote:
> On Sun, Aug 30, 2015 at 06:31:35PM +0300, Sagi Grimberg wrote:
>>>   - patch 2 now explicitly replaces the weird overloading in the mlx5
>>> driver with an explicit embedding of struct ib_send_wr, similar
>>> to what we do for all other MRs.
>>
>> That's nice,
>>
>> There is one non-trivial spot that was missed in mlx5_ib_post_send
>> though:
> 
> Oh, that was a weird abuse of the old casts.
> 
> I've folded both your fixes and force pushed to the wr-cleanup branch.
> 
> I do not plan to resend the series until the merge window for 4.4
> is open.  Doug, any chance you could pick up the first patch in the
> series for 4.3-rc?  It's marked for stable as well.

Yes, I can do that.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH v4 09/14] IB/cm: Expose BTH P_Key in CM and SIDR request events

2015-08-31 Thread Sagi Grimberg

On 8/31/2015 9:50 AM, Haggai Eran wrote:
> On 30/08/2015 21:23, Sagi Grimberg wrote:
>> Looks like for some reason cm_get_bth_pkey got pkey_index of 0x
>> instead of 0 (working on the default pkey 0x at entry 0).
>
> It looks like the mlx5 driver doesn't interpret the completion format
> correctly. It takes a field defined in the programmer reference manual
> as pkey, and interprets it as pkey_index [1].

You're right! I wonder how this ever used to work (and it did...).
So the driver needs to look up a pkey_index on each GSI packet?






> I don't know how that can cause all the other errors though.

Me neither...