Hi Ira,

On 16:43 Mon 14 Dec     , Ira Weiny wrote:
> 
> I have found that the following patch caused our SRP connected storage to
> break.
> 
> patch: 3d20f82edd3246879063b77721d0bcef927bdc48
> 
>     opensm/osm_sa_path_record.c: separate router guid resolution code
>     
>     Move off subnet destination (router address) resolution code to separate
>     function to improve readability.
>     
>     Signed-off-by: Sasha Khapyorsky <sas...@voltaire.com>
> 
> I further traced the problem to pr_rcv_build_pr and the patch below fixes the
> bug.

Nice catch and great investigation!

> 
> 16:03:34 > git diff 
> diff --git a/opensm/opensm/osm_sa_path_record.c 
> b/opensm/opensm/osm_sa_path_record.c
> index be0cd71..32c889f 100644
> --- a/opensm/opensm/osm_sa_path_record.c
> +++ b/opensm/opensm/osm_sa_path_record.c
> @@ -800,7 +800,7 @@ static void pr_rcv_build_pr(IN osm_sa_t * sa, IN const 
> osm_port_t * p_src_port,
>  
>         /* Only set HopLimit if going through a router */
>         if (is_nonzero_gid)
> -               p_pr->hop_flow_raw |= cl_hton32(IB_HOPLIMIT_MAX);
> +               p_pr->hop_flow_raw |= IB_HOPLIMIT_MAX;
>  
>         p_pr->pkey = p_parms->pkey;
>         ib_path_rec_set_sl(p_pr, p_parms->sl);
> 
> 
> "hop_flow_raw" is not really a 32bit value but rather 4 fields:
> Component     Length(bits)   Offset(bits)
> RawTraffic    1              352
> Reserved      3              353
> FlowLabel     20             356
> HopLimit      8              384

This patch doesn't look correct for me. Instead of placing
IB_HOPLIMIT_MAX in proper place this will put it in RawTraffic and
Reserved fields on little-endian machines (such as x86). And this
changes nothing on a big-endian machine. Right?

Following your investigation it seems that SRP target breaks when
IB_HOPLIMIT_MAX is set in SA PR query responses for intra-subnet DGIDs.

> However, I don't understand the comment "Only set HopLimit if going through a
> router"?

This is from '#ifdef ROUTER_EXP' days - as far as I could understand
HopLimit should be set to IB_HOPLIMIT_MAX for inter-subnet MADs.

> 
> Was the intent to only set p_dgid in pr_rcv_get_end_points if we are heading
> through a router?

Seems that this was an original logic.

> I don't think it matters if we are going through a router
> or not does it?

I thought so too, but as we can see it triggers the bug you discovered.

So immediate fix for this is to move p_dgid setup back to router
section, like this:


diff --git a/opensm/opensm/osm_sa_path_record.c 
b/opensm/opensm/osm_sa_path_record.c
index 7855be5..fbeef03 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -1236,6 +1236,8 @@ static ib_net16_t pr_rcv_get_end_points(IN osm_sa_t * sa,
                                sa_status = IB_SA_MAD_STATUS_INVALID_GID;
                                goto Exit;
                        }
+                       if (p_dgid)
+                               *p_dgid = p_pr->dgid;
                } else
                        dest_guid = p_pr->dgid.unicast.interface_id;
 
@@ -1253,9 +1255,6 @@ static ib_net16_t pr_rcv_get_end_points(IN osm_sa_t * sa,
                        sa_status = IB_SA_MAD_STATUS_INVALID_GID;
                        goto Exit;
                }
-
-               if (p_dgid)
-                       *p_dgid = p_pr->dgid;
        } else {
                *pp_dest_port = 0;
                if (comp_mask & IB_PR_COMPMASK_DLID) {


BTW, could you test this? Unfortunately I don't have SRP target :(

And seems that as more solid solution we need to cleanup some logic in
osm_sa_path_record.c code.

Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to