On 03/15/13 01:23 AM, brendan.do...@oracle.com wrote:
So, here is the history...
There is an oracle application used to monitor the health of a node
by calling ib_resolve_portid_str_via(node GUID).
We observed that a call to ib_resolve_portid_str_via() which specified
the use of an unconnected port to issue the query was returning -1
with errno as 0. The application interpreted this as the query succeeded
but no paths to the specified node were found and assumed the node
was dead. This obviously was not the case, we could not send the query
because the port was down which is very different to the queried
node being down. But the API does not provide a means to distinguish
between these two failure modes. Hence the reason for updating
errno to allow these two failure modes to be distinguished.
Now in terms of the detail of the failure and the reason for the other
changes. The call to ib_resolve_portid_str_via() did not specify an smid,
so ib_resolve_portid_str_via() calls ib_resolve_guid_via() which first
tries to resolve the SM LID using the unconnected port:
if (!sm_id) {
sm_id = &sm_portid;
if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0)
return -1;
}
The call to this succeeds (return value is not -1) and indicates that the
SM LID is 0, which is of course wrong. Even though port 2 is not
connected, it's SMA is still operational, It receives the get PORT_INFO
SMP (sent as a result of the ib_resolve_smlid_via()) , and returns it
successfully as the value it reads from the adapters PORT_INFO
which happens to be 0. ib_resolve_smlid_via() assumes everything is OK,
it does not bother to check the value of the SM LID returned. We then
try and contact the SM at LID 0!
if ((portid->lid =
ib_path_query_via(srcport, selfgid, portid->gid, sm_id,
buf)) < 0)
return -1;
And this obviously fails, and so ib_resolve_portid_str_via() returns -1
errno 0, and the app thinks the node is down. You could argue that
we should not specify an unconnected port in the call to
ib_resolve_portid_str_via(),
but the changes harden the code, and allow us to distinguish between
no path to the node or could not issue the SA query.
Rdgs
Brendan
On 03/13/13 08:29 PM, Boris Chiu wrote:
fyi,
Boris
Original Message
Subject: Re: [PATCH] libibmad: Fixes for failures when not all ports
of HCA are connected
Date: Wed, 13 Mar 2013 12:36:05 -0700
From: Ira Weiny
To: Boris Chiu
CC: linux-rdma@vger.kernel.org
References: <513f83fd.1090...@oracle.com>
Your commit message does not seem to have anything to do with the
patch. Could you explain how returning errno from these functions
"Fixes for failures when not all ports of HCA are connected"?
Furthermore, I'm reluctant to modify errno in this library. It is not
documented and in general is poor form. I realize that the interface
does not currently allow for an alternative. :-(
More comments below.
On Tue, Mar 12, 2013 at 12:37 PM, Boris Chiu wrote:
> From: Brendan Doyle
>
> Signed-off-by: Brendan Doyle
> ---
> src/resolve.c | 22 ++
> src/rpc.c |1 +
> src/sa.c |2 ++
> 3 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/src/resolve.c b/src/resolve.c
> index f866bf4..ab24c79 100644
> --- a/src/resolve.c
> +++ b/src/resolve.c
> @@ -40,6 +40,7 @@
> #include
> #include
> #include
> +#include
>
> #include
> #include
> @@ -57,10 +58,18 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int
> timeout,
>
> memset(sm_id, 0, sizeof(*sm_id));
>
> - if (!smp_query_via(portinfo,&self, IB_ATTR_PORT_INFO, 0, 0,
> srcport))
> + if (!smp_query_via(portinfo,&self, IB_ATTR_PORT_INFO, 0, 0,
> srcport)) {
> + if (!errno)
> + errno = EIO;
> return -1;
> + }
>
> mad_decode_field(portinfo, IB_PORT_SMLID_F,&lid);
> + if (lid == 0) {
> + if (!errno)
> + errno = EIO;
> + return -1;
> + }
This may not be an error. A port which is down only requires
PortState and PortPhyState to be valid.
> mad_decode_field(portinfo, IB_PORT_SMSL_F,&sm_id->sl);
>
> return ib_portid_set(sm_id, lid, 0, 0);
> @@ -95,21 +104,26 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t
> * guid,
> ib_portid_t * sm_id, int timeout,
> const struct ibmad_port *srcport)
> {
> - ib_portid_t sm_portid;
> + ib_portid_t sm_portid = { 0 };
> uint8_t buf[IB_SA_DATA_SIZE] = { 0 };
> ib_portid_t self = { 0 };
> uint64_t self