Re: [Fwd: Re: [PATCH] libibmad: Fixes for failures when not all ports of HCA are connected]

2013-03-14 Thread brendan.do...@oracle.com

On 03/15/13 01:23 AM, brendan.do...@oracle.com wrote:


So, here is the history...

There is an oracle application used to monitor the health of a node
by calling ib_resolve_portid_str_via(node GUID).

We observed that a call to ib_resolve_portid_str_via() which specified
the use of an unconnected port to issue the query was returning -1
with errno as 0. The application interpreted this as the query succeeded
but no paths to the specified node were found and assumed the node
was dead. This obviously was not the case, we could not send the query
because the port was down which is very different to the queried
node being down. But the API does not provide a means to distinguish
between these two failure modes. Hence the reason for updating
errno to allow these two failure modes to be distinguished.

Now in terms of the detail of the failure and the reason for the other
changes. The call to ib_resolve_portid_str_via() did not specify an smid,
so ib_resolve_portid_str_via() calls ib_resolve_guid_via() which first
tries to resolve the SM LID using the unconnected port:

if (!sm_id) {
sm_id = &sm_portid;
if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0)
return -1;
}

The call to this succeeds (return value is not -1) and indicates that the
SM LID is 0, which is of course wrong. Even though  port 2 is not
connected, it's SMA is still operational, It receives the get PORT_INFO
SMP (sent as a result of the ib_resolve_smlid_via()) , and returns it
successfully as the value it reads from the adapters PORT_INFO
which happens to be 0. ib_resolve_smlid_via() assumes everything is OK,
it does not bother to check the value of the SM LID returned. We then
try and contact the SM at LID 0!

if ((portid->lid =
 ib_path_query_via(srcport, selfgid, portid->gid, sm_id, 
buf)) < 0)

return -1;

And this obviously fails, and so ib_resolve_portid_str_via() returns -1
errno 0, and the app thinks the node is down. You could argue that
we should not specify an unconnected port in the call to 
ib_resolve_portid_str_via(),

but the changes harden the code, and allow us to distinguish between
no path to the node or could not issue the SA query.

Rdgs

Brendan




On 03/13/13 08:29 PM, Boris Chiu wrote:

fyi,

Boris


 Original Message 
Subject: 	Re: [PATCH] libibmad: Fixes for failures when not all ports 
of HCA are connected

Date:   Wed, 13 Mar 2013 12:36:05 -0700
From:   Ira Weiny 
To: Boris Chiu 
CC: linux-rdma@vger.kernel.org
References: <513f83fd.1090...@oracle.com>



Your commit message does not seem to have anything to do with the
patch.  Could you explain how returning errno from these functions
"Fixes for failures when not all ports of HCA are connected"?

Furthermore, I'm reluctant to modify errno in this library.  It is not
documented and in general is poor form.  I realize that the interface
does not currently allow for an alternative.  :-(

More comments below.

On Tue, Mar 12, 2013 at 12:37 PM, Boris Chiu  wrote:
>  From: Brendan Doyle
>
>  Signed-off-by: Brendan Doyle
>  ---
>  src/resolve.c |   22 ++
>  src/rpc.c |1 +
>  src/sa.c  |2 ++
>  3 files changed, 21 insertions(+), 4 deletions(-)
>
>  diff --git a/src/resolve.c b/src/resolve.c
>  index f866bf4..ab24c79 100644
>  --- a/src/resolve.c
>  +++ b/src/resolve.c
>  @@ -40,6 +40,7 @@
>  #include
>  #include
>  #include
>  +#include
>
>  #include
>  #include
>  @@ -57,10 +58,18 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int
>  timeout,
>
>  memset(sm_id, 0, sizeof(*sm_id));
>
>  -   if (!smp_query_via(portinfo,&self, IB_ATTR_PORT_INFO, 0, 0,
>  srcport))
>  +   if (!smp_query_via(portinfo,&self, IB_ATTR_PORT_INFO, 0, 0,
>  srcport)) {
>  +   if (!errno)
>  +   errno = EIO;
>  return -1;
>  +   }
>
>  mad_decode_field(portinfo, IB_PORT_SMLID_F,&lid);
>  +   if (lid == 0) {
>  +   if (!errno)
>  +   errno = EIO;
>  +   return -1;
>  +   }

This may not be an error.  A port which is down only requires
PortState and PortPhyState to be valid.

>  mad_decode_field(portinfo, IB_PORT_SMSL_F,&sm_id->sl);
>
>  return ib_portid_set(sm_id, lid, 0, 0);
>  @@ -95,21 +104,26 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t
>  * guid,
>  ib_portid_t * sm_id, int timeout,
>  const struct ibmad_port *srcport)
>  {
>  -   ib_portid_t sm_portid;
>  +   ib_portid_t sm_portid = { 0 };
>  uint8_t buf[IB_SA_DATA_SIZE] = { 0 };
>  ib_portid_t self = { 0 };
>  uint64_t self

Re: [Fwd: Re: [PATCH] libibmad: Fixes for failures when not all ports of HCA are connected]

2013-03-26 Thread brendan.do...@oracle.com
OK Got home safely, thanks, made some changes, I've ask Boris to pick 
these up,

rebuild the patch, test and resubmit, so expect something later in the week.


On 03/21/13 11:06 PM, Weiny, Ira wrote:

-Original Message-
From: brendan doyle [mailto:brendan.do...@oracle.com]


On 21/03/2013 22:53, Weiny, Ira wrote:

-Original Message-
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]

On Thu, Mar 21, 2013 at 10:46:06PM +, Weiny, Ira wrote:


have to follow different idioms.  However having 2 or 3 in the same
library is even worse!!!

Agreed, I think you made a compelling case that the new POSIX method
is not appropriate for this library, so standardizing on the old
method is the least bad option..


Wait what did I do?

You think using errno is the "right" thing to do now?

I agree that it is the least bad option, I don't want to fix the int return 
APIs, as
I think that will break infiniband-diags and who knows what else, and as you
say we then end up with two or 3 idioms within the same library. I think it's
too much of a stretch, at least for me to go redo the whole lib, so the best we
can do is at least be consistent.

I'm going to be traveling shortly I'll take another look at the patch next week.


Sounds good, travel safe,
Ira


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html