Thanks for the analysis!

We've argued about mca_bml_r2_add_btls() before -- IIRC, the consensus is that we want it to be able to continue even if a BTL fails. So I *think* that your #1 answer is better.

However, we might want to try a little harder if EINVAL is returned -- perhaps decrease the number of CQ entries and try again until either the resize succeeds or we have too few CQ entries to be useful (e.g., 0 or some higher number that is still "too small"), at which point we fail the BTL altogether...?
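
A rough sketch of that retry idea, not the actual openib BTL code. To keep it compilable without libibverbs, the resize call is passed in as a function pointer that follows the ibv_resize_cq() convention (0 on success, an errno value on failure). The names resize_fn, resize_cq_with_backoff, fake_resize, the halving policy, and the 4096-entry limit are all assumptions made up for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for ibv_resize_cq(cq, cqe):
 * returns 0 on success or an errno value on failure. */
typedef int (*resize_fn)(void *cq, int cqe);

/* Ask for `wanted` CQ entries. On EINVAL, halve the request and retry
 * until the resize succeeds or the request drops below `min_useful`.
 * Returns the number of entries granted, or -1 if the caller should
 * give up on this BTL. Halving is an assumed policy, not a tested one. */
int resize_cq_with_backoff(resize_fn resize, void *cq,
                           int wanted, int min_useful)
{
    int cqe = wanted;

    while (cqe > 0 && cqe >= min_useful) {
        int rc = resize(cq, cqe);
        if (rc == 0) {
            return cqe;        /* resize accepted at this size */
        }
        if (rc != EINVAL) {
            return -1;         /* some other failure: do not retry */
        }
        cqe /= 2;              /* assume "too many entries": ask for fewer */
    }
    return -1;                 /* too few entries left to be useful */
}

/* Fake device that rejects anything above 4096 entries (made-up limit). */
int fake_resize(void *cq, int cqe)
{
    (void)cq;
    return (cqe > 4096) ? EINVAL : 0;
}
```

With the fake device, a request for 10000 entries and a floor of 256 would be tried at 10000, 5000, and then 2500, where it succeeds. With a floor of 3000 the same request gives up, which is where the BTL would be failed.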

On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:

Hi,

Yesterday I had to analyze a SIGSEGV occurring after the following
message had been output:
[.... adjust_cq] cannot resize completion queue, error: 22


What I found is the following:

When ibv_resize_cq() fails to resize a CQ (in my case it returned
EINVAL), adjust_cq() returns an error and create_srq() is not called by
mca_btl_openib_size_queues().

Note: One of our InfiniBand specialists told me that EINVAL was returned
in that case because we were asking for more CQ entries than the max
available.

mca_bml_r2_add_btls() goes on executing.

Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
(remember that create_srq() has not been previously called).

Since all the QPs have been successfully created, qp_create_all() then
calls:
mca_btl_openib_endpoint_post_recvs()
  --> mca_btl_openib_post_srr()
      --> ibv_post_srq_recv() on a NULL SRQ
==> SIGSEGV
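
The failure at the end of the chain above can be illustrated with a small defensive check. This is only a sketch under assumptions: struct sketch_qp and sketch_post_srr() are hypothetical stand-ins, not the real openib BTL structures or mca_btl_openib_post_srr(), and the "post" is reduced to a counter so the example compiles without libibverbs.

```c
#include <stddef.h>

/* Hypothetical, simplified per-QP state: `srq` stays NULL when the
 * SRQ creation step was skipped, as in the scenario described above. */
struct sketch_qp {
    void *srq;     /* shared receive queue handle, or NULL */
    int   posted;  /* number of receive buffers posted so far */
};

/* Post `num` receives to the SRQ. Returns 0 on success, or -1 when no
 * SRQ exists, instead of handing a NULL SRQ to the verbs layer. */
int sketch_post_srr(struct sketch_qp *qp, int num)
{
    if (qp == NULL || qp->srq == NULL) {
        return -1;             /* SRQ was never created: refuse to post */
    }
    qp->posted += num;         /* stand-in for the real posting call */
    return 0;
}
```

A guard like this would turn the crash into an error return, but it only hides the symptom; the skipped SRQ creation still has to be handled by one of the two options below.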


If I'm not wrong in the analysis above, we have the choice between 2
solutions to fix this problem:

1. If EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
as the ENOSYS case: do not return an error, since the CQ has been
successfully created, maybe with fewer entries than needed, but it
is there.

Doing this we assume that EINVAL will always be the symptom of a "too
many entries asked for" error from the IB stack. I don't have the
answer...
+ I don't know whether this implies a degraded mode in terms of
performance.

2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
btl_add_procs().
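
Solution 1 above can be sketched as a small classification of the ibv_resize_cq() return value. This is an illustration only, not the actual adjust_cq() code: the function name classify_resize_result and the SKETCH_* return codes are hypothetical, and treating EINVAL as "too many entries asked for" is exactly the assumption discussed above.

```c
#include <errno.h>

/* Hypothetical return codes for this sketch only. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Decide whether a resize result should fail the caller.
 * `rc` follows the ibv_resize_cq() convention: 0 on success,
 * an errno value on failure. */
int classify_resize_result(int rc)
{
    switch (rc) {
    case 0:
        return SKETCH_SUCCESS;  /* CQ resized as requested */
    case ENOSYS:                /* resize not supported: keep the CQ as is */
    case EINVAL:                /* ASSUMPTION: more entries requested than
                                 * the device allows; CQ still exists */
        return SKETCH_SUCCESS;
    default:
        return SKETCH_ERROR;    /* anything else is a real failure */
    }
}
```

With this mapping the caller would carry on to the SRQ creation step after an EINVAL, using the CQ at its original size.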

FYI I tested solution #1 and it worked...

Any suggestion or comment would be welcome.

Regards,
Nadia

--
Nadia Derbey <nadia.der...@bull.net>

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
jsquy...@cisco.com
