Hi Jeff,

Why do we want to set this value so low? Well, just to see if it crashes :-)

More seriously, we're working on lowering the memory usage of the openib BTL, and the lowest footprint is reached with only 1 send queue element (at very large scale, send queues dominate the memory usage).

This "extreme" configuration used to work with the 1.3/1.4 branches but failed on 1.5.

Note that since recent IB cards have very high issue rates, I don't know whether we often end up waiting for the send queue to be empty. More importantly, this setting often prevents the remote receive queue from being filled too quickly (which avoids RNR nacks, threads refilling the SRQ, ...). We didn't notice major performance drops with this configuration.

Sylvain

On Tue, 22 Jun 2010, Jeff Squyres wrote:

I think your fix looks right.

But I'm getting my head warped trying to understand why you'd want numbers so low (4, 2, 1) and exactly what our algorithm will re-post for numbers that low, etc. Why do you want them so low?


On Jun 18, 2010, at 11:10 AM, nadia.derbey wrote:

Hi,

The reference is the v1.5 branch.

If an SRQ has the following settings: S,<size>,4,2,1

1) setup_qps() sets the following:
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1

2) create_srq() sets the following:
openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - (rd_curr_num >> 2) = rd_curr_num = 1

3) if mca_btl_openib_post_srr() is called with rd_posted=1:
rd_posted > rd_low_local is false
num_post=rd_curr_num-rd_posted=0
the loop is not executed
wr is never initialized (remains NULL)
wr->next: address not mapped
         ==> SIGSEGV

The attached patch solves the problem by checking that we will actually
enter the loop, and returning otherwise.
Could someone please have a look? The patch solves the problem with my
reproducer, but I'm not sure the fix covers all situations.

Regards,
Nadia

<001_openib_low_rd_num.patch>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

