A few observations.
1. The smcuda btl is only built when --with-cuda is part of the configure line,
so folks who do not configure that way will not even have this btl and will
never run into this issue.
2. The priority of the smcuda btl has been higher than that of the sm btl
since Open MPI 1.7.5 (March 2014). The idea is that if someone configured in
CUDA-aware support, they should not have to explicitly adjust the priority of
the smcuda btl to get it selected.
3. This issue popped up because I made a change in the smcuda btl between 1.8.4
and 1.8.5: btl_smcuda_max_send_size was bumped from 32 KB to 128 KB. This had
a positive effect when sending and receiving GPU buffers. I knew it would
somewhat negatively affect host-memory transfers, but figured that was a fair
tradeoff. Based on this report, that may not have been the right decision. If
one runs Open MPI 1.8.5 with --mca btl_smcuda_max_send_size 32768, one sees
the same performance as 1.8.4, similar to what one gets with the sm btl.
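The workaround above can be applied on the command line or through the
environment (Open MPI reads any OMPI_MCA_-prefixed variable as an MCA
parameter); the application name below is just a placeholder:

```shell
# On the mpirun command line (placeholder application ./a.out):
#
#   mpirun --mca btl_smcuda_max_send_size 32768 -np 2 ./a.out
#
# Or via the environment, which mpirun picks up automatically:
export OMPI_MCA_btl_smcuda_max_send_size=32768
echo "$OMPI_MCA_btl_smcuda_max_send_size"
```

Either form restores the 1.8.4-era 32 KB send size for host-memory
transfers.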
Interesting idea to disqualify this BTL if there are no GPUs on the machine.
Aurelien, would that seem like a good solution?
Rolf
PS: Unfortunately, the max_send_size value is used for both GPU and CPU
transfers, and the optimal value for each is different.
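Until such a disqualification exists, users on GPU-less nodes can also
exclude the smcuda btl outright so that vader or sm is selected instead;
this sketch uses Open MPI's standard ^-exclusion syntax for MCA component
lists, with a placeholder application name:

```shell
# On the mpirun command line (placeholder application ./a.out):
#
#   mpirun --mca btl ^smcuda -np 2 ./a.out
#
# Or via the environment:
export OMPI_MCA_btl='^smcuda'
echo "$OMPI_MCA_btl"
```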
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph
>Castain
>Sent: Wednesday, May 20, 2015 3:25 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] smcuda higher exclusivity than anything else?
>
>Rolf - this doesn’t sound right to me. I assume that smcuda is only supposed
>to build if cuda support was found/requested, but if there are no cuda
>adapters, then I would have thought it should disqualify itself.
>
>Can we do something about this for 1.8.6?
>
>> On May 20, 2015, at 11:14 AM, Aurélien Bouteiller <boute...@icl.utk.edu>
>wrote:
>>
>> I was making basic performance measurements on our machine after
>installing 1.8.5, and the performance was looking bad. It turns out that the
>smcuda btl has a higher exclusivity than both vader and sm, even on machines
>with no nvidia adapters. Is there a strong reason why the default exclusivity
>is set so high? Of course it can be easily fixed with a couple of mca options,
>but unsuspecting users that “just run” will experience 1/3 overhead across the
>board for shared memory communication, according to my measurements.
>>
>>
>> Side note: from my understanding of the smcuda component, performance
>> should be identical to the regular sm component (as long as no GPU
>operations are required). This is not the case; there is some performance
>penalty with smcuda compared to sm.
>>
>> Aurelien
>>
>> --
>> Aurélien Bouteiller ~~ https://icl.cs.utk.edu/~bouteill/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/05/17435.php
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: http://www.open-mpi.org/community/lists/devel/2015/05/17436.php