On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
> 
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
> 
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --------------------------------------------------------------------------
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
> 
> 
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>    if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
> KNEM_CMD_INLINE_COPY, &icopy))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
> 
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
> knem_cr.protection = PROT_READ;
> 
> a trivial workaround (probably not good for production) is to
> knem_cr.protection = PROT_READ|PROT_WRITE;

I should have commented on this earlier. I think this is the appropriate
fix for 1.8. There is no way with the old btl interface to register a
region for both remote read and remote write. The openib btl gets around
this by always registering for read and write.
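
To illustrate why a region prepared with PROT_READ alone rejects a put,
here is a toy sketch. It is not the knem or vader code: the toy_* names
are made up for this example, and the check only assumes that the
requested access must be a subset of the bits granted at registration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE */

/* Hypothetical stand-in for a registered region (not a knem structure). */
struct toy_region {
    void  *base;
    size_t len;
    int    protection;  /* PROT_* bits granted at registration time */
};

/* Record the buffer together with the protection bits it was registered with. */
static struct toy_region toy_register(void *base, size_t len, int protection)
{
    struct toy_region r = { base, len, protection };
    return r;
}

/* Return 0 if every bit in 'needed' was granted, -EACCES otherwise. */
static int toy_check_access(const struct toy_region *r, int needed)
{
    return ((r->protection & needed) == needed) ? 0 : -EACCES;
}
```

With this model, a region registered with PROT_READ fails a write check
with -EACCES, while one registered with PROT_READ|PROT_WRITE passes both,
which is the effect of always registering for read and write.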

The new btl interface does not have this problem :). Permissions are
specified when the region is registered, though I still need to modify
the mpool code so that the flags get passed down to the mpool if the btl
chooses to use one.

-Nathan

