On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
> 
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
> 
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --------------------------------------------------------------------------
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
> 
> 
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>     if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
>                                   KNEM_CMD_INLINE_COPY, &icopy))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
> 
> the root cause is that the symmetric memory was "prepared" with
> vader_prepare_src, which registers the region read-only:
>     knem_cr.protection = PROT_READ;
> 
> a trivial workaround (probably not suitable for production) is:
>     knem_cr.protection = PROT_READ|PROT_WRITE;
> 
> 
> then we run into the second issue :
> 
> in mca_btl_vader_put_knem :
>     icopy.remote_offset     = 0;


I am fixing this one now. Vader should have been calculating the offset
(as it does in the new btl interface). I never test shmem so I didn't
catch this one before.
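
For illustration only, here is a minimal sketch of the kind of offset
calculation meant above. It is not the actual vader code: the struct and
function names (remote_region, region_offset) are hypothetical, and it
assumes the offset is simply the target address minus the base address of
the registered region, with a bounds check on the requested size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a registered remote region.
 * This is NOT a real knem or vader type; it only illustrates the idea. */
struct remote_region {
    uint64_t base;   /* address at which the registered region starts */
    uint64_t length; /* size of the registered region in bytes */
};

/* Compute the byte offset of [remote_addr, remote_addr + size) within the
 * region. Returns 0 and stores the offset on success, or -1 if the range
 * does not lie entirely inside the region. */
static int region_offset(const struct remote_region *region,
                         uint64_t remote_addr, uint64_t size,
                         uint64_t *offset)
{
    assert(region != NULL && offset != NULL);

    if (remote_addr < region->base) {
        return -1; /* target starts before the region */
    }

    uint64_t off = remote_addr - region->base;

    /* Written this way to avoid overflow in off + size. */
    if (off > region->length || size > region->length - off) {
        return -1; /* target runs past the end of the region */
    }

    *offset = off;
    return 0;
}
```

With a region registered at base 0x1000 and length 0x100, a put targeting
address 0x1040 would use offset 0x40 rather than a hard-coded 0.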

-Nathan
