On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
>
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
>
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --------------------------------------------------------------------------
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
>
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
> if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
> KNEM_CMD_INLINE_COPY, &icopy))) {
> return OPAL_ERROR;
> }
> ioctl fails with EACCES
>
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
> knem_cr.protection = PROT_READ;
>
> a trivial workaround (probably not good for production) is to
> knem_cr.protection = PROT_READ|PROT_WRITE;
>
>
> then we run into the second issue :
>
> in mca_btl_vader_put_knem :
> icopy.remote_offset = 0;

I am fixing this one now. Vader should have been calculating the offset
(as it does in the new btl interface). I never test shmem, so I didn't
catch this one before.

-Nathan
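
For illustration only, the kind of offset calculation referred to above
might look roughly like the sketch below. This is not the vader or knem
code: struct region and region_offset() are hypothetical names, and the
sketch only assumes that the offset is the distance of the target address
from the start of the registered region, with a bounds check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registered memory region.
 * This is NOT a knem or vader type. */
struct region {
    uintptr_t base;  /* address where the registered region starts */
    size_t    len;   /* length of the registered region in bytes */
};

/* Compute the offset of [addr, addr + size) inside the region.
 * Returns 0 and stores the offset on success, or -1 if the range
 * does not lie entirely within the region. */
static int region_offset(const struct region *r, uintptr_t addr,
                         size_t size, uint64_t *offset)
{
    if (addr < r->base) {
        return -1;                      /* starts before the region */
    }
    uintptr_t off = addr - r->base;
    if (off > r->len || size > r->len - off) {
        return -1;                      /* runs past the end of the region */
    }
    *offset = (uint64_t) off;
    return 0;
}
```

With a region based at 0x1000, a target address of 0x1100 gives an offset
of 256, rather than the hard-coded 0 quoted above.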
